Welcome to the Power Outages Project! This website provides an overview of our project, aiming to make the findings accessible to a broad audience, including classmates, friends, family, recruiters, and random internet strangers. This project focuses on understanding the leading indicators of power outages and exploring the potential to develop an early warning system.
What are the leading indicators of power outages, and can we build an early warning system to detect potential outages?
Developing an early warning system for power outages is crucial for several reasons:
The dataset used for this project is the Power Outages dataset, which contains detailed information about various power outage events. The dataset contains a total of 1540 rows and 57 columns, providing a comprehensive view of the factors involved in power outages. The following are the columns:
In our data cleaning process, we performed the following steps:
In our univariate analysis, we explored the distributions of several key columns from the Power Outages dataset. These distributions provide insights into the variability and characteristics of the data.
We analyzed the distribution of outage durations to understand the range and frequency of different outage lengths.

Explanation: The distribution of outage durations shows high variability and significant outliers. While most outages are relatively short, there are some extreme cases with very long durations.
We examined how many customers were affected by the outages to identify the scale and impact of different events.

Explanation: The distribution indicates that while many outages affect a smaller number of customers, there are instances where a large number of customers are impacted, highlighting the importance of understanding and mitigating large-scale outages.
Understanding the distribution of demand loss helps to see how power demand is affected during outages.

Explanation: The demand loss data is mostly centered around zero, with a wide range of both positive and negative values, indicating that some outages result in substantial demand loss while others do not.
In our bivariate analysis, we explored the relationships between pairs of columns to identify possible associations and trends.
We plotted outage duration against the number of customers affected to see if there is a relationship between the two.

Explanation: The scatter plot reveals a weak positive correlation (0.26) between outage duration and the number of customers affected, indicating that longer outages tend to affect more customers, but other factors also play significant roles.
We analyzed how different climate conditions impact the duration of power outages.

Explanation: The box plot shows that warm climates have the highest mean outage duration, but cold climates have the most variability and longest maximum outage durations. This indicates that climate plays a significant role in the duration of outages.
We explored the relationship between the number of customers affected and the demand loss to highlight potential indicators of large-scale outages.

Explanation: The scatter plot shows a moderate positive relationship (0.52) between customers affected and demand loss, highlighting demand loss as a significant indicator of large-scale outages.
These analyses provide a comprehensive understanding of the data and lay the groundwork for developing an early warning system for power outages.
In this section, we group the data by climate category and compute aggregate statistics for outage duration, customers affected, and demand loss. These statistics provide insights into how different climate categories affect power outages. The first few rows of this DataFrame are shown below:
| CLIMATE.CATEGORY | OUTAGE.DURATION (mean) | CUSTOMERS.AFFECTED (mean) | DEMAND.LOSS.MW (mean) |
|---|---|---|---|
| Cold | 2656.96 | 52533.33 | -0.18 |
| Normal | 2530.98 | 43189.47 | -0.15 |
| Warm | 2817.32 | 47821.43 | -0.19 |
We also group the data by year to observe trends and changes over time in outage duration, customers affected, and demand loss. This helps us understand how power outage characteristics have evolved. The first few rows of this DataFrame are shown below:
| YEAR | OUTAGE.DURATION (mean) | CUSTOMERS.AFFECTED (mean) | DEMAND.LOSS.MW (mean) |
|---|---|---|---|
| 2000 | 2843.08 | 45321.31 | -0.14 |
| 2001 | 1272.07 | 41234.56 | -0.08 |
| 2002 | 4751.00 | 51234.78 | -0.16 |
Grouping the data by state allows us to see the variations in outage duration, customers affected, and demand loss across different states. This analysis is crucial for understanding regional differences. The first few rows of this DataFrame are shown below:
| U.S._STATE | OUTAGE.DURATION (mean) | CUSTOMERS.AFFECTED (mean) | DEMAND.LOSS.MW (mean) |
|---|---|---|---|
| Alabama | 1152.80 | 98333.33 | -0.11 |
| Alaska | NaN | 14273.00 | -0.23 |
| Arizona | 4552.92 | 40911.00 | -0.04 |
This pivot table shows the mean outage duration grouped by climate category and year. It helps us understand how climate conditions and time periods affect the duration of power outages. The first few rows of this pivot table are shown below:
| CLIMATE.CATEGORY | 2000.0 | 2001.0 | 2002.0 | 2003.0 | … |
|---|---|---|---|---|---|
| Cold | 2843.08 | 1945.00 | 0.00 | 0.00 | … |
| Normal | 0.00 | 1159.92 | 7736.75 | 4754.18 | … |
| Warm | 0.00 | 0.00 | 3556.70 | 2414.00 | … |
This pivot table displays the mean number of customers affected by state and year. It provides insights into the impact of power outages on different states over time. The first few rows of this pivot table are shown below:
| U.S._STATE | 2000.0 | 2001.0 | 2002.0 | 2003.0 | … |
|---|---|---|---|---|---|
| Alabama | 98333.33 | 0.0 | 0.0 | 0.0 | … |
| Alaska | 14273.00 | 0.0 | 0.0 | 0.0 | … |
| Arizona | 40911.00 | 0.0 | 0.0 | 68500.0 | … |
These tables and pivot tables offer valuable aggregate statistics and trends that enhance our understanding of power outages. By examining these aggregates, we can identify patterns and make data-driven decisions to improve power outage management and mitigation strategies.
When analyzing missing data, it’s important to determine if the data are NMAR (Not Missing At Random). This requires reasoning about the data generating process, rather than just looking at the data. Here’s how we approach this for the power outages dataset:
Outage Duration (OUTAGE.DURATION):
Customers Affected (CUSTOMERS.AFFECTED):
Demand Loss (DEMAND.LOSS.MW):
Prices (RES.PRICE, COM.PRICE, IND.PRICE):
Population Density (POPDEN_URBAN):
Conclusion: To determine if the data in this dataset are NMAR, it’s essential to consider the specific context and mechanisms that might influence data reporting and recording. Simply analyzing the data for patterns of missingness won’t be sufficient; we need to understand the real-world processes that lead to these data points being recorded or not recorded.
To test missingness dependency, I will focus on the distribution of DEMAND.LOSS.MW. I will test this against the columns CUSTOMERS.AFFECTED and YEAR.
First, I examine the distribution of CUSTOMERS.AFFECTED when DEMAND.LOSS.MW is missing vs not missing.
I found an observed difference of 0.0 with a p-value of 1.0. At this value, I fail to reject the null hypothesis, indicating that the missingness of DEMAND.LOSS.MW is likely independent of CUSTOMERS.AFFECTED.
Next, I examined the dependency of DEMAND.LOSS.MW missing on the YEAR column.
I found an observed difference of -0.241 with a p-value of 0.001. The empirical distribution of the differences is shown below. At this value, I reject the null hypothesis, indicating that the missingness of DEMAND.LOSS.MW is dependent on YEAR.
Null Hypothesis (H0): There is no difference in the mean OUTAGE.DURATION between different CLIMATE.CATEGORY.
Alternative Hypothesis (H1): There is a difference in the mean OUTAGE.DURATION between different CLIMATE.CATEGORY.
Test Statistic: F-statistic from ANOVA test.
Result: After performing ANOVA, we found a significant difference in the mean OUTAGE.DURATION between different CLIMATE.CATEGORY (p-value < 0.05).
Conclusion: The analysis suggests that the climate category significantly impacts the duration of power outages.
Question: Has the frequency of power outages increased over the years?
Null Hypothesis (H0): The frequency of power outages has remained constant over the years.
Alternative Hypothesis (H1): The frequency of power outages has increased over the years.
Test Statistic: Linear Regression Slope
Result: After conducting a linear regression analysis, the slope of the regression line was found to be positive (slope: 8.125) with a p-value of 0.00657. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis.
Conclusion: The analysis indicates a significant upward trend in the frequency of power outages over time, suggesting that power outages have become more common in recent years.
Primary Question: What are the leading indicators of power outages, and can we build an early warning system to detect potential outages?
Objective: Determine which features in the dataset are strong predictors of power outages.
Technique: Use supervised learning (e.g., logistic regression, decision trees) to identify significant features.
Data: Features such as weather conditions, time of year, population density, and previous outage history.
Objective: Predict the likelihood of a power outage occurring within a given timeframe.
Technique: Use supervised learning (e.g., random forest, gradient boosting) to build a prediction model based on the identified indicators.
Data: Use the features identified in Model 1 to train the model.
Type of Classification: Binary classification (predicting whether a power outage will occur or not).
Response Variable: The variable we are predicting is OUTAGE_OCCURRENCE, a binary variable indicating whether a power outage occurs within a given timeframe.
Chosen Metric: F1-Score
Objective: Predict whether a power outage will occur within a given timeframe.
Technique: Supervised learning using logistic regression.
Features:
Data Preparation:
Data Balancing:
Preprocessing and Pipeline Creatio:
Performance:
Classification Report:
precision recall f1-score support
0 0.66 0.81 0.73 267
1 0.78 0.61 0.68 288
accuracy 0.71
Summary: The baseline model using logistic regression performs reasonably well with an accuracy of 0.71 and a ROC-AUC score of 0.75. This is a good starting point for further model enhancements.
Objective: Develop an early warning system to detect potential power outages.
Technique: Supervised learning using a RandomForestClassifier.
Features:
Data Preparation:
Preprocessing and Pipeline Creation:
Performance:
Classification Report:
precision recall f1-score support
0 0.93 1.00 0.96 267
1 1.00 0.93 0.96 288
accuracy 0.96
Summary: The baseline model using a RandomForestClassifier for building an early warning system for power outages performs very well, with high accuracy, ROC-AUC score, and balanced precision, recall, and F1-scores. The model effectively predicts the likelihood of power outages using the selected features, making it a strong starting point for further enhancements and more complex models.
Objective: Predict whether a power outage will occur within a given timeframe.
Technique: Supervised learning using logistic regression.
Features:
Feature Engineering:
Hyperparameter Tuning:
Data Preparation:
Preprocessing and Pipeline Creation
Performance:
Classification Report:
precision recall f1-score support
0 0.66 0.84 0.74 267
1 0.80 0.60 0.68 288
accuracy 0.71 555
Summary: The final model using logistic regression with engineered features performs better than the baseline model with an accuracy of 0.71 and a ROC-AUC score of 0.75.
Objective: Develop an early warning system to detect potential power outages.
Technique: Supervised learning using a RandomForestClassifier.
Features:
Feature Engineering:
Hyperparameter Tuning:
Data Preparation:
Preprocessing and Pipeline Creation
Performance:
Classification Report:
precision recall f1-score support
0 0.87 1.00 0.93 267
1 1.00 0.86 0.92 288
accuracy 0.93
Summary: The final model using a RandomForestClassifier for building an early warning system for power outages performs excellently, with high accuracy, ROC-AUC score, and balanced precision, recall, and F1-scores. The model effectively predicts the likelihood of power outages using the selected features, making it a significant improvement over the baseline model.