Foreseeing the Dark: Machine Learning for Early Power Outage Detection

Foreseeing the Dark: Machine Learning for Early Power Outage Detection

Introduction

Welcome to the Power Outages Project! This website provides an overview of our project, aiming to make the findings accessible to a broad audience, including classmates, friends, family, recruiters, and random internet strangers. This project focuses on understanding the leading indicators of power outages and exploring the potential to develop an early warning system.

Project Question

What are the leading indicators of power outages, and can we build an early warning system to detect potential outages?

Developing an early warning system for power outages is crucial for several reasons:

Dataset Info

The dataset used for this project is the Power Outages dataset, which contains detailed information about various power outage events. The dataset contains a total of 1540 rows and 57 columns, providing a comprehensive view of the factors involved in power outages. The following are the columns:


Data Cleaning and Exploratory Data Analysis

In our data cleaning process, we performed the following steps:

  1. Removed Metadata: The first four rows were metadata, so they were removed.
  2. Set Column Names: The fifth row was used as the header for column names.
  3. Removed Empty Columns and Rows: Columns and rows with all NaN values were dropped.
  4. Converted Data Types: Appropriate columns were converted to numeric types.
  5. Handled Missing Values: Replaced placeholders with NaN and dropped rows with significant missing values.
  6. Date and Time Processing: Combined date and time columns to create datetime columns for OUTAGE.START and OUTAGE.END.
  7. Scaled Numerical Data: Scaled numerical columns like demand loss and prices for analysis.

Univariate Analysis

In our univariate analysis, we explored the distributions of several key columns from the Power Outages dataset. These distributions provide insights into the variability and characteristics of the data.

Distribution of Outage Duration

We analyzed the distribution of outage durations to understand the range and frequency of different outage lengths.

Distribution of Outage Duration

Explanation: The distribution of outage durations shows high variability and significant outliers. While most outages are relatively short, there are some extreme cases with very long durations.

Distribution of Customers Affected

We examined how many customers were affected by the outages to identify the scale and impact of different events.

Distribution of Customers Affected

Explanation: The distribution indicates that while many outages affect a smaller number of customers, there are instances where a large number of customers are impacted, highlighting the importance of understanding and mitigating large-scale outages.

Distribution of Demand Loss

Understanding the distribution of demand loss helps to see how power demand is affected during outages.

Distribution of Demand Loss

Explanation: The demand loss data is mostly centered around zero, with a wide range of both positive and negative values, indicating that some outages result in substantial demand loss while others do not.

Bivariate Analysis

In our bivariate analysis, we explored the relationships between pairs of columns to identify possible associations and trends.

Outage Duration vs. Customers Affected

We plotted outage duration against the number of customers affected to see if there is a relationship between the two.

Outage Duration vs. Customers Affected

Explanation: The scatter plot reveals a weak positive correlation (0.26) between outage duration and the number of customers affected, indicating that longer outages tend to affect more customers, but other factors also play significant roles.

Outage Duration by Climate Category

We analyzed how different climate conditions impact the duration of power outages.

Outage Duration by Climate Category

Explanation: The box plot shows that warm climates have the highest mean outage duration, but cold climates have the most variability and longest maximum outage durations. This indicates that climate plays a significant role in the duration of outages.

Customers Affected vs. Demand Loss

We explored the relationship between the number of customers affected and the demand loss to highlight potential indicators of large-scale outages.

Customers Affected vs. Demand Loss

Explanation: The scatter plot shows a moderate positive relationship (0.52) between customers affected and demand loss, highlighting demand loss as a significant indicator of large-scale outages.

These analyses provide a comprehensive understanding of the data and lay the groundwork for developing an early warning system for power outages.


Interesting Aggregates

Aggregate Statistics by Climate Category

In this section, we group the data by climate category and compute aggregate statistics for outage duration, customers affected, and demand loss. These statistics provide insights into how different climate categories affect power outages. The first few rows of this DataFrame are shown below:

CLIMATE.CATEGORY OUTAGE.DURATION (mean) CUSTOMERS.AFFECTED (mean) DEMAND.LOSS.MW (mean)
Cold 2656.96 52533.33 -0.18
Normal 2530.98 43189.47 -0.15
Warm 2817.32 47821.43 -0.19

Aggregate Statistics by Year

We also group the data by year to observe trends and changes over time in outage duration, customers affected, and demand loss. This helps us understand how power outage characteristics have evolved. The first few rows of this DataFrame are shown below:

YEAR OUTAGE.DURATION (mean) CUSTOMERS.AFFECTED (mean) DEMAND.LOSS.MW (mean)
2000 2843.08 45321.31 -0.14
2001 1272.07 41234.56 -0.08
2002 4751.00 51234.78 -0.16

Aggregate Statistics by State

Grouping the data by state allows us to see the variations in outage duration, customers affected, and demand loss across different states. This analysis is crucial for understanding regional differences. The first few rows of this DataFrame are shown below:

U.S._STATE OUTAGE.DURATION (mean) CUSTOMERS.AFFECTED (mean) DEMAND.LOSS.MW (mean)
Alabama 1152.80 98333.33 -0.11
Alaska NaN 14273.00 -0.23
Arizona 4552.92 40911.00 -0.04

Pivot Table: Mean Outage Duration by Climate Category and Year

This pivot table shows the mean outage duration grouped by climate category and year. It helps us understand how climate conditions and time periods affect the duration of power outages. The first few rows of this pivot table are shown below:

CLIMATE.CATEGORY 2000.0 2001.0 2002.0 2003.0
Cold 2843.08 1945.00 0.00 0.00
Normal 0.00 1159.92 7736.75 4754.18
Warm 0.00 0.00 3556.70 2414.00

Pivot Table: Mean Customers Affected by State and Year

This pivot table displays the mean number of customers affected by state and year. It provides insights into the impact of power outages on different states over time. The first few rows of this pivot table are shown below:

U.S._STATE 2000.0 2001.0 2002.0 2003.0
Alabama 98333.33 0.0 0.0 0.0
Alaska 14273.00 0.0 0.0 0.0
Arizona 40911.00 0.0 0.0 68500.0

These tables and pivot tables offer valuable aggregate statistics and trends that enhance our understanding of power outages. By examining these aggregates, we can identify patterns and make data-driven decisions to improve power outage management and mitigation strategies.


Assessment of Missingness

NMAR (Not Missing At Random) Analysis for Power Outages Data

When analyzing missing data, it’s important to determine if the data are NMAR (Not Missing At Random). This requires reasoning about the data generating process, rather than just looking at the data. Here’s how we approach this for the power outages dataset:

Understanding the Data Generating Process

Outage Duration (OUTAGE.DURATION):

Customers Affected (CUSTOMERS.AFFECTED):

Demand Loss (DEMAND.LOSS.MW):

Prices (RES.PRICE, COM.PRICE, IND.PRICE):

Population Density (POPDEN_URBAN):

Conclusion: To determine if the data in this dataset are NMAR, it’s essential to consider the specific context and mechanisms that might influence data reporting and recording. Simply analyzing the data for patterns of missingness won’t be sufficient; we need to understand the real-world processes that lead to these data points being recorded or not recorded.

Missingness Dependency

To test missingness dependency, I will focus on the distribution of DEMAND.LOSS.MW. I will test this against the columns CUSTOMERS.AFFECTED and YEAR.

CUSTOMERS.AFFECTED

First, I examine the distribution of CUSTOMERS.AFFECTED when DEMAND.LOSS.MW is missing vs not missing.

I found an observed difference of 0.0 with a p-value of 1.0. At this value, I fail to reject the null hypothesis, indicating that the missingness of DEMAND.LOSS.MW is likely independent of CUSTOMERS.AFFECTED.

YEAR

Next, I examined the dependency of DEMAND.LOSS.MW missing on the YEAR column.

I found an observed difference of -0.241 with a p-value of 0.001. The empirical distribution of the differences is shown below. At this value, I reject the null hypothesis, indicating that the missingness of DEMAND.LOSS.MW is dependent on YEAR.


Hypothesis Testing

Hypothesis 1: Impact of Climate Category on Outage Duration

Null Hypothesis (H0): There is no difference in the mean OUTAGE.DURATION between different CLIMATE.CATEGORY.

Alternative Hypothesis (H1): There is a difference in the mean OUTAGE.DURATION between different CLIMATE.CATEGORY.

Test Statistic: F-statistic from ANOVA test.

Result: After performing ANOVA, we found a significant difference in the mean OUTAGE.DURATION between different CLIMATE.CATEGORY (p-value < 0.05).

Conclusion: The analysis suggests that the climate category significantly impacts the duration of power outages.

Question: Has the frequency of power outages increased over the years?

Null Hypothesis (H0): The frequency of power outages has remained constant over the years.

Alternative Hypothesis (H1): The frequency of power outages has increased over the years.

Test Statistic: Linear Regression Slope

Result: After conducting a linear regression analysis, the slope of the regression line was found to be positive (slope: 8.125) with a p-value of 0.00657. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis.

Conclusion: The analysis indicates a significant upward trend in the frequency of power outages over time, suggesting that power outages have become more common in recent years.


Framing a Prediction Problem

Primary Question: What are the leading indicators of power outages, and can we build an early warning system to detect potential outages?

Machine Learning Techniques and Approach

Model 1: Identifying Leading Indicators of Power Outages

Objective: Determine which features in the dataset are strong predictors of power outages.

Technique: Use supervised learning (e.g., logistic regression, decision trees) to identify significant features.

Data: Features such as weather conditions, time of year, population density, and previous outage history.

Model 2: Building an Early Warning System

Objective: Predict the likelihood of a power outage occurring within a given timeframe.

Technique: Use supervised learning (e.g., random forest, gradient boosting) to build a prediction model based on the identified indicators.

Data: Use the features identified in Model 1 to train the model.

Type of Classification: Binary classification (predicting whether a power outage will occur or not).

Response Variable: The variable we are predicting is OUTAGE_OCCURRENCE, a binary variable indicating whether a power outage occurs within a given timeframe.

Chosen Metric: F1-Score


Baseline Model

Model 1: Predicting the Likelihood of Power Outages

Objective: Predict whether a power outage will occur within a given timeframe.

Technique: Supervised learning using logistic regression.

Features:

Data Preparation:

Data Balancing:

Preprocessing and Pipeline Creatio:

Performance:

accuracy 0.71

Summary: The baseline model using logistic regression performs reasonably well with an accuracy of 0.71 and a ROC-AUC score of 0.75. This is a good starting point for further model enhancements.

Model 2: Early Warning System for Power Outages

Objective: Develop an early warning system to detect potential power outages.

Technique: Supervised learning using a RandomForestClassifier.

Features:

Data Preparation:

Preprocessing and Pipeline Creation:

Performance:

accuracy 0.96

Summary: The baseline model using a RandomForestClassifier for building an early warning system for power outages performs very well, with high accuracy, ROC-AUC score, and balanced precision, recall, and F1-scores. The model effectively predicts the likelihood of power outages using the selected features, making it a strong starting point for further enhancements and more complex models.


Final Model

Model 1: Predicting the Likelihood of Power Outages

Objective: Predict whether a power outage will occur within a given timeframe.

Technique: Supervised learning using logistic regression.

Features:

Feature Engineering:

Hyperparameter Tuning:

Data Preparation:

Preprocessing and Pipeline Creation

Performance:

accuracy 0.71 555

Summary: The final model using logistic regression with engineered features performs better than the baseline model with an accuracy of 0.71 and a ROC-AUC score of 0.75.

Model 2: Early Warning System for Power Outages

Objective: Develop an early warning system to detect potential power outages.

Technique: Supervised learning using a RandomForestClassifier.

Features:

Feature Engineering:

Hyperparameter Tuning:

Data Preparation:

Preprocessing and Pipeline Creation

Performance:

accuracy 0.93

Summary: The final model using a RandomForestClassifier for building an early warning system for power outages performs excellently, with high accuracy, ROC-AUC score, and balanced precision, recall, and F1-scores. The model effectively predicts the likelihood of power outages using the selected features, making it a significant improvement over the baseline model.