Predicting with Power Outage Dataset

by Hae In Lee

What are we predicting? We can essentially predict anything we want to know about this dataset !

Before we dive into predicting anything, let’s try to familiarize ourselves with the dataset using the following question:

Which cause category is most responsible for power outages in different states?

Introduction

What is this dataset about?

This dataset comes from Purdue University’s LASCI Research Data on Power Outages in the US, by state, from January 2000 to July 2016. It provides information on the geographical locations of these events, regional climate data, land-use patterns, electricity consumption trends, and the economic characteristics of the states impacted by these outages.

Why is this dataset even worth looking at? And how does it help answer our question?

The dataset is important for identifying trends in power outages and could help improve strategies for preventing and mitigating future disruptions.

This question helps understand the causes of power outages can help states and utility companies make effective choices when power outages occur. They can focus on improving infrastructure in areas most affected by specific causes like strengthening power grids in regions prone to weather-related outages. They can create a comprehensive emergency response planning, since different situations may require different strategies. They can also plan resource allocation by helping states identify which outages are likely to have the most significant impact on the population.

General overview of the dataset

In total, there are 1534 rows and 55 columns (variables). For our analysis purposes, we will only be looking at a couple of the variables

This chart shows the columns for analysis

Columns from Power Outage Dataset	Description
YEAR	Indicates the year when the outage event occurred
U.S._STATE	Represents all the states in the continental U.S.
CAUSE.CATEGORY	Categories of all the events causing the major power outages
OUTAGE.DURATION	Duration of outage events (in minutes)
CUSTOMERS.AFFECTED	Number of customers affected by the power outage event
TOTAL.CUSTOMERS	Annual number of total customers served in the U.S. state
POPULATION	Population in the U.S. state in a year

Dataset Cleaning & Analysis

To use the data from the columns, we first need to clean them up.

First, I dropped all the columns I will not be using for the analysis. I kept only seven columns, YEAR, U.S._STATE, OUTAGE.DURATION, CAUSE.CATEGORY, CUSTOMERS.AFFECTED, TOTAL.CUSTOMERS, POPULATION

YEAR	U.S._STATE	OUTAGE.DURATION	CAUSE.CATEGORY	CUSTOMERS.AFFECTED	TOTAL.CUSTOMERS	POPULATION
2011	Minnesota	3060	severe weather	70000	2.5957e+06	5.34812e+06
2014	Minnesota	1	intentional attack	nan	2.64074e+06	5.45712e+06
2010	Minnesota	3000	severe weather	70000	2.5869e+06	5.3109e+06
2012	Minnesota	2550	severe weather	68200	2.60681e+06	5.38044e+06
2015	Minnesota	1740	severe weather	250000	2.67353e+06	5.48959e+06

Univariant Analysis

Refers to the analysis of one variable
Focuses on describing and summarizing the data.
Helps understand the distribution, central tendency, variability, and shape of the data

This bar chart visualizes the distribution of power outage causes by category, showing the number of outages for each cause.

The x-axis represents the cause categories of power outages
the y-axis represents the count of outages within each category, providing a direct comparison of how often each cause contributes to power outages

This map shows how power outages, in terms of customers affected, are distributed across the U.S. states. It identifies regions where the most significant disruptions occur (e.g., high-impact states are shaded darker).

Bivariant Analysis

Refers to the analysis of the relationship between two variables.
Determines if one variable can predict or explain the behavior of another variable.

This bar chart shows how changes in the cause of outages (across years) influence the total number of customers affected. Each bar represents a year, with the different cause categories (e.g., Weather, Human Error, Equipment Failure) stacked next to each other to show the proportion of customers affected by each cause.

The x-axis represents Years
The y-axis represents the total number of customers affected by outages

Interesting Aggregates

Remember our initial question : Which cause category is most responsible for power outages in different states?

To help us answer this after seeing the visualizations of our data, I thought it would be helpful to group ‘CAUSE.CATEGORY’ and ‘U.S._STATES’ columns, sum the ‘CUSTOMERS.AFFECTED,’ to see how many customers were affected by outages in each state, broken down by the cause of the outage.

CAUSE.CATEGORY	U.S._STATE	CUSTOMERS.AFFECTED
equipment failure	AK	14273
equipment failure	AR	0
equipment failure	AZ	167000
equipment failure	CA	1.39026e+06
equipment failure	DE	18400

Then, I pivoted the table to better analyze the relationship between the cause categories (e.g., Weather, Human Error, Equipment Failure) and the U.S. states, showing how each cause affects the number of customers in each state.

U.S._STATE	equipment failure	fuel supply emergency	intentional attack	islanding	public appeal	severe weather	system operability disruption
AK	14273	nan	nan	nan	nan	nan	nan
AL	nan	nan	0	nan	nan	471644	nan
AR	0	nan	9200	0	54094	556466	nan
AZ	167000	nan	2713	nan	nan	180911	229000
CA	1.39026e+06	0	127920	131019	0	2.05794e+07	3.34489e+06

Imputation & Missing Values

I did not conduct any imputation on the missing values since the ‘NaN’ values represent missing data in the dataset, which could be due to various reasons like no reported outages for specific causes, irrelevance of certain causes for particular states, or gaps in the data collection process. However, having said that, either filling them with zeros or mean or median values could be valid depending on different analysis needs and goals.

# Prediction Problem Now that we are acquainted with our dataset thanks to our initial question, “Which cause category is most responsible for power outages in different states?”

### Prediction question “Can we predict the number of customers affected based on the cause and duration of the outage?”

This question is a regression prediction problem.

Baseline Model

Because my prediction problem is a regression problem, I started with using a linear regression base model with Mean Absolute Error, Root Mean Squared Error, and R-squared metrics. Linear Regression assumes a linear relationship between the features and the target and the metrics gives model’s predictive performance and its ability to explain the variance in the target variable.

Target Variable: ‘CUSTOMERS.AFFECTED’

Two Features:

CAUSE.CATEGORY (Nominal feature)
OUTAGE.DURATION (Quantitative feature)

Note:

There are total of 1534 rows to work with however, I will be dropping the NaN values from the 3 columns I am working with, ‘CUSTOMERS.AFFECTED’, ‘OUTAGE.DURATION’, ‘CAUSE.CATEGORY’
I will be dropping the rows because do not know what the right answers to these NaN values of the columns will be. Also, we are trying to predict the ‘CUSTOMERS.AFFECTED’; therefore, there is no point in imputing the NaN values. Trying to impute the NaN rows will just lead us to inaccurate and misleading predictions.
Also, the NaN values of ‘OUTAGE.DURATION’ will also be dropped and not imputed because a lot of those data simply do not have any information across all columns to draw imputations from, meaning, that those rows were purposely not filled out by the dataset creators.
I will be working with a total of 1052 rows of data
I have set my code to train on 80% of the rows (844 rows) and test on 20% of the rows (212rows)

Baseline Model Metric Results:

Mean Absolute Error (MAE): 129084.54289691892
Root Mean Squared Error (RMSE): 302647.37444542814
R-squared (R^2): 0.13119692403417982

These performance results are BAD. Mean Absolute Error shows that the predictions are off by about 129,084 customers. This is a significant amount, suggesting that the model isn’t providing accurate predictions.

Root Mean Squared Error shows the model is making large errors in some cases.

R-squared shows that the model explains only about 13.1% of the variance in the target variable, which is quite low, suggesting that the model is not capturing much of the underlying patterns in the data and is not very predictive.

Some of the reaons why I think the results turned out so poorly is that maybe the variables have weak correlations and they may have non-linear relationships.

Final Model

Since my results from my baseline model was so poor, I decided to make my own columns using the given data to make a stronger and effective impact.

Newly Added Features

The first feature I created was ‘OUTAGE_SEVERITY’ = ‘OUTAGE.DURATION’ * ‘CUSTOMERS.AFFECTED’
- I thought this was an important feature to add since it covers and combines both the duration of the outage time and how many people it affected
- These two factors are crucial to predicting the total impact of an outage
The second feature I created was ‘POP_DENSITY’ = (‘POPDEN_URBAN’ + ‘POPDEN_RURAL’ + ‘POPDEN_UC’) / 3
- I thought this was an important feature to add since it helps model how the location and concentration of people affects the number of customers impacted

Modeling Algorithm

Random Forest Regression!
According to my Baseline Model result, it showed that the target and feature variables seemed to have non-linear relationships
Random forest is one of the algorithms that can capture non-linear relationships
Random forest works with both numerical and categorical features
Less likely to overfit or underfit when there are multiple trees -> less noise
Works well with GridSearchCV -> optimize hyperparameters
According to GeeksforGeeks “Hyperparameter tuning is the process of selecting the optimal values for a machine learning model’s hyperparameters.”

Hyperparameter, Best Parameter, GridSearchCV

n_estimators = [50, 100, 200]:

Testing multiple values of n_estimators helps find the optimal number of trees
50 trees - Faster training and less computationally expensive
100 trees - Default starting point and provides a good balance for most use cases (according to Google)
200 trees - More accurate but computationally expensive. More trees usually make the model more robust, as the model has more trees to get predictions from

max_depth = [100, 10, 20]:

three different maximum depths for the trees to see which tree depth strikes the right balance
100 means trees can grow 100 levels deep -> detailed patterns and interactions between features, risks overfitting
10 means it captures general patterns up to maximum depth of 10, which is relatively shallow. Risks underfitting
20 means a tree with a maximum depth of 20. This helps it not overfit like depth 100 but not underfit like depth 10

max_features: sqrt

Using the square root of the number of features lead to a good model performance
Helps to reduce overfitting

Final Model Metric Results:

Mean Absolute Error (MAE): 63804.31157813248
Root Mean Squared Error (RMSE): 174916.45915436902
R-squared (R^2): 0.7097923321184068

These graphs compare the metric results of Baseline and Final Models.

Conclusion

Based on the final evaluation metrics, there was a huge improvement from the baseline model evaluation metrics. This means that the final model can reasonably predict the number of customers affected by an outage, but it is not perfect. The R^2 score suggests a good fit, but, as always, there is obviously room for improvement. The next steps would require exploring and finding different ways of reducting the large prediction errors, shown by the high value of Root Mean Squared Error.