US federal resource allocations are inconsistent with concentrations of energy poverty
Abstract
Recent data from the United States (US) Energy Information Administration reveals that nearly one in three households in the US report experiencing energy poverty, and this number is only expected to rise. Federal assistance programs exist, but allocations across states have been nearly static since 1984, while the distribution of energy poverty is dynamic in location and time. We produce a novel machine learning approach based on sociodemographic and geographical information to estimate energy burden in each US census tract for 2015 and 2020. Our analysis confirms that average household energy burdens increased, and the range of households suffering energy poverty broadened. We provide an optimized allocation structure to urge policy makers to revise the distribution of funds to better match assistance needs.
README: US federal resource allocations are inconsistent with concentrations of energy poverty
https://doi.org/10.5061/dryad.9kd51c5rj
This dataset contains the R scripts and data files needed to replicate the results of this analysis. All analysis is completed in R. An internet connection is required because the RECS input files are loaded directly from the US Energy Information Administration's website to obtain the most up-to-date information.
Description of the data and file structure
“Analysis” Folder
The folder titled "Analysis" contains all of the results presented in this paper. The "Coeffs" subfolder contains the .csv files of model coefficients for both 2015 and 2020.
- 2015_coeffs.csv
- 2020_coeffs.csv
The "Figures" subfolder contains all of the maps, graphs, and performance output from the R scripts.
- Graphs: Histograms of tract average energy burdens for 2015, 2020, and the comparison of 2015 and 2020. Subfolder "2020" also contains the output of determining the assistance funding required for reaching different levels of maximum energy burden across the contiguous US.
- Maps: Contains all map outputs from the 2015 and 2020 machine learning models.
- Model_performance: Contains all output of model performance from the model performance scripts, described below in the Code/Software section.
“Data Files” Folder
The folder titled "Data Files" contains the raw input data and temporary data that is used between steps in the scripts.
Subfolder: 2015
- 2015/ACS: Subfolder contains all the output .rds files from the 2015 ACS Wrangling R code.
- 2015/LIHEAP: LIHEAP_DATA_DOWNLOAD.csv contains the 2015 LIHEAP actual allocations. Column titled “Total Funding” is used for the funding to each state.
- 2015/RECS Input: Subfolder containing all of the output .rds files from the 2015 RECS Wrangling code.
Subfolder: 2020
- 2020/ACS: Subfolder contains all the output .rds files from the 2020 ACS Wrangling R code.
- 2020/LIHEAP: 2020_LIHEAP_DATA_DOWNLOAD.csv contains the 2020 LIHEAP actual allocations. Column titled “Total Funding” is used for the funding to each state. 2020_liheap_state_min_max.csv contains the state minimum and maximum benefit amounts for recipients used in the optimized allocation analysis.
- 2020/RECS Input: Subfolder containing all of the output .rds files from the 2020 RECS Wrangling code.
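A minimal sketch of loading one of the LIHEAP allocation files listed above. Because the funding column is titled “Total Funding” (with a space), base R's read.csv() would rename it to Total.Funding unless check.names = FALSE is set. The example writes a tiny stand-in file to a temporary path; the State column and the funding values are hypothetical placeholders, not values from the dataset.

```r
# Stand-in for a LIHEAP allocation file; "State" and the values are hypothetical.
demo_path <- tempfile(fileext = ".csv")
writeLines(c('"State","Total Funding"', '"AA",100', '"BB",250'), demo_path)

# check.names = FALSE keeps the column name "Total Funding" exactly as titled.
liheap <- read.csv(demo_path, check.names = FALSE)

# Column names containing spaces are accessed with [[ ]] or backticks.
total_funding <- liheap[["Total Funding"]]

# Share of the total allocation going to each row (state).
funding_share <- total_funding / sum(total_funding)
```

To use the real file, replace demo_path with the path to the LIHEAP .csv inside the downloaded "Data Files" folder.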
Subfolder: Census_changes
Subfolder containing the relationship between 2010 census tracts (used in 2015) and 2020 census tracts, downloaded from the US Census Bureau. The 2020_relationship_2010.txt file contains the downloaded data from the US Census Bureau. difference.RDS contains the difference between 2015 and 2020 energy burdens, derived in the 2020_Adaptive_LASSO.Rmd code.
Subfolder: Raw Data
- Raw Data/IECC zones: climate_zones.csv contains the IECC Climate Zone data sourced from https://gist.github.com/philngo/d3e251040569dba67942#file-climate_zones-csv that is used in both the 2015 and 2020 Adaptive-LASSO R code.
- Raw Data/Normals: Subfolder contains the cooling degree day (CDD) and heating degree day (HDD) for each county in the US in both 2015 and 2020. 2015_cdd_county.csv and 2015_hdd_county.csv are the files used in 2015_ACS_Data_Wrangling.Rmd. 2020_cdd_county.csv and 2020_hdd_county.csv are the files used in 2020_ACS_Data_Wrangling.Rmd.
- Raw Data/ZIP_TRACT: Subfolder containing the mapping between zip codes and census tracts for both 2015 and 2020. These data are used in the 2015 and 2020 ACS Data Wrangling notebooks, sourced from https://www.huduser.gov/portal/datasets/usps_crosswalk.html.
Subfolder: Temp Data
Subfolder containing all intermediate .rds files passed between scripts as they are run in order. Each file is output by one R notebook and used as an input in a later step in a different R notebook. These files are all automatically updated when the code is run.
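The hand-off between notebooks through .rds files can be sketched as follows. This is an illustration of the general saveRDS()/readRDS() pattern only; the object, its columns, and the file name are hypothetical and do not correspond to any file in the dataset.

```r
# Hypothetical intermediate object; the real notebooks write their own .rds files.
tract_demo <- data.frame(tract_id = c("t1", "t2"), value = c(0.04, 0.07))

# Written at the end of an earlier notebook (a temporary path is used here).
tmp_file <- file.path(tempdir(), "demo_intermediate.rds")
saveRDS(tract_demo, tmp_file)

# Read back at the start of a later notebook; the object is restored unchanged.
restored <- readRDS(tmp_file)
```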
Sharing/Access information
Data was derived from the following sources:
- US Energy Information Administration's (EIA) Residential Energy Consumption Survey (RECS)
- US Census Bureau's 5-year American Community Survey (ACS)
- US EIA's State Energy Data System (SEDS)
- US Department of Commerce's National Oceanic and Atmospheric Administration (NOAA)
- Pacific Northwest National Laboratory's International Energy Conservation Codes (IECC)
Code/Software
R is required to run all scripts for this analysis. The scripts were created using R version 4.2.2; RStudio Version 2023.12.1+402; Mac OS Sonoma 14.4. The seed for replication is 062023 (already included as a variable in all of the R scripts). Annotations are provided throughout the scripts for loading the relevant data sources, cleaning them, and performing the analysis.
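A short note on the seed, as a sketch: R has no octal literals, so the leading zero in 062023 is simply dropped and the value is the number 62023. The variable name seed below is an assumption for illustration; the scripts may use a different name.

```r
# The leading zero is ignored: R parses 062023 as the number 62023.
seed <- 062023

# Setting the same seed twice reproduces the same random draws.
set.seed(seed)
first_draws <- runif(3)
set.seed(62023)
second_draws <- runif(3)

same_draws <- identical(first_draws, second_draws)
```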
Packages used: "tidycensus", "tidyverse", "sjmisc", "fastDummies", "sf", "units", "foreign", "glmnet", "reshape2", "survey", "haven", "glinternet", "sampleSelection", "data.table", "dplyr", "dbplyr", "tidyr", "stringr", "magrittr", "readr", "srvyr", "ggplot2", "rvest", "caret", "randomForest", "glmtlp", "zipcodeR", "rgdal", "rjson", "readxl", "writexl", "xtable", "openxlsx", "spatstat"
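Before running the notebooks, it may help to check which of the packages above are not yet installed. The sketch below uses only base R; the helper name missing_packages is hypothetical. The install step is left commented out so that nothing is installed unintentionally. Note that some packages in the list (for example rgdal) may no longer be available from CRAN for current R versions and might need to be installed from an archive.

```r
# Package list copied from the README.
required_packages <- c(
  "tidycensus", "tidyverse", "sjmisc", "fastDummies", "sf", "units",
  "foreign", "glmnet", "reshape2", "survey", "haven", "glinternet",
  "sampleSelection", "data.table", "dplyr", "dbplyr", "tidyr", "stringr",
  "magrittr", "readr", "srvyr", "ggplot2", "rvest", "caret",
  "randomForest", "glmtlp", "zipcodeR", "rgdal", "rjson", "readxl",
  "writexl", "xtable", "openxlsx", "spatstat"
)

# Hypothetical helper: returns the packages that are not installed locally.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required_packages)

# Uncomment to install whatever is missing from the default repository.
# if (length(to_install) > 0) install.packages(to_install)
```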
Users should run the scripts in each folder in order: 1. RECS Wrangling, then 2. ACS Wrangling, followed by 3. Machine Learning. The scripts for both 2015 and 2020 in each folder should be run before moving on to the next folder. The user will need to update the working directory in each script to match the directory to which the dataset was downloaded; this applies wherever the setwd() function is used. The code, as uploaded, should then run in succession without any additional steps.
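The run order described above can be written out as an ordered list of notebook paths, which is a convenient way to confirm that every file is where it is expected before starting. This is a sketch that assumes the folder and file names match those given in this README; the helper name notebook_run_order and the root path "path/to/dataset" are hypothetical placeholders.

```r
# Hypothetical helper: builds the notebook paths in the order they should be run.
notebook_run_order <- function(dataset_root) {
  steps <- list(
    "1. RECS Wrangling" = c("2015_RECS_Data_Wrangling.Rmd",
                            "2020_RECS_Data_Wrangling.Rmd"),
    "2. ACS Wrangling" = c("2015_ACS_Data_Wrangling.Rmd",
                           "2020_ACS_Data_Wrangling.Rmd"),
    "3. Machine Learning" = c("2015_Adaptive_LASSO.Rmd",
                              "2020_Adaptive_LASSO.Rmd")
  )
  # Both years in a folder come before any notebook in the next folder.
  unlist(lapply(names(steps), function(folder) {
    file.path(dataset_root, folder, steps[[folder]])
  }))
}

# Replace the placeholder with the directory where the dataset was downloaded.
paths <- notebook_run_order("path/to/dataset")

# Any notebook not found at the expected location is listed here.
not_found <- paths[!file.exists(paths)]
```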
1. RECS Wrangling
This folder contains the R notebook files for loading the Residential Energy Consumption Survey (RECS) data for both 2015 and 2020 (two separate R notebooks), cleaning these data, and preparing them for input into the machine learning model.
- 2015_RECS_Data_Wrangling.Rmd: R notebook for data input and data cleaning for 2015 RECS in preparation for Adaptive-LASSO analysis in 3. Machine Learning folder.
- 2020_RECS_Data_Wrangling.Rmd: R notebook for data input and data cleaning for 2020 RECS in preparation for Adaptive-LASSO analysis in 3. Machine Learning folder.
2. ACS Wrangling
This folder contains the R notebook files for loading the American Community Survey (ACS) data for both 2015 and 2020 (two separate R notebooks), cleaning these data, and preparing them for input into the machine learning model.
- 2015_ACS_Data_Wrangling.Rmd: R notebook for data input and data cleaning for 2015 ACS data in preparation for Adaptive-LASSO analysis in 3. Machine Learning folder. The ACS data is pulled using the tidycensus package, and missing data is supplemented by pulling directly from the sources above.
- 2020_ACS_Data_Wrangling.Rmd: R notebook for data input and data cleaning for 2020 ACS data in preparation for Adaptive-LASSO analysis in 3. Machine Learning folder. The ACS data is pulled using the tidycensus package, and missing data is supplemented by pulling directly from the sources above.
3. Machine Learning
This folder contains the R notebook files for performing the Adaptive-LASSO machine learning models. It also contains a subfolder, titled "Model Performances", that contains the two R notebook files for testing the performance of different machine learning models and provides the data used in selecting the Adaptive-LASSO machine learning model.
- 2015_Adaptive_LASSO.Rmd: R notebook for completing the Adaptive-LASSO machine learning model to produce the census-tract energy burden estimates for 2015.
- 2020_Adaptive_LASSO.Rmd: R notebook for completing the Adaptive-LASSO machine learning model to produce the census-tract energy burden estimates for 2020. This notebook also completes the optimized funding allocation process described in the paper.
- Subfolder "Model Performances":
- 2015_model_performance.Rmd: R notebook for running Ridge, LASSO, and Adaptive-LASSO models to test their performance with 2015 RECS data inputs.
- 2020_model_performance.Rmd: R notebook for running Ridge, LASSO, and Adaptive-LASSO models to test their performance with 2020 RECS data inputs.
Methods
We use machine learning to determine how various demographic and physical characteristics are correlated with household energy burdens across the US. Energy burden estimates allow us to identify where energy poverty may be concentrated at the census-tract level. Our analysis extends and improves in several ways upon the Low-income Energy Affordability Data (LEAD) tool, developed by the US Department of Energy’s National Renewable Energy Laboratory to estimate energy expenditures and burdens (28). The LEAD tool is designed to help local and state governments make decisions about addressing energy poverty; however, it is static in time and uses self-reported energy expenditures given for only one month of the year, which is not reported publicly. Because the tool relies on a single month, its estimates of annual values are not guaranteed to account for seasonal variation in energy costs. For the estimates to do so, the survey's sampling would have to cover all months of the year sufficiently, and this is not verifiable from the publicly available data. In addition, the month used varies across respondents. Unlike LEAD, we use household-level sociodemographic and geographic data, detailed in the following subsection, from the Energy Information Administration’s (EIA) Residential Energy Consumption Survey (RECS) to estimate the annual energy burden. This survey is completed every five years, enabling us to track changes in energy burden over time. To develop our projections at the census-tract level, we use an adaptive least absolute shrinkage and selection operator (LASSO) technique to select important variables from the RECS data, which are then applied to census-tract level information from the US Census Bureau’s American Community Survey (ACS).
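The general shape of an adaptive LASSO fit followed by tract-level prediction can be sketched with the glmnet package, which is among the packages listed for this analysis. This is an illustrative sketch on synthetic data, not the notebooks' implementation: the use of a ridge fit for the initial weights, the choice gamma = 1, and all variable names and numbers are assumptions made for the example.

```r
library(glmnet)

set.seed(62023)

# Synthetic stand-in for household-level survey data: 200 households, 6 predictors.
n <- 200
p <- 6
x_household <- matrix(rnorm(n * p), n, p,
                      dimnames = list(NULL, paste0("x", 1:p)))

# Only x1 and x2 drive the synthetic response (a made-up "energy burden").
burden <- 0.05 + 0.02 * x_household[, "x1"] - 0.015 * x_household[, "x2"] +
  rnorm(n, sd = 0.005)

# Step 1: initial ridge fit (alpha = 0) to obtain coefficient estimates.
ridge_cv <- cv.glmnet(x_household, burden, alpha = 0)
ridge_coef <- as.matrix(coef(ridge_cv, s = "lambda.min"))[-1, 1]  # drop intercept

# Step 2: adaptive weights; small initial coefficients receive a larger penalty.
gamma <- 1
adaptive_weights <- 1 / abs(ridge_coef)^gamma

# Step 3: LASSO (alpha = 1) with the per-variable penalty factors.
alasso_cv <- cv.glmnet(x_household, burden, alpha = 1,
                       penalty.factor = adaptive_weights)
selected <- as.matrix(coef(alasso_cv, s = "lambda.min"))

# Step 4: apply the fitted model to tract-level values of the same predictors
# (five synthetic tracts here).
x_tract <- matrix(rnorm(5 * p), 5, p,
                  dimnames = list(NULL, colnames(x_household)))
tract_burden <- as.numeric(predict(alasso_cv, newx = x_tract, s = "lambda.min"))
```

The key design point is that the model is trained on household-level records but applied to a predictor matrix with the same columns built from tract-level data, so the two inputs must share variable definitions and ordering.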