Can ingredients based forecasting be learned? Disentangling a random forest's severe weather predictions
Data files
May 06, 2024 version files 25.39 GB
Abstract
Machine learning (ML)-based models have been rapidly integrated into forecast practices across the weather forecasting community in recent years. While ML tools introduce additional data to forecasting operations, there is a need for explainability to be available alongside the model output, such that the guidance can be transparent and trustworthy for the forecaster. This work makes use of the algorithm tree interpreter (TI) to disaggregate the contributions of meteorological features used in the Colorado State University Machine Learning Probabilities (CSU-MLP) system, a random forest-based ML tool that produces real-time probabilistic forecasts for severe weather using inputs from the Global Ensemble Forecast System v12. TI feature contributions are analyzed in time and space for CSU-MLP day-2 and 3 individual hazard (tornado, wind, and hail) forecasts and day-4 aggregate severe forecasts over a 2-yr period. For individual forecast periods, this work demonstrates that feature contributions derived from TI can be interpreted in an ingredients-based sense, effectively making the CSU-MLP probabilities physically interpretable. When investigated in an aggregate sense, TI illustrates that the CSU-MLP system's predictions use meteorological inputs in ways that are consistent with the spatiotemporal patterns seen in meteorological fields that pertain to severe storms climatology. This work concludes with a discussion on how these insights could be beneficial for model development, real-time forecast operations, and retrospective event analysis.
README: Data for: Can Ingredients-Based Forecasting be Learned? Disentangling a Random Forest's Severe Weather Predictions
https://doi.org/10.5061/dryad.0rxwdbs7w
Mazurek, Alexandra C., Aaron J. Hill, Russ S. Schumacher, and Hanna J. McDaniel: "Can Ingredients-Based Forecasting be Learned? Disentangling a Random Forest's Severe Weather Predictions", Weather and Forecasting.
Day 2, 3, and 4 forecasts from the machine learning-based prediction system detailed in the associated manuscript (cited above) as well as those from the Storm Prediction Center (SPC) and observations (local storm reports) of severe thunderstorm hazards are included in this dataset. Forecasts, outlooks, and observations for each forecast day (day 2, day 3, day 4) are contained in a single netCDF file. For the day 2 and 3 forecasts, the netCDF files contain three separate machine learning-based forecasts for tornado, wind, and hail hazards; the day 4 files contain one forecast for "any severe" hazard (tornado, hail, or wind). The feature contribution files contain the tree interpreter-derived contributions to the machine learning-based forecasts that are in the aforementioned netCDF files. The tree interpreter approach is described in the associated manuscript. Similar to the forecast files, there are separate netCDF files for each of the forecast lead times and hazard types (7 total: day 2 hail, wind, and tornado; day 3 hail, wind, and tornado; and day 4 any severe), with each containing the contributions associated with the forecasts for the respective lead time.
Description of the data and file structure
Forecast data
There are 4 netCDF files containing the forecast data for the day 2, day 3 (two files, see forecast data notes below for details) and day 4 forecasts:
- csu_mlp_2021_tor_day2.nc
- csu_mlp_2021_tor_day3.nc
- csu_mlp_2021_severe_day3.nc
- csu_mlp_2021_severe_day4.nc
The forecast netCDF files are structured as follows (example of the day 2 forecasts: csu_mlp_2021_tor_day2.nc):
dimensions:
lat = 56 ;
lon = 139 ;
time = 920 ;
coordinates:
float lat(lat) ;
float lon(lon) ;
datetime64 time(time);
time: units = hours ;
time: format = '%Y-%m-%d%H' ;
time: timezone = 'UTC' ;
time: calendar = "gregorian" ;
variables:
CSU-MLP day 2 tornado machine learning-based forecasts
float csu_mlp_2021_tor_day2(time, lat, lon) ;
csu_mlp_2021_tor_day2:grid_type = "Latitude/longitude" ;
csu_mlp_2021_tor_day2:initial_time = "10/04/2020 (12:00)" ;
csu_mlp_2021_tor_day2:first_init = "20201004" ;
csu_mlp_2021_tor_day2:valid_day_end = "20230412" ;
csu_mlp_2021_tor_day2:version = "csu_mlp_2021_tor_day2" ;
SPC day 2 tornado outlook probabilities
float day2otlk_netcdf_torn_fine_single_only(time, lat, lon) ;
day2otlk_netcdf_torn_fine_single_only:first_init = "20201004" ;
day2otlk_netcdf_torn_fine_single_only:valid_day_end = "20230412" ;
day2otlk_netcdf_torn_fine_single_only:version = "day2otlk_netcdf_torn_fine_single_only" ;
gridded hail reports
float hail_gridded(time, lat, lon) ;
hail_gridded:first_init = "20201004" ;
hail_gridded:valid_day_end = "20230412" ;
gridded wind reports
float wind_gridded(time, lat, lon) ;
wind_gridded:first_init = "20201004" ;
wind_gridded:valid_day_end = "20230412" ;
gridded tornado reports
float tor_gridded(time, lat, lon) ;
tor_gridded:first_init = "20201004" ;
tor_gridded:valid_day_end = "20230412" ;
CSU-MLP day 2 wind machine learning-based forecasts
float csu_mlp_2021_wind_day2(time, lat, lon) ;
csu_mlp_2021_wind_day2:grid_type = "Latitude/longitude" ;
csu_mlp_2021_wind_day2:initial_time = "10/04/2020 (12:00)" ;
csu_mlp_2021_wind_day2:first_init = "20201004" ;
csu_mlp_2021_wind_day2:valid_day_end = "20230412" ;
csu_mlp_2021_wind_day2:version = "csu_mlp_2021_wind_day2" ;
SPC day 2 wind outlook probabilities
float day2otlk_netcdf_wind_fine_single_only(time, lat, lon) ;
day2otlk_netcdf_wind_fine_single_only:first_init = "20201004" ;
day2otlk_netcdf_wind_fine_single_only:valid_day_end = "20230412" ;
day2otlk_netcdf_wind_fine_single_only:version = "day2otlk_netcdf_wind_fine_single_only" ;
CSU-MLP day 2 hail machine learning-based forecasts
float csu_mlp_2021_hail_day2(time, lat, lon) ;
csu_mlp_2021_hail_day2:grid_type = "Latitude/longitude" ;
csu_mlp_2021_hail_day2:initial_time = "10/04/2020 (12:00)" ;
csu_mlp_2021_hail_day2:first_init = "20201004" ;
csu_mlp_2021_hail_day2:valid_day_end = "20230412" ;
csu_mlp_2021_hail_day2:version = "csu_mlp_2021_hail_day2" ;
SPC day 2 hail outlook probabilities
float day2otlk_netcdf_hail_fine_single_only(time, lat, lon) ;
day2otlk_netcdf_hail_fine_single_only:first_init = "20201004" ;
day2otlk_netcdf_hail_fine_single_only:valid_day_end = "20230412" ;
day2otlk_netcdf_hail_fine_single_only:version = "day2otlk_netcdf_hail_fine_single_only" ;
Forecast Data Notes
All date/times represent the end date/time that the 24-h forecast, outlook, or report period is valid for.
For example, the data for the date '2021-03-20 12:00' in the file would correspond to:
- a day 2 CSU-MLP forecast initialized with GEFS model data that was initialized 2021-03-18 00:00 UTC
- a day 2 SPC outlook issued 2021-03-18
- reports occurring between 2021-03-19 12:00 UTC and 2021-03-20 12:00 UTC
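The date convention in the example above can be checked with a few lines of Python. This is a minimal sketch, not code from the dataset authors: the function name `periods_for_valid_end` is hypothetical, and the 60-hour offset between the 0000 UTC GEFS initialization and the end of the day 2 period is taken from the example above. Applying the same 24-hour step to day 3 and day 4 is an assumption.

```python
from datetime import datetime, timedelta


def periods_for_valid_end(valid_end, lead_day=2):
    """Return (gefs_init, report_start, report_end) for a valid-end timestamp.

    Assumption: the day-N period ends 12 + 24*N hours after the 0000 UTC
    initialization (60 h for day 2, matching the README example).
    """
    hours_to_end = 12 + 24 * lead_day
    gefs_init = valid_end - timedelta(hours=hours_to_end)
    report_start = valid_end - timedelta(hours=24)  # 24-h report window
    return gefs_init, report_start, valid_end


init, start, end = periods_for_valid_end(datetime(2021, 3, 20, 12), lead_day=2)
print(init)   # 2021-03-18 00:00:00
print(start)  # 2021-03-19 12:00:00
print(end)    # 2021-03-20 12:00:00
```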
All CSU-MLP forecasts in this dataset are initialized with data from the 0000 UTC run of the operational Global Ensemble Forecast System (GEFS).
The 0600 UTC SPC convective outlook is used in the dataset for the day 2 period, and the 0730 UTC outlook is used for the day 3 period.
For the day 4 forecasts (in the file titled "csu_mlp_2021_severe_day4.nc"), the CSU-MLP system does not generate individual forecasts for tornadoes, wind, and hail; only one set of probabilities for "any severe" hazard (i.e., tornado, wind, or hail) is generated. SPC also does not issue forecasts for individual severe hazards at this lead time (only one forecast for "any severe" hazard). The variable names for the CSU-MLP forecasts and SPC forecasts at this lead time are 'csu_mlp_2021_severe_day4' and 'day4otlk_netcdf_prob_fine_single_only', respectively. Gridded reports for tornadoes, wind, and hail are still included as separate variables with the same variable names as in the day 2 file.
The forecast file containing the day 3 CSU-MLP forecasts for the individual tornado, wind, and hail forecasts ("csu_mlp_2021_tor_day3.nc") does not contain fields for SPC forecasts, as there are no SPC probabilities issued for individual hazards at this lead time (only "any severe"). The day 3 SPC outlooks for "any severe" can be found in the file titled "csu_mlp_2021_severe_day3.nc". This file also contains CSU-MLP forecasts for "any severe" hazard (variable name 'csu_mlp_2021_severe_day3', not analyzed in this study).
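To keep track of which variable lives in which forecast file, the names documented above can be collected in a small lookup table. This is only a sketch of one way to organize them: the dictionary `FORECAST_LAYOUT` and the function `describe` are hypothetical helpers, and entries that this README does not spell out (for example, the SPC variable name in the day 3 "any severe" file and the day 3 individual-hazard variable names) are left as `None` or omitted rather than guessed.

```python
# File and variable names copied from this README. Anything the README
# does not state explicitly is None or absent, not inferred.
FORECAST_LAYOUT = {
    (2, "tor"): {
        "file": "csu_mlp_2021_tor_day2.nc",
        "csu_mlp": "csu_mlp_2021_tor_day2",
        "spc": "day2otlk_netcdf_torn_fine_single_only",
    },
    (2, "wind"): {
        "file": "csu_mlp_2021_tor_day2.nc",
        "csu_mlp": "csu_mlp_2021_wind_day2",
        "spc": "day2otlk_netcdf_wind_fine_single_only",
    },
    (2, "hail"): {
        "file": "csu_mlp_2021_tor_day2.nc",
        "csu_mlp": "csu_mlp_2021_hail_day2",
        "spc": "day2otlk_netcdf_hail_fine_single_only",
    },
    (3, "severe"): {
        "file": "csu_mlp_2021_severe_day3.nc",
        "csu_mlp": "csu_mlp_2021_severe_day3",
        "spc": None,  # present in the file, but its name is not listed here
    },
    (4, "severe"): {
        "file": "csu_mlp_2021_severe_day4.nc",
        "csu_mlp": "csu_mlp_2021_severe_day4",
        "spc": "day4otlk_netcdf_prob_fine_single_only",
    },
}

# Gridded report variables documented for the day 2 and day 4 files.
REPORT_VARIABLES = ("tor_gridded", "wind_gridded", "hail_gridded")


def describe(lead_day, hazard):
    """Return the documented file and variable names for one forecast."""
    key = (lead_day, hazard)
    if key not in FORECAST_LAYOUT:
        raise KeyError(f"no documented variable names for day {lead_day} {hazard}")
    return FORECAST_LAYOUT[key]


print(describe(2, "wind")["csu_mlp"])  # csu_mlp_2021_wind_day2
print(describe(4, "severe")["spc"])    # day4otlk_netcdf_prob_fine_single_only
```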
Tree Interpreter Feature Contribution files
There are 7 netCDF files containing the feature contributions for the CSU-MLP machine learning-based forecasts. There is one file of contributions for each forecast hazard and lead time used in the study:
- day2h_TIcontributions_2021_to_2022.nc
- day2t_TIcontributions_2021_to_2022.nc
- day2w_TIcontributions_2021_to_2022.nc
- day3h_TIcontributions_2021_to_2022.nc
- day3t_TIcontributions_2021_to_2022.nc
- day3w_TIcontributions_2021_to_2022.nc
- day4_TIcontributions_2021_to_2022.nc
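The file names listed above follow a regular pattern, so they can be built programmatically. The helper below is a hypothetical convenience function, not part of the dataset; it assumes the single letters `h`, `t`, and `w` stand for hail, tornado, and wind, which matches the seven hazard and lead-time combinations described earlier.

```python
# Assumed meaning of the one-letter hazard codes in the file names.
HAZARD_CODES = {"hail": "h", "tornado": "t", "wind": "w"}


def ti_filename(lead_day, hazard=None):
    """Build the feature contribution file name for a lead day and hazard."""
    if lead_day == 4:
        # Day 4 has a single "any severe" file with no hazard letter.
        return "day4_TIcontributions_2021_to_2022.nc"
    if lead_day not in (2, 3):
        raise ValueError("lead_day must be 2, 3, or 4")
    if hazard not in HAZARD_CODES:
        raise ValueError("hazard must be 'hail', 'tornado', or 'wind'")
    return f"day{lead_day}{HAZARD_CODES[hazard]}_TIcontributions_2021_to_2022.nc"


print(ti_filename(2, "hail"))     # day2h_TIcontributions_2021_to_2022.nc
print(ti_filename(3, "tornado"))  # day3t_TIcontributions_2021_to_2022.nc
print(ti_filename(4))             # day4_TIcontributions_2021_to_2022.nc
```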
The feature contribution files are structured in the same way for all forecast hazard types and lead times. The general structure is as follows:
dimensions:
lat = 56 ;
lon = 139 ;
vars = 15 ; (note this dimension is 12 in the feature contributions for the day 4 forecasts only)
hours = 9 ;
init_date = 727 ;
coordinates:
float lat(lat) ;
float lon(lon) ;
object vars(vars);
int hours(hours) ;
int cats() ;
datetime64 init_date(init_date) ;
init_date: units = hours ;
init_date: format = '%Y-%m-%d%H' ;
init_date: timezone = 'UTC' ;
init_date: calendar = "gregorian" ;
variables:
float contributions(init_date, lat, lon, vars, hours) ;
contributions:grid_type = "Latitude/longitude" ;
contributions:initial_time = "01/01/2021 (00:00)" ;
contributions:first_init = "20210101" ;
contributions:valid_day_end = "20221231" ;
Feature Contributions Data Notes
All date/times (init_date coordinate) represent the initialization date/time for the 24-h forecast that the feature contributions are associated with. For example, the feature contributions data for the init_date '2021-03-20 00:00' in the file would correspond to a day 2 CSU-MLP forecast that is initialized with data from 2021-03-20 00:00 UTC and is valid from 2021-03-21 12:00 UTC to 2021-03-22 12:00 UTC.
There are missing feature contributions data for the following dates: 2021-02-04, 2021-03-04, and 2021-12-12.
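The three missing dates account for the length of the init_date dimension (727). The sketch below rebuilds the expected list of initialization dates, assuming one 0000 UTC initialization per day from 2021-01-01 through 2022-12-31 (the dates in the file attributes above). The names `MISSING` and `expected_init_dates` are hypothetical.

```python
from datetime import date, timedelta

# Dates listed in this README as having no feature contribution data.
MISSING = {date(2021, 2, 4), date(2021, 3, 4), date(2021, 12, 12)}


def expected_init_dates(first=date(2021, 1, 1), last=date(2022, 12, 31)):
    """List daily initialization dates in [first, last], skipping MISSING."""
    n_days = (last - first).days + 1
    all_days = (first + timedelta(days=i) for i in range(n_days))
    return [day for day in all_days if day not in MISSING]


dates = expected_init_dates()
print(len(dates))  # 727
```

A list like this can be compared against the init_date coordinate after loading a file to confirm that no additional dates are absent.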
The coordinate "vars" is short for variable. These are the environmental variables that are considered in the CSU-MLP machine learning-based forecasts. Variable names are abbreviated in the dataset, and the full description of each variable can be found in Table 1 of the manuscript associated with this dataset.
The coordinate "hours" represents feature contributions at 3-hour timestamps within the 24-h forecast period. For example, for a day 2 forecast, the feature contributions at hour "0" would correspond to forecast hour 36 (which would be 1200 UTC at the start of the forecast period), hour "1" would correspond to forecast hour 39 (1500 UTC), hour "2" would correspond to forecast hour 42 (1800 UTC), and so on up to hour "8", which would correspond to forecast hour 60 (1200 UTC at the end of the forecast period).
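The mapping from the "hours" index to a forecast hour and a valid time can be written out explicitly. This is a minimal sketch with hypothetical function names; the day 2 values (forecast hours 36 to 60) come from the description above, while the start hours for day 3 and day 4 are an assumption based on the same 24-hour spacing.

```python
from datetime import datetime, timedelta


def forecast_hour(hour_index, lead_day=2):
    """Convert an 'hours' index (0-8) to a GEFS forecast hour."""
    if not 0 <= hour_index <= 8:
        raise ValueError("hour_index must be between 0 and 8")
    # Assumption: the day-N period starts 12 + 24*(N-1) hours after
    # initialization (36 h for day 2, as stated in the README).
    period_start = 12 + 24 * (lead_day - 1)
    return period_start + 3 * hour_index


def valid_time(init_date, hour_index, lead_day=2):
    """Return the UTC valid time for one 'hours' index."""
    return init_date + timedelta(hours=forecast_hour(hour_index, lead_day))


print([forecast_hour(i) for i in range(9)])
# [36, 39, 42, 45, 48, 51, 54, 57, 60]
print(valid_time(datetime(2021, 3, 20, 0), 0))  # 2021-03-21 12:00:00
print(valid_time(datetime(2021, 3, 20, 0), 8))  # 2021-03-22 12:00:00
```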
The coordinate "cats" is short for categories. This is a placeholder variable that is an artifact of the dataset being pared down to reduce file size. The CSU-MLP model system makes three types of forecasts for each hazard/lead time: 0=no severe, 1=severe, and 2=significant severe. Only the feature contributions for the "severe" forecasts (category 1) are included in these files. Feature contributions associated with forecasts of the no severe and significant severe categories can be provided upon request.
Sharing/Access information
Data was derived from the following sources:
- SPC outlooks are available via a public archive at https://www.spc.noaa.gov/.
- Severe weather reports are available from the Severe Weather Database at https://www.spc.noaa.gov/wcm/
Methods
Forecast data: These data include publicly available local storm reports (from NOAA), publicly available Storm Prediction Center (SPC) outlooks, and forecasts generated from the machine learning prediction system detailed in the manuscript. The local storm reports were retrieved from an online public-facing archive and gridded to NCEP grid 4. The SPC outlooks were originally in a shapefile format, and ArcGIS was used to convert the shapefiles to a netCDF format. Then, the netCDF gridded SPC outlooks were regridded to NCEP grid 4 to conduct verification with local storm reports. Lastly, the machine learning-based forecasts are generated on the NCEP grid. Each of these datasets is then combined in a 'master' netCDF file for each forecast lead time examined in the study (day 2, day 3, and day 4) for easy compression and storage. The master netCDF files additionally have metadata associated with the latitude and longitude points of the grid and forecast day strings. Forecasts span October 2020 through April 2023.
Feature contributions: Feature contributions were calculated from the machine learning forecasts described above using the treeinterpreter package for Python. For each forecast day for a given lead time and hazard type (tornado, wind, hail, severe), feature contributions are calculated for all environmental predictors in the dataset (~6,600). For each grid point, the feature contributions are summed according to the spatial neighborhood described in the methods of the associated manuscript for dimensionality reduction purposes. Thus, for a given forecast, the contributions have dimensions of environmental variable, forecast hour, latitude, and longitude. TI contributions corresponding to two years of machine learning forecasts (2021-2022) are combined into single netCDF files for each forecast hazard and lead time (i.e., 7 files in total: day 2 tornado, wind, and hail; day 3 tornado, wind, and hail; and day 4 "any severe").
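For readers unfamiliar with the tree interpreter idea, the toy example below illustrates the general decomposition: a tree's prediction equals the value at the root (the bias) plus the changes in node value along the decision path, each credited to the feature that was split on. It is a pure-Python sketch under stated assumptions and does not call the treeinterpreter package or reproduce the CSU-MLP configuration. The two hand-built trees, the feature names `cape` and `shear`, and all thresholds and node values are hypothetical.

```python
def tree_contributions(node, sample):
    """Walk one decision path; return (bias, {feature: contribution})."""
    bias = node["value"]  # mean prediction at the root
    contributions = {}
    while "feature" in node:  # leaves have no "feature" key
        feature = node["feature"]
        branch = "left" if sample[feature] <= node["threshold"] else "right"
        child = node[branch]
        # Credit the change in node value to the feature used for the split.
        change = child["value"] - node["value"]
        contributions[feature] = contributions.get(feature, 0.0) + change
        node = child
    return bias, contributions


def forest_contributions(trees, sample):
    """Average the bias and per-feature contributions over all trees."""
    bias_total, totals = 0.0, {}
    for tree in trees:
        bias, contribs = tree_contributions(tree, sample)
        bias_total += bias
        for feature, value in contribs.items():
            totals[feature] = totals.get(feature, 0.0) + value
    n = len(trees)
    return bias_total / n, {f: v / n for f, v in totals.items()}


# Two hypothetical trees; "value" is the mean target at that node.
tree_a = {
    "value": 0.125, "feature": "cape", "threshold": 1000.0,
    "left": {"value": 0.0},
    "right": {
        "value": 0.375, "feature": "shear", "threshold": 20.0,
        "left": {"value": 0.25},
        "right": {"value": 0.75},
    },
}
tree_b = {
    "value": 0.25, "feature": "shear", "threshold": 15.0,
    "left": {"value": 0.125},
    "right": {"value": 0.5},
}

sample = {"cape": 2500.0, "shear": 30.0}
bias, contribs = forest_contributions([tree_a, tree_b], sample)
print(bias)                           # 0.1875
print(contribs)                       # {'cape': 0.125, 'shear': 0.3125}
print(bias + sum(contribs.values()))  # 0.625
```

The last line shows the property that makes the contributions interpretable: the bias plus all feature contributions recovers the forest's predicted probability exactly (here 0.625, the mean of the two tree predictions 0.75 and 0.5).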
More details on the methods surrounding each of these datasets can be found in the methods section of the manuscript associated with this work.