
Data for: A new paradigm for medium-range severe weather forecasts: Probabilistic random forest-based predictions

Cite this dataset

Hill, Aaron J.; Schumacher, Russ S.; Jirak, Israel L. (2023). Data for: A new paradigm for medium-range severe weather forecasts: Probabilistic random forest-based predictions [Dataset]. Dryad.


Historical observations of severe weather and simulated severe weather environments (i.e., features) from the Global Ensemble Forecast System v12 (GEFSv12) Reforecast Dataset (GEFS/R) are used in conjunction to train and test random forest (RF) machine learning (ML) models to probabilistically forecast severe weather out to days 4–8. RFs are trained with ~9 years of the GEFS/R and severe weather reports to establish statistical relationships. Feature engineering is briefly explored to examine alternative methods for gathering features around observed events, including simplifying features using spatial averaging and increasing the GEFS/R ensemble size with time-lagging. Validated RF models are tested with ~1.5 years of real-time forecast output from the operational GEFSv12 ensemble and are evaluated alongside expert human-generated outlooks from the Storm Prediction Center (SPC). Both RF-based forecasts and SPC outlooks are skillful with respect to climatology at days 4 and 5 with diminishing skill thereafter. The RF-based forecasts exhibit tendencies to slightly underforecast severe weather events, but they tend to be well-calibrated at lower probability thresholds. Spatially averaging predictors during RF training allows for prior-day thermodynamic and kinematic environments to generate skillful forecasts, while time-lagging acts to expand the forecast areas, increasing resolution but decreasing overall skill. The results highlight the utility of ML-generated products to aid SPC forecast operations into the medium range.


These data include publicly available local storm reports (from NOAA), publicly available Storm Prediction Center (SPC) outlooks, and forecasts generated from the machine learning prediction system detailed in the manuscript. The local storm reports were retrieved from an online public-facing archive and gridded to NCEP grid 4. The SPC outlooks were originally in shapefile format, and ArcGIS was used to convert the shapefiles to netCDF format. The netCDF gridded SPC outlooks were then regridded to NCEP grid 4 so that they could be verified against the local storm reports. Lastly, the machine learning-based forecasts are generated on the same NCEP grid. These datasets are then combined into a single 'master' netCDF file for easy compression and storage. The master netCDF files also carry metadata giving the latitude and longitude points of the grid and the forecast day strings.
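As a rough illustration of what gridding point reports can look like, the sketch below snaps each report to the nearest point of a regular latitude/longitude grid and counts reports per grid point. It is not the procedure used to build this dataset: the 0.5-degree spacing, the nearest-point rule, the 0 to 360 longitude convention, and the function name `grid_reports` are all assumptions made for the example.

```python
def grid_reports(reports, spacing=0.5):
    """Count point reports per grid point of a regular lat/lon grid.

    reports: iterable of (lat, lon) pairs in degrees (lon may be negative).
    spacing: grid spacing in degrees (0.5 is an assumed value).
    Returns a dict mapping (grid_lat, grid_lon) -> number of reports,
    with grid_lon expressed in the 0-360 convention.
    """
    counts = {}
    for lat, lon in reports:
        # Snap latitude to the nearest grid point.
        grid_lat = round(lat / spacing) * spacing
        # Convert longitude to 0-360, snap, and wrap 360 back to 0.
        grid_lon = (round((lon % 360) / spacing) * spacing) % 360
        key = (grid_lat, grid_lon)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical report locations, for illustration only.
example = [(35.2, -97.4), (35.1, -97.6), (40.0, -105.0)]
gridded = grid_reports(example)
```

A binary "event occurred" field for verification could then be derived by treating any grid point with a nonzero count as an event, though the neighborhood or radius used in practice may differ.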

Usage notes

NetCDF files can be opened and viewed with open source programs (e.g., Python, NCO utilities).
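For Python users, a minimal sketch of inspecting one of the files is given below. It assumes the third-party `netCDF4` package is installed; the helper name `describe_netcdf` and the file name in the usage comment are hypothetical, and the variable names inside the master files are not listed here, so the function simply reports whatever variables it finds.

```python
def describe_netcdf(path):
    """Return {variable name: (dimension names, shape)} for a netCDF file."""
    # Imported here so the helper can be defined even if netCDF4 is absent.
    import netCDF4  # third-party package: pip install netCDF4

    # Open read-only; the context manager closes the file on exit.
    with netCDF4.Dataset(path, mode="r") as ds:
        return {
            name: (var.dimensions, var.shape)
            for name, var in ds.variables.items()
        }


# Usage (hypothetical file name; substitute a file from this dataset):
# for name, (dims, shape) in describe_netcdf("master_file.nc").items():
#     print(name, dims, shape)
```

From the command line, `ncdump -h` on a file prints a comparable header summary.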


NOAA, Award: NA20OAR4590350