Data from: Observation definitions and their implications in machine learning-based predictions of excessive rainfall

Hill, Aaron; Schumacher, Russ; Green, Mitchell

Published Oct 07, 2024 on Dryad. https://doi.org/10.5061/dryad.kwh70rzdx

Data files

Oct 07, 2024 version files (35.53 GB total)

day1_csu_mlp_20201005_20231003.nc (6.02 GB)
day1_exps_20201005_20231003.nc (16.35 GB)
day2_csu_mlp_20201005_20231003.nc (6.07 GB)
day3_csu_mlp_20201005_20231003.nc (6.06 GB)
FF_1yr_24h_training.nc (167.08 MB)
FF_2yr_24h_training.nc (167.08 MB)
README.md (6.12 KB)
ufvs_climo_daily_smoothed_20161001_20220930.nc (582.68 MB)
UFVS_training.nc (106.76 MB)

Abstract

The implications of definitions of excessive rainfall observations for machine learning model forecast skill are assessed using the Colorado State University Machine Learning Probabilities (CSU-MLP) forecast system. The CSU-MLP uses historical observations along with reforecasts from a global ensemble to train random forests to probabilistically predict excessive rainfall events. Here, random forest models are trained using two distinct rainfall datasets: one composed of fixed-frequency (FF) average recurrence interval exceedances and flash flood reports, and the other a compilation of flooding and rainfall proxies (Unified Flood Verification System; UFVS). Both models generate 1-3 day forecasts and are evaluated against a climatological baseline to characterize their overall skill as a function of lead time, season, and region. Model comparisons suggest that regional frequencies in excessive rainfall observations contribute to when and where the ML models issue forecasts, and subsequently to their skill and reliability. Additionally, the spatio-temporal distribution of observations has implications for ML model training requirements, notably how long an observational record is needed to obtain skillful forecasts. Experiments reveal that shorter-trained UFVS-based models can be as skillful as longer-trained FF-based models. In essence, the UFVS dataset exhibits a more robust characterization of excessive rainfall and impacts, and machine learning models trained on more representative datasets of meteorological hazards may not require as extensive training to generate skillful forecasts.

Hill, Aaron J., Russ S. Schumacher, and Mitchell L. Green, Jr. “Observation Definitions and their Implications in Machine Learning-based Predictions of Excessive Rainfall”, Weather and Forecasting (published online ahead of print 2024), https://doi.org/10.1175/WAF-D-24-0033.1

This dataset contains Day 1, 2, and 3 forecasts from the machine learning-based prediction system detailed in the associated manuscript (cited above), as well as outlooks from the Weather Prediction Center (WPC) and observations (Unified Flood Verification System; UFVS) of excessive rainfall hazards. Forecasts, outlooks, and observations for each forecast day are contained in a single netCDF file and labeled accordingly (e.g., day1_csu_mlp_20201005_20231003.nc). An additional forecast file (day1_exps_20201005_20231003.nc) contains a number of experimental machine learning-based forecasts (more than for the other forecast days) that are detailed in the manuscript. Four further data files are included: a spatio-temporally varying climatology derived from the UFVS dataset, UFVS observations used for training, and two fixed-frequency observation datasets used for machine learning model training.

Description of the Data and file structure

A daily climatology of excessive rainfall is provided by the file “ufvs_climo_daily_smoothed_20161001_20220930.nc”. This climatology is computed using a spatio-temporally varying filter function to smooth daily observations of excessive rainfall from the UFVS dataset, with observations spanning from 1 October 2016 to 30 September 2022.

Variables in file:
‘daily_climo_smooth’: Daily climatology (units: probability) smoothed in space and time on the defined latitude and longitude grid
‘dayofyear’: Value from 1-365 corresponding to day of the year. Used as an index for ‘daily_climo_smooth’ variable.
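
To look up the climatology for a given calendar date, the date first has to be mapped onto the 1-365 ‘dayofyear’ axis. The sketch below shows one way to do that with the Python standard library. The function name `climo_index` is hypothetical, and two things are assumptions rather than facts from this README: that position 0 along the axis corresponds to day 1, and that day 366 of a leap year can reuse the last entry. Check the ‘dayofyear’ values in the file before relying on either.

```python
from datetime import date


def climo_index(d: date) -> int:
    """Return a 0-based position along the 'dayofyear' axis for date d.

    ASSUMPTION: the axis holds days 1..365 in order, so position = day - 1.
    ASSUMPTION: day 366 (31 December of a leap year) reuses the last entry.
    """
    day_of_year = d.timetuple().tm_yday  # 1..365, or 1..366 in leap years
    return min(day_of_year, 365) - 1


if __name__ == "__main__":
    for d in (date(2021, 1, 1), date(2021, 12, 31), date(2020, 12, 31)):
        print(d.isoformat(), "->", climo_index(d))
```

The returned integer could then be used to select one day from ‘daily_climo_smooth’ along its day-of-year dimension.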

Each forecast file (i.e., day1_csu_mlp_20201005_20231003.nc, day2_csu_mlp_20201005_20231003.nc, and day3_csu_mlp_20201005_20231003.nc) is structured as follows (example for the day 1 forecasts):

dimensions:
lat = 276 ;
lon = 721 ;
time = 1081 ;
variables:
double lat(lat) ;
double lon(lon) ;
int64 time(time) ;
int64 init(time) ;
int64 lead ;
float csu_mlp_2022_gefso(time, lat, lon) ;
float csu_mlp_2022_gefso_discrete(time, lat, lon) ;
float wpc_ero(time, lat, lon) ;
int64 ufvs_verf(time, lat, lon) ;
float csu_mlp_2022_gefso_ufvs(time, lat, lon) ;
float csu_mlp_2022_gefso_ufvs_discrete(time, lat, lon) ;
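
The dimensions above imply that each gridded variable is large, which matters when deciding whether to load a variable whole or in pieces. The following back-of-the-envelope calculation is a sketch that assumes values are stored uncompressed (4 bytes per float, 8 bytes per int64) and ignores the small coordinate variables; the helper name `variable_bytes` is hypothetical.

```python
# Dimension sizes taken from the day 1 listing above.
N_TIME, N_LAT, N_LON = 1081, 276, 721

# ASSUMPTION: uncompressed storage with these byte widths.
BYTES_PER_VALUE = {"float": 4, "int64": 8, "double": 8}


def variable_bytes(dtype: str, n_time: int = N_TIME,
                   n_lat: int = N_LAT, n_lon: int = N_LON) -> int:
    """Size in bytes of one (time, lat, lon) variable of the given type."""
    return n_time * n_lat * n_lon * BYTES_PER_VALUE[dtype]


# Five float variables plus the int64 'ufvs_verf' variable.
gridded_types = ["float"] * 5 + ["int64"]
total = sum(variable_bytes(t) for t in gridded_types)

print(f"one float variable:    {variable_bytes('float') / 1e9:.2f} GB")  # 0.86 GB
print(f"all gridded variables: {total / 1e9:.2f} GB")                    # 6.02 GB
```

Under these assumptions the total comes to about 6.02 GB, close to the size listed for the day 1 file, so reading a single float variable in full needs a little under 1 GB of memory.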

Probabilistic forecasts from the machine learning models discussed in the manuscript text are contained within the “csu_mlp_2022_gefso” and “csu_mlp_2022_gefso_ufvs” variables, corresponding to forecasts from the fixed-frequency and UFVS models, respectively.
Variables with “_discrete” have discretized forecasts that match the categorical definitions of the WPC excessive rainfall outlooks.
The “wpc_ero” variable contains the probabilistic WPC excessive rainfall outlook forecasts.
The “ufvs_verf” variable contains binary (0 or 1) gridded UFVS observations, indicating whether or not an event occurred.
Each variable has dimensions of (time, latitude, longitude).
All forecasts and observations, or a subset of them, can be retrieved from each netCDF variable by indexing with the time variable, which contains strings of each valid forecast day.
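
Selecting a subset by time can be sketched as follows. This is an illustration rather than the authors' workflow. It assumes the third-party `netCDF4` package is installed and that time steps are stored in chronological order. The exact encoding of the time values is not confirmed here (the listing declares them as int64), so inspect the variable first and pass `start` and `end` in the same form. The function names and the YYYYMMDD-style integers in the usage example are hypothetical.

```python
from typing import List, Sequence


def time_indices(times: Sequence, start, end) -> List[int]:
    """Return the positions i for which start <= times[i] <= end."""
    return [i for i, t in enumerate(times) if start <= t <= end]


def load_subset(path: str, variable: str, start, end):
    """Read only the requested time steps of one gridded variable.

    ASSUMPTION: time steps are in chronological order, so the matching
    positions form one contiguous block that can be read as a slice.
    """
    from netCDF4 import Dataset  # third-party; imported only when needed

    with Dataset(path) as nc:
        times = nc.variables["time"][:].tolist()
        idx = time_indices(times, start, end)
        if not idx:
            return None
        return nc.variables[variable][idx[0]:idx[-1] + 1, :, :]


if __name__ == "__main__":
    # Hypothetical time values, used only to show the index selection.
    example_times = [20201005, 20201006, 20201007, 20201008]
    print(time_indices(example_times, 20201006, 20201007))  # [1, 2]
```

Reading a slice instead of the whole array keeps memory use proportional to the number of days requested.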

An additional forecast file is provided (“day1_exps_20201005_20231003.nc”) which contains forecasts from training experiments as outlined in the manuscript text.

Data variables:
double lat(lat) ;
double lon(lon) ;
int64 time(time) ;
float ufvs_2016_2019(time, lat, lon) ;
float ufvs_2016_2019_discrete(time, lat, lon) ;
float wpc_ero(time, lat, lon) ;
int64 ufvs_verf(time, lat, lon) ;
float ufvs_2018_2019(time, lat, lon) ;
float ufvs_2018_2019_discrete(time, lat, lon) ;
float ufvs_2019(time, lat, lon) ;
float ufvs_2019_discrete(time, lat, lon) ;
float ff_2003_2013(time, lat, lon) ;
float ff_2003_2013_discrete(time, lat, lon) ;
float ff_2009_2019(time, lat, lon) ;
float ff_2009_2019_discrete(time, lat, lon) ;
float ff_2016_2019(time, lat, lon) ;
float ff_2016_2019_discrete(time, lat, lon) ;
float ff_2018_2019(time, lat, lon) ;
float ff_2018_2019_discrete(time, lat, lon) ;
float ff_2019(time, lat, lon) ;
float ff_2019_discrete(time, lat, lon) ;

As in the other forecast files, continuous and discrete forecasts are provided for each experiment and are distinguished by suffix: no suffix for continuous forecasts and the “_discrete” suffix for discretized forecasts. UFVS-based models use the “ufvs” prefix and fixed-frequency models use the “ff” prefix. The years provided in each variable name correspond to the training period of each experiment, as outlined in the manuscript. The WPC ERO (wpc_ero) and UFVS observations (ufvs_verf) are duplicated here for ease of verification.
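
Because the experiment names follow a regular pattern, they can be split into their parts programmatically, for example to loop over all experiments of one type. The sketch below is one possible parser. The names `Experiment` and `parse_experiment` are hypothetical, and treating a single year (as in “ufvs_2019”) as both the first and last training year is an assumption about how to read those names.

```python
import re
from typing import NamedTuple, Optional


class Experiment(NamedTuple):
    dataset: str      # "ufvs" or "ff"
    first_year: int   # first year in the variable name
    last_year: int    # last year in the variable name
    discrete: bool    # True for "_discrete" variables


# prefix, one or two four-digit years, optional "_discrete" suffix
_PATTERN = re.compile(r"^(ufvs|ff)_(\d{4})(?:_(\d{4}))?(_discrete)?$")


def parse_experiment(name: str) -> Optional[Experiment]:
    """Parse an experiment variable name; return None for other variables."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    first = int(match.group(2))
    # ASSUMPTION: a single year means the name refers to that year only.
    last = int(match.group(3)) if match.group(3) else first
    return Experiment(match.group(1), first, last, match.group(4) is not None)


if __name__ == "__main__":
    for name in ("ff_2003_2013", "ufvs_2019_discrete", "wpc_ero"):
        print(name, "->", parse_experiment(name))
```

Variables that are not experiments, such as “wpc_ero” and “ufvs_verf”, do not match the pattern and return `None`.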

Also included in this repository are gridded observations used for training the fixed-frequency and UFVS-based models.

FF_1yr_24h_training.nc file:
dimensions:
days = 3896 ;
lats = 71 ;
lons = 151 ;
variables:
float FFR_CCPA(days, lats, lons) ;

The FFR_CCPA variable contains the gridded binary observations of flash flood reports and 1-yr, 24-h average recurrence interval exceedances as defined in the manuscript.

FF_2yr_24h_training.nc file:
dimensions:
days = 3896 ;
lats = 71 ;
lons = 151 ;
variables:
float FFR_CCPA(days, lats, lons) ;

The FFR_CCPA variable contains the gridded binary observations of flash flood reports and 2-yr, 24-h average recurrence interval exceedances as defined in the manuscript.

UFVS_training.nc file:
dimensions:
days = 1242 ;
latitude = 71 ;
longitude = 151 ;
variables:
double lats(latitude, longitude) ;
lats:_FillValue = NaN ;
double lons(latitude, longitude) ;
lons:_FillValue = NaN ;
int64 ufvs_gridded_obs(days, latitude, longitude) ;
ufvs_gridded_obs:coordinates = “lons lats” ;

The ufvs_gridded_obs variable contains the gridded binary UFVS observations as defined in the paper. “lats” and “lons” give the latitudes and longitudes of the gridded domain.
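
A quick way to get a feel for a binary training grid is to compute, for each grid point, the fraction of days on which an event was observed. The sketch below does this in plain Python on a tiny made-up array with the same (days, latitude, longitude) ordering as the training files. The function name `event_frequency` is hypothetical, and the code assumes every value is 0 or 1 with no missing data, which should be checked against the actual files.

```python
from typing import List, Sequence


def event_frequency(obs: Sequence[Sequence[Sequence[int]]]) -> List[List[float]]:
    """Fraction of days with an event at each grid point.

    obs is indexed as obs[day][lat][lon] and holds 0 or 1.
    ASSUMPTION: there are no missing values in obs.
    """
    n_days = len(obs)
    n_lat = len(obs[0])
    n_lon = len(obs[0][0])
    return [
        [sum(obs[d][i][j] for d in range(n_days)) / n_days for j in range(n_lon)]
        for i in range(n_lat)
    ]


if __name__ == "__main__":
    # Made-up observations: 4 days on a 2 x 2 grid.
    toy_obs = [
        [[0, 1], [0, 0]],
        [[1, 1], [0, 0]],
        [[0, 1], [0, 1]],
        [[0, 1], [0, 0]],
    ]
    print(event_frequency(toy_obs))  # [[0.25, 1.0], [0.0, 0.25]]
```

For the full-size files an array library is the practical choice; with NumPy, for instance, the same quantity is the mean of the array along its first axis.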
