Data from: The utility of information flow in formulating discharge forecast models: a case study from an arid snow-dominated catchment

Cite this dataset

Bellugi, Dino et al. (2020). Data from: The utility of information flow in formulating discharge forecast models: a case study from an arid snow-dominated catchment [Dataset]. Dryad.


These data accompany the manuscript “The utility of information flow in formulating discharge forecast models: a case study from an arid snow-dominated catchment”, in review at WRR. They were compiled from Boise State University’s Dry Creek Experimental Watershed (DCEW) web site and consist of measurements of climatic data and discharge at various stations in the watershed from 1 January 2001 through 19 July 2017. The data were quality-controlled, gap-filled, and aggregated at scales ranging from 1 day to 6 months. For each aggregation scale, anomalies with respect to the day of water year (DOWY) were also calculated. These data were then used to compute an information-theoretic measure, transfer entropy (TE), which quantifies the information flow from each variable (at each scale) to discharge.


Collection and pre-processing:

Raw hourly data for the Tree-Line (TL) and Lower-Gauge (LG) meteorological stations (see manuscript for exact locations) from 1 January 2001 through 19 July 2017 were obtained from Boise State University for the following set of variables measured in the Dry Creek Experimental Watershed (DCEW): air temperature, precipitation, relative humidity, solar radiation, snow water equivalent (SWE), wind speed, and wind direction. Hourly soil moisture and soil temperature were also obtained from north-facing soil pits at both the LG and TL stations at depths ranging from 5 cm to 100 cm. Hourly discharge for the DCEW catchment was obtained for the same period as the meteorological and soil moisture time series from a station just downstream of the LG meteorological station.

All data were quality-controlled, and any outliers or spurious patterns were identified and removed by hand. Gaps in the data were filled using interpolation, multiple linear regression, and autoregressive models. When possible, multiple linear regression was used to fill gaps, as this provided a synthetic record based on observations from within the catchment. When short gaps occurred at both meteorological stations (LG and TL) simultaneously, autoregressive models or linear interpolation were used to in-fill the records. Overall, only a small portion of the time series (< 6% on average) required gap infilling.
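The two simplest infilling strategies described above can be sketched as follows. This is an illustration only, not the code used to build the dataset: the function names are hypothetical, missing values are assumed to be stored as `None` or NaN, and the regression uses a single predictor series (for example, the same variable at the other station) rather than the multiple linear regression used by the authors.

```python
import math


def is_missing(x):
    """True for None or NaN (assumed markers for a missing observation)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def fill_linear(values):
    """Fill interior gaps by linear interpolation between valid neighbours.

    Leading and trailing gaps are left untouched.
    """
    out = list(values)
    valid = [i for i, v in enumerate(out) if not is_missing(v)]
    for left, right in zip(valid, valid[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            out[i] = out[left] + frac * (out[right] - out[left])
    return out


def fill_regression(target, predictor):
    """Fill gaps in `target` from `predictor` with a least-squares line.

    Simplification: one predictor only. The line is fitted on the time
    steps where both series are observed.
    """
    pairs = [(x, y) for x, y in zip(predictor, target)
             if not is_missing(x) and not is_missing(y)]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if is_missing(y) and not is_missing(x) else y
            for x, y in zip(predictor, target)]
```

A regression fill keeps the synthetic values tied to concurrent observations elsewhere in the catchment, which is why it is preferred when a predictor is available; interpolation is the fallback when both stations are missing at once.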

Snowmelt was computed from decreases in SWE, and a temperature threshold of 0 °C was used to parse precipitation into rainfall or snowfall. We also estimated evaporation at the LG and TL gauges using the Priestley-Taylor method (Priestley & Taylor, 1972), which is based on radiation and is a simplification of the Penman-Monteith combination equation (Monteith, 1981; Penman, 1948). An α of 1.72 was used instead of the commonly used 1.26 to reflect the higher moisture stress of the arid conditions within the catchment. We compared mean monthly estimates of evaporation from the Priestley-Taylor and Penman-Monteith methods with long-term (1916-2005) pan observations from the nearby Arrowrock Dam, Boise River, Idaho, and found better qualitative agreement with the Priestley-Taylor estimates.
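The derived variables in this paragraph can be illustrated with a short sketch. Everything here beyond the 0 °C threshold and α = 1.72 is an assumption made for the example: the function names are hypothetical, precipitation at exactly 0 °C is treated as snow, the slope of the saturation vapour pressure curve uses the FAO-56 expression, the psychrometric constant (0.066 kPa/°C) and latent heat of vaporization (2.45 MJ/kg) are typical fixed values, and net radiation is assumed to be supplied in MJ m⁻² day⁻¹ (the dataset itself reports solar radiation).

```python
import math


def partition_precipitation(precip, air_temp, threshold_c=0.0):
    """Split precipitation into (rainfall, snowfall) by air temperature."""
    rain, snow = [], []
    for p, t in zip(precip, air_temp):
        if t > threshold_c:   # assumption: exactly 0 degC counts as snow
            rain.append(p)
            snow.append(0.0)
        else:
            rain.append(0.0)
            snow.append(p)
    return rain, snow


def snowmelt_from_swe(swe):
    """Snowmelt as the positive part of each step-to-step decrease in SWE."""
    return [0.0] + [max(prev - cur, 0.0) for prev, cur in zip(swe, swe[1:])]


def svp_slope(temp_c):
    """Slope of the saturation vapour pressure curve in kPa/degC (FAO-56 form)."""
    es = 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))
    return 4098.0 * es / (temp_c + 237.3) ** 2


def priestley_taylor(net_radiation, temp_c, alpha=1.72,
                     ground_heat_flux=0.0, gamma=0.066, latent_heat=2.45):
    """Priestley-Taylor evaporation in mm/day.

    net_radiation and ground_heat_flux are in MJ m-2 day-1; gamma (kPa/degC)
    and latent_heat (MJ/kg) are assumed constants.
    """
    delta = svp_slope(temp_c)
    available = net_radiation - ground_heat_flux
    return alpha * delta / (delta + gamma) * available / latent_heat
```

Because α enters as a simple multiplier, switching from 1.26 to 1.72 scales every estimate by the same factor (about 1.37) without changing its seasonal shape.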

The quality-controlled data and derived parameters were aggregated to 1, 7, 14, 30, 60, 90, and 183 day time scales. The 1-day time scale was the shortest evaluated in this study and was computed by taking the daily mean of all meteorological variables except for evapotranspiration, precipitation, rainfall, snowfall, and snowmelt, which were computed as daily totals. The 7, 14, 30, 60, 90, and 183 day aggregation scales were computed with a back-looking moving mean. To remove periodic/seasonal trends from the time series, we computed an anomaly for each variable at each aggregation length by taking the day of water year (DOWY) mean (based on the full period of record) and then subtracting the DOWY mean from the aggregated values. These anomaly time series were used for our transfer entropy analysis and allowed us to detect interactions between hydrologic variables that were not driven by synoptic changes in seasonal conditions. Thus the full suite of candidate variables includes the 61 primary meteorological variables between the two weather stations and four soil pits, with each of these variables having seven aggregated time series (1-day, 1-week, 2-week, 1-month, etc.), for a total of 427 time series. Example time series are provided for the LG and TL stations in Figs. S5 and S6.
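A minimal sketch of the aggregation and anomaly steps, for daily series held in plain Python lists. The function names are hypothetical, and two details are assumptions not stated in the description: the water year is taken to start on 1 October (the usual US convention), and the first `window - 1` values of each moving mean are left undefined rather than computed from a partial window.

```python
from collections import defaultdict
from datetime import date

AGGREGATION_DAYS = (1, 7, 14, 30, 60, 90, 183)  # scales listed in the text


def trailing_mean(values, window):
    """Back-looking moving mean: entry i averages values[i-window+1 .. i]."""
    out = [None] * len(values)
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        if i >= window - 1:
            out[i] = running / window
    return out


def day_of_water_year(d):
    """Day of water year, assuming the water year starts on 1 October."""
    start_year = d.year if d.month >= 10 else d.year - 1
    return (d - date(start_year, 10, 1)).days + 1


def dowy_anomaly(dates, values):
    """Subtract the mean for each day of water year (over the full record)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for d, v in zip(dates, values):
        if v is not None:
            key = day_of_water_year(d)
            sums[key] += v
            counts[key] += 1
    anomalies = []
    for d, v in zip(dates, values):
        key = day_of_water_year(d)
        anomalies.append(None if v is None else v - sums[key] / counts[key])
    return anomalies
```

Applying `trailing_mean` once per entry in `AGGREGATION_DAYS` and then `dowy_anomaly` to each result yields seven anomaly series per variable, which matches the 61 × 7 = 427 count given above.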

TE analysis:

To evaluate information transfers from meteorological and hydrological predictor variables to discharge (and from their anomalies to the discharge anomaly), we computed TE (equation 1 in the manuscript) for lags τ ranging from 1 day to 183 days, using block lengths (k and l) of one. The reduction in uncertainty from a meteorological variable to discharge was deemed statistically significant when the TE value exceeded the 95th percentile of a distribution of TE values computed from 500 randomly shuffled versions of the input data matrices (Ruddell & Kumar, 2009a, b). The full length of the time series used for calculating TE was 6044 days (~16.5 years). Thus, at the maximum lag of 183 days there were still >5800 overlapping data points with which to compute the joint and marginal entropy values for estimating TE. We quantified the relative significance of TE at each lag τ as $T'_{\mathrm{rel},\tau} = (T - T_{0,\tau}) / H_Q$, where $T$ is the TE at lag τ, $T_{0,\tau}$ is the significance threshold at that lag, and $H_Q$ is the total uncertainty (Shannon entropy; Shannon, 1948) in the sink variable, discharge. $T'_{\mathrm{rel},\tau}$ is a normalized version of the TE that quantifies the significant reduction in the uncertainty of discharge relative to the total uncertainty in discharge. The τ associated with the highest $T'_{\mathrm{rel},\tau}$ within the first 183 lags was selected as the critical timescale (i.e., most significant lag) to be used in the forecasting models.
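The procedure can be sketched as below. This is a hedged illustration, not the authors' implementation: the manuscript's equation 1 is not reproduced on this page, so the code uses one common histogram-based form of lagged transfer entropy with k = l = 1. The function names, the equal-width binning, the default of 11 bins, and the choice to shuffle only the source series are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def discretize(series, n_bins):
    """Map values to equal-width bin indices 0 .. n_bins-1 (assumed binning)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in series]


def entropy(symbols):
    """Shannon entropy in bits of a discrete sequence."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())


def transfer_entropy(source, sink, lag):
    """TE (bits) from source to sink at the given lag, block lengths k = l = 1."""
    start = max(lag, 1)
    triples = [(sink[t], sink[t - 1], source[t - lag])
               for t in range(start, len(sink))]
    n = len(triples)
    c_full = Counter(triples)
    c_sink_pair = Counter((y, yp) for y, yp, _ in triples)
    c_past_src = Counter((yp, x) for _, yp, x in triples)
    c_past = Counter(yp for _, yp, _ in triples)
    te = 0.0
    for (y, yp, x), c in c_full.items():
        # ratio p(y | yp, x) / p(y | yp), written with raw counts
        ratio = c * c_past[yp] / (c_past_src[(yp, x)] * c_sink_pair[(y, yp)])
        te += c / n * math.log2(ratio)
    return te


def significance_threshold(source, sink, lag, n_shuffles=500, percentile=95, seed=0):
    """Percentile of TE over shuffled copies of the (discretized) source."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_shuffles):
        shuffled = list(source)
        rng.shuffle(shuffled)
        null.append(transfer_entropy(shuffled, sink, lag))
    null.sort()
    rank = math.ceil(percentile * len(null) / 100) - 1   # nearest-rank percentile
    return null[min(max(rank, 0), len(null) - 1)]


def relative_te(source, sink, lag, n_bins=11, n_shuffles=500):
    """(T - T0) / H_Q for one lag; assumes the sink is not constant."""
    xs, ys = discretize(source, n_bins), discretize(sink, n_bins)
    te = transfer_entropy(xs, ys, lag)
    t0 = significance_threshold(xs, ys, lag, n_shuffles)
    return (te - t0) / entropy(ys)


def critical_lag(source, sink, max_lag, n_bins=11, n_shuffles=500):
    """Lag in 1 .. max_lag with the highest relative TE."""
    return max(range(1, max_lag + 1),
               key=lambda lag: relative_te(source, sink, lag, n_bins, n_shuffles))
```

With daily anomaly series as input, `critical_lag(predictor, discharge, 183)` would return the lag that maximizes the relative TE. Pure Python is slow at 500 shuffles per lag over 183 lags, so a smaller `n_shuffles` is advisable when experimenting.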

Usage notes

The directory contains a Readme file, and each Matlab data file corresponds to an Excel file with the description of the variables and their structure.


Gordon and Betty Moore Foundation, Award: GMBF-4555