River metabolism in the contiguous United States: Random forest model code, inputs and outputs
Data files (Sep 12, 2025 version; 1.43 GB total)
- appling_data_06202024.csv (358.39 MB)
- ER_lower_pred_meanQ.zip (26.77 MB)
- ER_upper_pred_mean_Q.zip (22.96 MB)
- fig_1_nep_maps.pdf (369.90 MB)
- GPP_lower_pred_meanQ.zip (25.08 MB)
- GPP_upper_pred_meanQ.zip (24.59 MB)
- MERIT_Hydro_CONUS_catchments.zip (197.47 MB)
- random_forest_CONUS_ER.ipynb (684.88 KB)
- random_forest_CONUS_GPP.ipynb (692.38 KB)
- README.md (8.98 KB)
- scaling_dataset_flow_pct_07102024.csv (102.46 MB)
- scaling_dataset_Qmonth_06082024.csv (127.33 MB)
- scaling_ER_pred.zip (24.85 MB)
- scaling_ER_Q10.zip (24.55 MB)
- scaling_ER_Q90.zip (24.89 MB)
- scaling_GPP_pred_30cap.zip (24.71 MB)
- scaling_GPP_pred.zip (24.78 MB)
- scaling_GPP_Q10.zip (24.44 MB)
- scaling_GPP_Q90.zip (24.86 MB)
Abstract
River metabolism is among the most uncertain fluxes in the global carbon cycle. We present estimates for gross primary productivity (GPP) and ecosystem respiration (ER) for over 175,000 rivers across the contiguous United States (CONUS), including metabolic responses to extreme hydrological conditions. Our model predicts annual GPP in CONUS rivers of 10.1 Tg-C yr⁻¹ and ER of 18.7 Tg-C yr⁻¹, implying that net ecosystem productivity (NEP = GPP – ER) is a small contributor to river CO2 emissions. More than 70% of river metabolism occurs in the west, where regions of both extreme heterotrophy and autotrophy exist. Autotrophy is prominent across the west and is sensitive to drought, particularly in understudied biomes like arid desert shrublands, which may indicate that global riverine uptake of CO2 is higher than hypothesized.
This folder contains the code, input files and output files for the random forest models discussed in "River metabolism in the contiguous United States: A west of extremes" by Taylor Maavara, Zimin Yuan, Andrew Johnson, Shuang Zhang, Kelly S. Aho, Craig B. Brinkerhoff, Laura A. Logozzo and Peter Raymond, in the journal Science.
The code was developed in Python 3 using RandomForestRegressor from scikit-learn 1.4.2, with pandas 2.2.1, numpy 1.26.4, matplotlib 3.8.3, scipy 1.13.0 and associated dependencies. If you have any questions, feel free to contact Taylor Maavara at maavarat@caryinstitute.org.
The following files are included:
'random_forest_CONUS_GPP.ipynb' and 'random_forest_CONUS_ER.ipynb': these Jupyter notebooks contain the Python code to train and test the GPP and ER random forest models, respectively. The notebooks cover hyperparameter tuning, training the model on the full set of 25 features, variable importance testing, partial dependence plots, simplifying the model to 7 features, and scaling to all CONUS reaches. They also contain sections for running the 2.5% and 97.5% confidence interval uncertainty analyses, the flow scenarios, and the scenario in which GPP is capped at 30 g O2/m2/day, all discussed in the paper, along with instructions for running the scenarios for different months. The notebooks are commented and include markdown that explains what each section of code does and indicates where changes are needed to alter the output; for example, you will need to manually change which month is output when scaling to the CONUS. Units for all tabular input and output files are given below and in Table S1 of the supplementary material associated with the paper, which also gives source references for each parameter.
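The workflow described above (tune hyperparameters, train on all 25 features, rank feature importance, then refit on the 7 most important features) can be sketched roughly as follows. This is only an illustration on synthetic data, not the code from the notebooks: the random data, the parameter grid, the train/test split and the use of `GridSearchCV` are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the training table (ASSUMPTION: not the real data).
rng = np.random.default_rng(0)
n_samples, n_features, n_keep = 300, 25, 7
X = rng.normal(size=(n_samples, n_features))
# Only the first three synthetic features actually drive the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameter tuning over a small, hypothetical grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=3,
)
search.fit(X_train, y_train)
full_model = search.best_estimator_  # refit on all 25 features

# Rank features by impurity-based importance and keep the top 7.
ranked = np.argsort(full_model.feature_importances_)[::-1]
top = ranked[:n_keep]

# Refit a simplified model on the selected features only.
simple_model = RandomForestRegressor(random_state=0, **search.best_params_)
simple_model.fit(X_train[:, top], y_train)
r2 = simple_model.score(X_test[:, top], y_test)
print("selected feature indices:", sorted(top.tolist()))
print("test R^2 of simplified model:", round(r2, 3))
```

In the real notebooks the feature matrix would come from 'appling_data_06202024.csv', and the selection of the 7 retained features follows the variable importance testing described in the paper rather than the simple ranking used here.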
'appling_data_06202024.csv' contains all of the input data needed to train and test the models, with the bulk of this data coming from Appling, A. P., Hall Jr, R. O., Yackulic, C. B. & Arroita, M. Overcoming equifinality: Leveraging long time series for stream metabolism estimation. Journal of Geophysical Research: Biogeosciences 123, 624-645 (2018). Units and variable names for variables from Appling et al 2018 are the same as given in that dataset. Variables added to this dataset for the purposes of this analysis are the following:
- width: stream width, from MERIT Hydro, in meters
- latitude: in decimal degrees
- longitude: in decimal degrees
- COMID: the relevant MERIT Hydro COMID identifier
- unit area: local COMID sub watershed area (km2)
- lengthkm: river length for each MERIT Hydro reach (km)
- sinuosity: river sinuosity from MERIT Hydro (unitless)
- slope: river slope from MERIT Hydro (unitless)
- uparea: total upstream contributing watershed area to each MERIT Hydro reach (km2)
- order: Strahler stream order
- TCC_mean: total canopy cover (percentage)
- elev_mean: mean elevation in the centre of the sub-watershed (meters above sea level)
- TN_median: median total nitrogen concentration (ppm), seasonal
- TP_median: median total phosphorus concentration (ppm), seasonal
- precip_mean: mean monthly precipitation per MERIT Hydro sub watershed (mm)
- temp_mean: mean monthly air temperature per MERIT Hydro sub watershed (degrees C)
- crops_mean: local MERIT Hydro proportion of land cover that is crops (unitless)
- forest_mean: local MERIT Hydro proportion of land cover that is forested (unitless)
- shrubs_mean: local MERIT Hydro proportion of land cover that is desert, arid, grass/shrub/scrubland (DAGS) (unitless)
- urb_barren_mean: local MERIT Hydro proportion of land cover that is urban or barren (unitless)
- wet_mean: local MERIT Hydro proportion of land cover that is wetland (unitless)
- total_uparea_forest: total upstream area that is forested (km2)
- uparea_prop_shrubs: proportion of total upstream area that is desert, arid, grass/shrub/scrubland (DAGS) (unitless)
- uparea_prop_crops: proportion of total upstream area that is crops (unitless)
- uparea_prop_wet: proportion of total upstream area that is wetland (unitless)
- uparea_prop_urb_barren: proportion of total upstream area that is urban or barren (unitless)
- uparea_prop_forests: proportion of total upstream area that is forested (unitless)
Explanations of how each of these variables was determined are given in the Methods section of the paper.
'scaling_dataset_Qmonth_06082024.csv' contains all features for all 175116 MERIT Hydro river reaches in the CONUS, needed to scale the trained random forest model to the CONUS. Many of these features have the same names as in appling_data_06202024.csv, but those that do not are:
- up_shrubs: proportion of total upstream area that is desert, arid, grass/shrub/scrubland (DAGS) (unitless)
- up_crops: proportion of total upstream area that is crops (unitless)
- up_wet: proportion of total upstream area that is wetland (unitless)
- up_barren: proportion of total upstream area that is urban or barren (unitless)
- up_forest: proportion of total upstream area that is forested (unitless)
- shortwave_[month]_mean: where [month] is replaced by the 3-letter abbreviation for each month; average monthly shortwave irradiation for each MERIT Hydro subwatershed (W/m2)
- lat: latitude (decimal degrees)
- merit_qmean: mean annual average discharge for each MERIT Hydro reach (m3/s)
- merit_width: river width (m) for each MERIT Hydro reach
- merit_depth: river depth (m) for each MERIT Hydro reach
- Q_avg_[month]: where [month] is replaced by the name of each month; average monthly discharge for each reach, from GRADES, averaged from 40 years of daily model output (m3/s)
- ML_river_temp_[no]: where [no] is replaced by number 1-12, corresponding to the month of the year; mean monthly river water temperature (degrees C) as predicted using random forest for this analysis.
Explanations of how each of these variables was determined are given in the Methods section of the paper.
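Because the month-specific columns use three different naming schemes (3-letter abbreviation, full month name, and month number), a small helper can make it easier to pick the right columns when switching months. The sketch below is hypothetical: the function name `month_columns` is invented here, and the capitalization of the abbreviation and month name is an assumption that should be checked against the actual CSV header.

```python
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def month_columns(month, abbr_case=str.lower):
    """Return the month-specific column names for month number 1-12.

    ASSUMPTION: abbreviations are lower case ("jul") and full names are
    capitalized ("July"); pass a different abbr_case if the file differs.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    name = MONTH_NAMES[month - 1]
    abbr = abbr_case(name[:3])
    return {
        "shortwave": f"shortwave_{abbr}_mean",  # W/m2
        "discharge": f"Q_avg_{name}",           # m3/s
        "river_temp": f"ML_river_temp_{month}", # degrees C
    }

print(month_columns(7))
```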
'scaling_dataset_flow_pct_07102024.csv' contains the same variables as the previous file, except that instead of mean monthly flows, flow percentiles are given as separate columns labelled 'Q_0','Q_10','Q_20','Q_30','Q_40','Q_50','Q_60','Q_70','Q_80','Q_90','Q_91','Q_92','Q_93','Q_94','Q_95','Q_96','Q_97','Q_98','Q_99','Q_100', where the number is the flow percentile; e.g. Q_90 represents flows in the 90th percentile, or flood conditions. Units are m3/s.
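The percentile columns follow a regular pattern: steps of 10 from 0 to 90, then steps of 1 from 91 to 100. A minimal sketch that generates the expected names is shown below. The helper name is hypothetical, and it assumes the column following 'Q_80' is 'Q_90', consistent with the Q_90 example in the description.

```python
def flow_percentile_columns():
    """Build the 20 expected flow-percentile column names."""
    deciles = list(range(0, 100, 10))     # 0, 10, ..., 90
    upper_tail = list(range(91, 101))     # 91, 92, ..., 100
    return [f"Q_{p}" for p in deciles + upper_tail]

columns = flow_percentile_columns()
print(columns)
```

Comparing this list with the header of the CSV is a quick way to confirm that all percentile columns are present before running a flow scenario.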
'scaling_ER_pred.zip': contains the output ER in g O2/m2/day for mean monthly flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID.
'scaling_GPP_pred.zip': contains the output GPP in g O2/m2/day for mean monthly flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID.
'ER_upper_pred_mean_Q.zip': contains the output ER in g O2/m2/day for mean monthly flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID, for the 97.5% confidence interval uncertainty analysis.
'GPP_upper_pred_meanQ.zip': contains the output GPP in g O2/m2/day for mean monthly flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID, for the 97.5% confidence interval uncertainty analysis.
'ER_lower_pred_meanQ.zip': contains the output ER in g O2/m2/day for mean monthly flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID, for the 2.5% confidence interval uncertainty analysis.
'GPP_lower_pred_meanQ.zip': contains the output GPP in g O2/m2/day for mean monthly flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID, for the 2.5% confidence interval uncertainty analysis.
'scaling_ER_Q10.zip': contains the output ER in g O2/m2/day for Q10 flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID.
'scaling_GPP_Q10.zip': contains the output GPP in g O2/m2/day for Q10 flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID.
'scaling_ER_Q90.zip': contains the output ER in g O2/m2/day for Q90 flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID.
'scaling_GPP_Q90.zip': contains the output GPP in g O2/m2/day for Q90 flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID.
'scaling_GPP_pred_30cap.zip': contains the output GPP in g O2/m2/day for mean flows for each MERIT Hydro reach in the CONUS for the specified month, and the associated COMID, for the scenarios where GPP fluxes in the training dataset are capped at 30 g O2/m2/day.
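Since each GPP and ER output is keyed by COMID, per-reach net ecosystem productivity can be derived by matching the two outputs and applying NEP = GPP – ER from the abstract. The sketch below uses small in-memory dictionaries with made-up COMIDs and fluxes. It assumes ER is stored as a positive magnitude; if the files store ER as negative values, the subtraction should become an addition.

```python
def nep_by_comid(gpp, er):
    """Compute NEP = GPP - ER (g O2/m2/day) for reaches present in both inputs.

    gpp, er: dicts mapping COMID -> flux.
    ASSUMPTION: ER values are positive magnitudes.
    """
    shared = gpp.keys() & er.keys()
    return {comid: gpp[comid] - er[comid] for comid in shared}

# Hypothetical values for illustration only (not taken from the dataset).
gpp = {11000001: 2.5, 11000002: 0.75}
er = {11000001: 1.5, 11000002: 3.0, 11000003: 4.0}

nep = nep_by_comid(gpp, er)
for comid in sorted(nep):
    status = "autotrophic" if nep[comid] > 0 else "heterotrophic"
    print(comid, nep[comid], status)
```

Reaches that appear in only one of the two inputs are dropped rather than filled, so a mismatch in COMID coverage shows up as a shorter result.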
'fig_1_nep_maps.pdf' contains a high resolution version of Figure 1 from the paper.
'MERIT_Hydro_CONUS_catchments.zip' contains a shapefile of the CONUS sub-catchments with COMIDs so that output can be displayed visually.
