Prediction data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates

Greenhill, Simon1; Druckenmiller, Hannah2; Wang, Sherrie3; Keiser, David4; Girotto, Manuela1; Moore, Jason5; Yamaguchi, Nobuhiro1; Todeschini, Alberto1; Shapiro, Joseph1

Published Dec 10, 2023 on Dryad. https://doi.org/10.5061/dryad.z34tmpgm7

Abstract

We assess which waters the Clean Water Act protects and how Supreme Court and White House rules change this regulation. We train a deep learning model using aerial imagery and geophysical data to predict 150,000 jurisdictional determinations from the Army Corps of Engineers, each deciding regulation for one water resource. Under a 2006 Supreme Court ruling, the Clean Water Act protects two-thirds of US streams and over half of wetlands; under a 2020 White House rule, it protects under half of streams and a fourth of wetlands, implying deregulation of 690,000 stream miles, 35 million wetland acres, and 30% of waters around drinking water sources. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.

This dataset contains data used to produce the predictions and other results reported in Greenhill et al. (2023). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper via this repository. We are also providing access to all data used to train the models in another Dryad repository: https://doi.org/10.5061/dryad.m63xsj47s. All code written for the project is available at https://doi.org/10.5281/zenodo.10108709.

Description of the data and file structure

The files here include:

Model predictions, including for grid points, PJDs, Sackett points, and navigable waters: wotus_model_predictions.zip, resource_type_model_predictions.zip, cowardin_code_predictions.zip, ajd_model_predictions.zip
Input layers used for prediction:
- grid_1.zip: a random sample of approximately 80,000 of the prediction grid points
- navigable_water_prediction_points.zip
- Sackett_prediction_points.zip
- PJD_prediction_points.zip
Auxiliary data:
- text_questions_data.zip: Data used for producing in-text statistics
- ajd_point_intersections.csv: Intersections between AJDs and various other geophysical layers such as NWI polygons, NHD flowlines, urban growth areas, etc.
- prediction_points_to_drop.zip: pickle objects containing IDs of missing and/or corrupted prediction points. These can occur when a layer is missing data for the requested prediction point.
- prediction_point_metadata.csv, sackett_metadata.csv: files containing metadata about the 4 million and Sackett prediction points, respectively, including state, ACE district, and HQ distance. This information is used to create ordinal layers. Note that PJD_prediction_points.zip and navigable_water_prediction_points.zip contain analogous files, pjd_metadata.csv and nav_water_metadata for those sets of prediction points.
Data for creating displays: table*_data.zip, figure*_data.zip

Description of file contents

Each set of files described in the bulleted list above has contents that are structured in similar ways and contain similar information. The contents for each category are described in more detail below, as appropriate. Unless otherwise noted, any N/A or null values represent missing values.

Model predictions

The files in this category are wotus_model_predictions.zip, resource_type_model_predictions.zip, cowardin_code_predictions.zip, and ajd_model_predictions.zip. Each of these is a zipped directory containing model outputs (predictions) from each model.
The directories and the files in each of them are:

wotus_model_predictions: predictions from WOTUS-ML. There are files for each of the 50 grids (e.g., grid1_predictions_Rapanos.csv); the points on traditional navigable waters (e.g., nav_water_predictions_Rapanos.csv); the 3,000 points near the Sackett property (e.g., Sackett_predictions_Rapanos.csv); the 101,000 preliminary jurisdictional determinations (PJDs); and the training, validation, and test sets of Approved Jurisdictional Determinations (AJDs). See SM section A.3 for details on the AJD and PJD data and SM section A.5 for details on the other prediction points. For each set of prediction points, there are three files, one for each of the three WOTUS rules we analyze. We denote the rules using suffixes: _Rapanos for Rapanos, _CWR for the Clean Water Rule, and _NWPR for the Navigable Waters Protection Rule.
- Each of the grid, navigable waters, and Sackett predictions files has the same two columns:
  - pointid: the point identifier
  - probability_wotus: the WOTUS-ML model score (a number between 0 and 1).
- The training, validation, and test set predictions have the following columns:
  - pointid: same as above
  - probability_wotus: same as above
  - predictions: the rounded WOTUS-ML model score (0 or 1)
  - labels: the WOTUS decision from the AJD (0 or 1)
  - preds_batch: the batch number (training and validation predictions only)
  - epoch: the epoch (training and validation predictions only).
- The file all_preds.shp, along with its auxiliary files, is a shapefile containing the combined grid predictions for all rules and grids. The columns of the shapefile’s attribute table are:
  - grid_cell: the grid cell number
  - process_or: the grid number (1 through 50)
  - lon: longitude
  - lat: latitude
  - prediction_id; pointid: the prediction pointid
  - Rapanos_pr, CWR_prob, NWPR_prob: the WOTUS-ML score for Rapanos, CWR, and NWPR, respectively.
  - district: the Army Corps of Engineers (ACE) district, using the district abbreviation.
- In addition, the training and validation predictions have columns ajd_predictions, the rounded WOTUS-ML model score (0 or 1) and
ajd_model_predictions: predictions from AJD-ML. Files include train_preds.csv, val_preds.csv, and test_set_predictions.csv. These contain predictions for the training, validation, and test set, respectively. Variables in each file include:
- pointid: the point identifier.
- probability_ajd: the AJD-ML model score (number between 0 and 1).
- ajd_predictions: the rounded AJD-ML model score (0 or 1).
- ajd_labels: the label (0 or 1).
- epoch: the epoch for which the other values were calculated. Note that test_set_predictions.csv does not have this column, because prediction on the test set was done only once, after model training, using the best model.
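
The relationship between the score, rounded prediction, and label columns described above can be illustrated with a short sketch. This is not the authors' code: it assumes that the rounded prediction is 1 when the score is at or above 0.5 (the exact tie-breaking rule is an assumption), and the function name and the sample rows are hypothetical. It uses only the Python standard library and the column names probability_wotus and labels listed above.

```python
import csv
import io


def accuracy_from_predictions(csv_file, score_col="probability_wotus",
                              label_col="labels", threshold=0.5):
    """Return the share of rows whose rounded score equals the label.

    Assumption: a score at or above `threshold` rounds to 1.
    """
    reader = csv.DictReader(csv_file)
    n_rows = 0
    n_correct = 0
    for row in reader:
        predicted = 1 if float(row[score_col]) >= threshold else 0
        # Labels may be written as "1" or "1.0", so parse via float first.
        label = int(float(row[label_col]))
        n_correct += int(predicted == label)
        n_rows += 1
    if n_rows == 0:
        raise ValueError("the file contains no prediction rows")
    return n_correct / n_rows


# Synthetic rows for illustration only; these are not values from the dataset.
sample = io.StringIO(
    "pointid,probability_wotus,predictions,labels\n"
    "a,0.91,1,1\n"
    "b,0.12,0,0\n"
    "c,0.63,1,0\n"
    "d,0.40,0,0\n"
)
sample_accuracy = accuracy_from_predictions(sample)
```

To apply the same function to a real file, pass an open file handle, for example one of the test set prediction files in wotus_model_predictions. For the AJD-ML files, set score_col="probability_ajd" and label_col="ajd_labels".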

Input layers used for prediction

The files in this category are grid_1.zip, navigable_water_prediction_points.zip, Sackett_prediction_points.zip, and PJD_prediction_points.zip. Each of these is a zipped directory containing input layers that are fed into WOTUS-ML to produce predictions. All files are formatted as Nx512x512 arrays centered at the point of interest and saved as tiff files. Some files contain a single layer (N=1), but others contain up to 9 layers (N=9). The files are named according to their prediction ids, which are described in the metadata files (see the “Metadata” section below). For more details on the variables, see Table S2 in the Supplementary Materials. The subfolders are:

NAIP: 4x512x512 arrays containing the Red, Green, Blue and Near Infrared channels from National Agricultural Imagery Program (NAIP) imagery
NWI: 1x512x512 arrays containing wetland types from the National Wetlands Inventory (NWI). The mapping from wetland types to numbers is: estuarine and marine deepwater = 1; estuarine and marine wetland = 2; freshwater emergent wetland = 3; freshwater forested/shrub wetland = 4; freshwater pond = 5; lake = 6; riverine = 7; other = 8.
NHD: 5x512x512 arrays containing features from the National Hydrography Dataset (NHD), including Fcode (the water type feature, e.g. perennial stream or intermittent stream); Path length (the distance of the NHD flowline); stream order; high flow (the maximum flow value for this water segment); and low flow (the minimum flow value for this water segment).
DEM: 1x512x512 arrays containing elevation of the point above sea level, in meters, from the USGS 3-D Elevation Program’s digital elevation model (DEM).
PRISM: 9x512x512 arrays containing 30-year climate normals at the point from the Parameter-elevation Regressions on Independent Slopes Model (PRISM), including long-run averages of minimum, maximum, and mean temperature; mean dew point temperature; minimum and maximum vapor pressure deficit; clear sky and total solar radiation; and cloudiness.
PPT: 1x512x512 arrays containing average annual total precipitation at the point, from PRISM.
Ecoregions: 1x512x512 arrays containing information about the point’s ecoregion from the US EPA Level IV Ecoregions.
gNATSGO: 5x512x512 arrays containing soil information about the point from the gridded National Soil Survey Geographic Database (gNATSGO), including taxonomic class, hydric rating (whether the map unit is a “hydric soil”), flooding frequency (the annual probability of a flood event), ponding frequency (the number of times ponding occurs per year), and water table depth (the shallowest depth, in centimeters, to a wet soil layer).
NLCD: 1x512x512 arrays containing the point’s land cover class, taken from 20 land cover classes from the National Land Cover Database, including open water, ice/snow, four classes of developed land (open, low, medium, and high), barren, three forest classes (evergreen, deciduous, mixed), two scrub classes (dwarf, shrub), four herbaceous classes (grassland, sedge, moss, lichen), two agricultural classes (pasture/hay, cultivated), and two wetland classes (woody, emergent herbaceous).
ACE_districts: 1x512x512 arrays corresponding to which ACE district each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.
states: 1x512x512 arrays corresponding to which US state, as defined by the Topologically Integrated Geographic Encoding and Referencing System (TIGER)/Line State boundaries, each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.
rules: 1x512x512 arrays corresponding to each WOTUS rule. These are stored in subdirectories for each rule: CWR, NWPR, and Rapanos, and consist of arrays of a single value, with a value of 1 for Rapanos, a value of 2 for the CWR, and a value of 3 for NWPR.
dist_HQ: 1x512x512 arrays containing the distance, in meters, from the point to the headquarters of the ACE district the point is in.
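
As a compact reference, the layer layout listed above can be written down as a small lookup table. The sketch below only restates the channel counts and numeric codes given in this section; the names LAYER_CHANNELS, NWI_CODES, RULE_CODES, expected_shape, and has_expected_shape are hypothetical helpers, not part of the code repository.

```python
# Number of layers (N) in each subfolder, copied from the list above.
LAYER_CHANNELS = {
    "NAIP": 4,
    "NWI": 1,
    "NHD": 5,
    "DEM": 1,
    "PRISM": 9,
    "PPT": 1,
    "Ecoregions": 1,
    "gNATSGO": 5,
    "NLCD": 1,
    "ACE_districts": 1,
    "states": 1,
    "rules": 1,
    "dist_HQ": 1,
}

# Every array is N x 512 x 512.
TILE_SIZE = 512

# Numeric encoding of NWI wetland types, as stated in the NWI entry.
NWI_CODES = {
    1: "estuarine and marine deepwater",
    2: "estuarine and marine wetland",
    3: "freshwater emergent wetland",
    4: "freshwater forested/shrub wetland",
    5: "freshwater pond",
    6: "lake",
    7: "riverine",
    8: "other",
}

# Constant value stored in the "rules" layer for each WOTUS rule.
RULE_CODES = {"Rapanos": 1, "CWR": 2, "NWPR": 3}


def expected_shape(layer):
    """Return the (N, height, width) shape documented for a subfolder."""
    return (LAYER_CHANNELS[layer], TILE_SIZE, TILE_SIZE)


def has_expected_shape(layer, shape):
    """Check an array's shape tuple against the documented layout."""
    return tuple(shape) == expected_shape(layer)


# Total number of layers across all subfolders listed in this section.
total_channels = sum(LAYER_CHANNELS.values())
```

After reading a tiff file with the raster library of your choice, passing the resulting array's shape to has_expected_shape is a quick way to catch truncated or mislabeled files.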

In addition, grid_1.zip, navigable_water_prediction_points.zip, and PJD_prediction_points.zip contain information about points that should be dropped from the metadata because they are missing from the input layers or are corrupted files, which can occur if the requested area is in an ocean, Great Lake, or restricted military area. These files contain numpy arrays of the points to be dropped. For more details, see the prediction code under 4_dl_models in the code repository.
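
Once the IDs of the points to drop have been loaded, removing them from a metadata table is a simple filter. The sketch below is a minimal illustration, not the authors' prediction code: it assumes the IDs are already available as a Python iterable and that the metadata rows are dictionaries keyed by prediction_id. The function name and the sample values are hypothetical.

```python
def drop_flagged_points(metadata_rows, ids_to_drop, id_col="prediction_id"):
    """Return the metadata rows whose ID is not in ids_to_drop.

    IDs are compared as strings, because IDs read from a CSV file are
    strings while IDs stored in an array may be integers.
    """
    flagged = {str(point_id) for point_id in ids_to_drop}
    return [row for row in metadata_rows if str(row[id_col]) not in flagged]


# Placeholder rows for illustration only; these are not values from the dataset.
metadata = [
    {"prediction_id": "101", "state": "16", "district": "XXX"},
    {"prediction_id": "102", "state": "16", "district": "XXX"},
    {"prediction_id": "103", "state": "16", "district": "XXX"},
]
kept_rows = drop_flagged_points(metadata, [102])
kept_ids = [row["prediction_id"] for row in kept_rows]
```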

Auxiliary data

The contents of the auxiliary files include:

text_questions_data.zip:
- DeregulatedPointsRapanosToNWPR.csv: prediction points from the 4 million grid points which are deregulated between Rapanos and NWPR. Column names are the same as column names in all_preds.shp.
- prediction_point_metadata.csv: state, district, and distance to headquarters information for the 4 million grid points. Columns are:
  - prediction_id: the point identifier
  - state: the state FIPS code
  - district: the ACE district, using the district abbreviation.
  - distHQ: the distance, in meters, from the point to the headquarters of the ACE district the point is in.
- sample_grid_points.csv: the 4 million grid points. Columns are:
  - grid_cell: the grid cell number
  - process_or: the grid number (1 through 50)
  - lon: longitude
  - lat: latitude
  - prediction_id: the prediction pointid
- testSetWithAJDinfo.csv: the AJD test set with additional information merged in from the AJD database. Key columns are:
  - jdid: the id assigned by ACE
  - agency: the agency making the jurisdictional determination
  - projectid: the project id assigned by ACE
  - districtorregion: the ACE district
  - jdbasis: the WOTUS rule used as the basis for the determination
  - pdflink: a link to a pdf of the determination, if available
  - finalizeddate: date the determination was finalized
  - closuremethod: Whether the determination required a field visit or not
  - watname: the name of the water resource evaluated for the determination
  - resourcetypes: the short code describing the resource type
  - resourcetypedescription: a longer description of the resource type
  - wateroftheus: WOTUS decision (Yes or No)
  - cowardincode: the Cowardin code
  - cowardincategory: The Cowardin category
  - cowardindescription: the description of the Cowardin category
  - longitude: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
  - latitude: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
  - state: US state name
  - county: US county name
  - pointid: the point identifier
  - prob_cnn: the WOTUS-ML score
  - predictions: the rounded WOTUS-ML model score (0 or 1)
  - labels: the WOTUS decision (0 or 1)
  - group_all: indicator for an AJD decided under any rule
  - group_rapanos: indicator for an AJD decided under Rapanos
  - group_nwpr: indicator for an AJD decided under NWPR
  - group_cwr: indicator for an AJD decided under CWR
  - ajd_decision: the WOTUS decision (0 or 1)
  - accuracy: share of AJDs with rounded WOTUS-ML model score (0 or 1) equal to WOTUS decision (0 or 1)
  - sh_above_score_cutoffs: share of validation AJDs with WOTUS-ML score above each cutoff in score_cutoffs_hi.
  - accuracy_above_score_cutoffs: accuracy of WOTUS-ML for validation AJDs with WOTUS-ML score above each cutoff in score_cutoffs_hi. Used to graph accuracy curves
  - score_cutoffs_hi: cutoffs from 0.5-1.0 used by accuracy_above_score_cutoffs
  - sh_below_score_cutoffs: share of validation AJDs with WOTUS-ML score below each cutoff in score_cutoffs_lo.
  - accuracy_below_score_cutoffs: accuracy of WOTUS-ML for validation AJDs with WOTUS-ML score below each cutoff in score_cutoffs_lo. Used to graph accuracy curves
  - score_cutoffs_lo: cutoffs from 0.0-0.50 used by accuracy_below_score_cutoffs
  - xval: score cutoffs on x axis for accuracy curve
  - yval: share of validation AJDs with at least the accuracy in xval. Used to graph accuracy curve
ajd_point_intersections.csv:
- pointid: The AJD point identifier
- nwi: boolean; true if the point intersects any NWI polygon; false otherwise
- nwi_wetland_type: NA if nwi == False, otherwise a string describing the wetland type in NWI (estuarine and marine deepwater, estuarine and marine wetland, freshwater emergent wetland, freshwater forested/shrub wetland, freshwater pond, lake, riverine, and other)
- nhd: boolean; true if the point intersects any NHD polygon; false otherwise
- nhd_fcode: NA if nhd == False, otherwise the fcode from NHD
- navigable_water: boolean; true if the point is a navigable water as defined in SM section A.5; false otherwise
- navigable_water_and_nwi: boolean; true if navigable_water == True and nwi == True; false otherwise
- iclus_growth: boolean; true if the point is in an area defined by ICLUS to move from undeveloped to semi-developed, semi-developed to developed, or undeveloped to developed. See SM section A.4 for details
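
The accuracy-curve columns of testSetWithAJDinfo.csv (sh_above_score_cutoffs and accuracy_above_score_cutoffs) pair each score cutoff with the share of AJDs scoring above it and the model's accuracy on that subset. The sketch below shows one way such values could be computed from scores and labels. It is an illustration under stated assumptions, not the authors' implementation: "above" is taken to mean at or above the cutoff, predictions are rounded at 0.5, and the function name and sample values are hypothetical.

```python
def accuracy_above_cutoffs(scores, labels, cutoffs):
    """For each cutoff, return (cutoff, share, accuracy).

    share:    fraction of all points with score >= cutoff (assumed convention)
    accuracy: fraction of those points whose rounded score equals the label,
              or None when no point reaches the cutoff
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one score is required")

    results = []
    for cutoff in cutoffs:
        kept = [(s, y) for s, y in zip(scores, labels) if s >= cutoff]
        share = len(kept) / len(scores)
        if kept:
            n_correct = sum(1 for s, y in kept if (1 if s >= 0.5 else 0) == y)
            accuracy = n_correct / len(kept)
        else:
            accuracy = None
        results.append((cutoff, share, accuracy))
    return results


# Synthetic values for illustration only; these are not values from the dataset.
example_scores = [0.55, 0.70, 0.95, 0.20]
example_labels = [0, 1, 1, 0]
curve = accuracy_above_cutoffs(example_scores, example_labels, [0.5, 0.9])
```

The low-score counterparts (sh_below_score_cutoffs and accuracy_below_score_cutoffs) would follow the same pattern with the comparison reversed.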

Metadata

sackett_metadata.csv:
- id: The Sackett prediction id
- state: the state FIPS code
- district: the 3-letter ACE district abbreviation
- distHQ: the distance, in meters, of the point to the ACE district headquarters
prediction_point_metadata.csv:
- prediction_id: the prediction point id
- state: the state FIPS code
- district: the 3-letter ACE district abbreviation
- distHQ: the distance, in meters, of the point to the ACE district headquarters

Data for displays

The zip files table*_data.zip and figure*_data.zip contain the data necessary to replicate each of the figures and tables in the main text and SM, including files that do not appear elsewhere in this replication package. In some cases, files are duplicated from elsewhere in this repository, and in some cases the contents of two or more zip files are identical. Note that there are no zip files for figures that do not require any data, e.g. figure 1.

fig2A_data, figS6_data, figS9_data:
- test_set_predictions.csv: See description of wotus_model_predictions.zip above.
- AJD_jds_wet_dry_season_clean_v2.csv, jds202205312309.csv: See description of testSetWithAJDinfo.csv above.
- Note that these figures can also be replicated using testSetWithAJDinfo.csv only. Also note that all three zip files contain the same data.
fig2B_data:
- prediction_point_metadata.csv, sample_grid_points.csv, testSetWithAJDinfo.csv: See file descriptions above.
fig3_data:
- prediction_point_metadata.csv, sample_grid_points.csv: See file descriptions above.
- preds.csv: The 4 million predictions, combined into a single file. See the description of all_preds.shp above.
fig4_data:
- ID_shapefile_wetlands: a directory containing the shapefile and auxiliary files for Idaho wetlands from NWI.
- zoom_areas:
  - zoom_areas_naip: .tiff files containing NAIP imagery for each of the zoom areas in figure 4.
  - zoom_areas.shp: a shapefile and auxiliary files containing the polygons defining the zoom areas in figure 4.
- Sackett_sample_NAIP_tiles.csv: a file describing the geographic information for the Sackett prediction points. Key fields include:
  - prediction_id: the Sackett prediction ID
  - lat: the latitude coordinate of the point
  - lon: the longitude coordinate of the point
figS3_data:
- prediction_point_metadata.csv, sample_grid_points.csv, preds.csv: See file descriptions above.
figS4_data:
- nhdPlusRegionsCombined.shp: combination of all NHDPlusV2 regions from EPA’s NHD data
  - streamleve: stream level in NHD
- PRISM_ppt_30yr_normal_4kmM4_annual_asc.tif: tif of PRISM precipitation. Downloaded from https://prism.oregonstate.edu/normals/
- nlcd_2019_land_cover_l48_20210604.img: NLCD 2019 Land Cover. Downloaded from https://www.mrlc.gov/data?f%5B0%5D=category%3ALand%20Cover&f%5B1%5D=category%3Aland%20cover&f%5B2%5D=region%3Aconus&f%5B3%5D=year%3A2019
- USGSNAIPImagery.tif: NAIP Imagery
- NAIPmapping.qgz, NLCDmapping.qgz, PRISMmapping.qgz: QGIS projects used to create their respective maps
table1_data:
- sample_grid_points.csv, test_set_predictions_*.csv, testSetWithAJDinfo.csv: See file descriptions above.
- navigable_comids_wlatlon.txt:
  - comid: COMID (stream segment identifier from NHD)
  - latitude: latitude coordinate
  - longitude: longitude coordinate
  - gnis_name: stream name from the USGS Geographic Name Information System
tableS4_data, tableS6_data:
- AJD_jds202205312309_clean.csv, ajd_point_intersections.csv, AJD_jds_wet_dry_season_clean_v2.csv: See file descriptions above.
- pointid_resourcetype_crosswalk.csv:
  - pointid: the AJD pointid.
  - ai_cowardin: a 9-class categorization of Cowardin codes (see table S1)
  - cowardin_numeric: a numeric encoding of ai_cowardin
  - cowardin_simple: a 4-class categorization of Cowardin codes into wetland, stream, or other. Note this is not used in the paper.
  - ai_resourcetype: a 9-class categorization of resource types (see table S2)
  - resource_numeric: a numeric encoding of ai_resourcetype
tableS5_data:
- prediction_point_metadata.csv, sample_grid_points.csv, testSetWithAJDinfo.csv: See file descriptions above.
tableS7_data:
- prediction_point_metadata.csv, sample_grid_points.csv, testSetWithAJDinfo.csv: See file descriptions above.
- nhd_stats_AI_state.csv:
  - comid: COMID
  - long_comid: the COMID’s longitude
  - lat_comid: the COMID’s latitude
  - ftype: the NHD feature type
  - fcode: the NHD feature code
  - intephem: 1 if ephemeral, 0 otherwise
  - streamorder: Stream Order
  - lengthkm: Path length in km
  - STUSPS: 2-character USPS state code
- nhd_stream_miles_by_state.csv:
  - STUSPS: 2-character USPS state code
  - lengthmi: Total stream length, in miles
- nwi_acres_by_state.csv:
  - NAME: State name
  - STUSPS: 2-character USPS state code
  - STATEFP: State FIPS code
  - nwi_all_acres: Total NWI acres
  - nwi_wetland_acres: Total NWI acres in one of the wetland types
tableS8_data:
- PWS_Locations_HUC12_2022Q2.xlsx: list of all public water systems served by water sources within each HUC12
  - HUC_12: HUC12 region
  - PWSID: public water system id of systems served by the huc12
- WBD_HUC12.shp: shapefile and auxiliary files for the HUC 12 watershed boundary dataset
  - huc12: HUC12 region
- PredictionPointsByHuc12PWSIDNhdNwiPopulationServed.csv: spatial join of WOTUS-ML prediction points to HUC12 polygons (from WBD_HUC12), the public water systems served by said HUC12 (from PWS_Locations_HUC12_2022Q2) and the population served by each public water system (from sdwis_active_years).
  - prediction_id: the prediction point id
  - pwsid: public water system id
  - population_served: population served by the pwsid
  - dereg: indicator if WOTUS-ML predicts the prediction point is regulated under Rapanos, but not regulated under NWPR
  - Rapanos_prob, CWR_prob, NWPR_prob: the WOTUS-ML score for Rapanos, CWR, and NWPR, respectively.
  - Rapanos_prediction, CWR_prediction, NWPR_prediction: the rounded WOTUS-ML score for Rapanos, CWR, and NWPR, respectively.
  - nwi: boolean; true if the point intersects any NWI polygon; false otherwise
  - nwi_wetland_type: NA if nwi == False, otherwise a string describing the wetland type in NWI (estuarine and marine deepwater, estuarine and marine wetland, freshwater emergent wetland, freshwater forested/shrub wetland, freshwater pond, lake, riverine, and other)
  - nhd: boolean; true if the point intersects any NHD polygon; false otherwise
  - nhd_fcode: NA if nhd == False, otherwise the fcode from NHD
- sdwis_active_years.dta: list of public water systems active in the Environmental Protection Agency’s SDWIS database in each year.
  - pwsid: public water system id
  - pws_type_code: public water system type (community water system - CWS; non-transient non-community water system - NTNCWS; transient non-community water system - TNCWS)
  - active: indicator; 1 if this pwsid was active in this year
  - year: calendar year
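
The dereg indicator in PredictionPointsByHuc12PWSIDNhdNwiPopulationServed.csv is described above as flagging points that WOTUS-ML predicts are regulated under Rapanos but not under NWPR. A minimal sketch of that logic, using the rounded prediction columns, is shown below. It assumes the rounded predictions are coded 1 for regulated and 0 for not regulated and that dereg is coded 1 when the condition holds; the function names and sample rows are hypothetical, and this is not the authors' code.

```python
def is_deregulated(rapanos_prediction, nwpr_prediction):
    """Return 1 if regulated under Rapanos (1) but not under NWPR (0), else 0."""
    return int(rapanos_prediction == 1 and nwpr_prediction == 0)


def share_deregulated(rows):
    """Share of rows flagged as deregulated between Rapanos and NWPR.

    Each row is a dict with Rapanos_prediction and NWPR_prediction values,
    which may be strings such as "1" or "1.0" when read from a CSV file.
    """
    if not rows:
        raise ValueError("at least one row is required")
    flags = [
        is_deregulated(int(float(row["Rapanos_prediction"])),
                       int(float(row["NWPR_prediction"])))
        for row in rows
    ]
    return sum(flags) / len(flags)


# Synthetic rows for illustration only; these are not values from the dataset.
example_rows = [
    {"prediction_id": "1", "Rapanos_prediction": "1", "NWPR_prediction": "0"},
    {"prediction_id": "2", "Rapanos_prediction": "1", "NWPR_prediction": "1"},
    {"prediction_id": "3", "Rapanos_prediction": "0", "NWPR_prediction": "0"},
    {"prediction_id": "4", "Rapanos_prediction": "1.0", "NWPR_prediction": "0.0"},
]
example_share = share_deregulated(example_rows)
```

Because a prediction point can appear once per public water system in this file, the rows should be deduplicated by prediction_id before computing shares over points.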