Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates

Greenhill, Simon¹; Druckenmiller, Hannah²; Wang, Sherrie³; Keiser, David⁴; Girotto, Manuela¹; Moore, Jason⁵; Yamaguchi, Nobuhiro¹; Todeschini, Alberto¹; Shapiro, Joseph¹

Published Dec 12, 2023 on Dryad. https://doi.org/10.5061/dryad.m63xsj47s

Abstract

We assess which waters the Clean Water Act protects and how Supreme Court and White House rules change this regulation. We train a deep learning model using aerial imagery and geophysical data to predict 150,000 jurisdictional determinations from the Army Corps of Engineers, each deciding regulation for one water resource. Under a 2006 Supreme Court ruling, the Clean Water Act protects two-thirds of US streams and over half of wetlands; under a 2020 White House rule, it protects under half of streams and a fourth of wetlands, implying deregulation of 690,000 stream miles, 35 million wetland acres, and 30% of waters around drinking water sources. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.

This dataset contains data used to train the models in Greenhill et al. (2023). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to the full set of pre-processed inputs for model training via this repository. We are also providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper, in another Dryad repository: https://doi.org/10.5061/dryad.z34tmpgm7. All code written for the project is available at https://doi.org/10.5281/zenodo.10108709.

Description of the data and file structure

The files here include:

- Trained models, saved in PyTorch checkpoint format: wotus_model.pth.tar, resource_type_model.pth.tar, cowardin_code_model.pth.tar, ajd_model.pth.tar.
- Train/test splits and inputs to their creation:
  - train_test_split_naip_only.csv: the split used for the Waters of the United States (WOTUS) model
  - ajd_train_test_split.csv: the split used for the Approved Jurisdictional Determination (AJD) model
  - NAIP_tiles.zip: a shapefile of the geographic footprints of the imagery tiles, which are used to group overlapping footprints
- Raw data: raw_ajd_data.zip, raw_resourcetype_cowardincode_data.zip
- Processed input data, including all the layers used to train and evaluate the model. These have been normalized and augmented, and are saved in a compressed .npz format: train_val_data.zip and test_data.zip.

Description of file contents

Each set of files described in the bulleted list above has contents that are structured in similar ways and contain similar information. The contents for each category are described in more detail below, as appropriate. Unless otherwise noted, any N/A or null values represent missing values.

Trained models

These are saved PyTorch objects. To load the model, instantiate a ResNet-18, then load the checkpoint using torch.load. For details, see the PyTorch documentation. For an example, see 4_dl_models/wotus/predict/predict_grid.py in the code repository.

Train test splits and inputs to their creation

train_test_split_naip_only.csv:
- pointid: the AJD point id
- jdid: the id assigned by the Army Corps of Engineers (ACE)
- projectid: project id assigned by ACE
- split_group: the split assignment (train, test, or val)
- ace_district: the ACE district of the pointid
- rule: the WOTUS rule used to decide the AJD
- wotus: the WOTUS decision (Yes or No)
ajd_train_test_split.csv:
- id: the point id. The format of this field is ajd_XXXXXX if the point is drawn from the AJD dataset and pred_XXXXXX if the point is drawn from the 4 million prediction points
- pointid: the AJD point id if the point is from the AJD dataset, empty otherwise
- projectid: the AJD project id if the point is from the AJD dataset, empty otherwise
- ajd: 1 if the point is from the AJD dataset, 0 if it is from the 4 million prediction points
- prcss_r: the grid number (1 through 50) if the point is from the 4 million prediction points, empty otherwise
- group_id: the id of the group of overlapping points
- split_group: the split assignment (train, test, or val)
NAIP_tiles.zip: a shapefile containing the geographic footprints of the NAIP tiles
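
As a quick sanity check on either split file, the columns described above can be tallied with the standard library. The sketch below counts rows per split_group and lists any group_id whose points fall in more than one split. That every overlap group should stay within a single split is an assumption about the intent of group_id, not something stated here. The function names and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def tally_splits(rows):
    """Count rows per split_group and list group_ids that span several splits."""
    counts = Counter()
    splits_by_group = {}
    for row in rows:
        counts[row["split_group"]] += 1
        group = row.get("group_id")
        if group:  # only ajd_train_test_split.csv carries group_id
            splits_by_group.setdefault(group, set()).add(row["split_group"])
    leaky = sorted(g for g, s in splits_by_group.items() if len(s) > 1)
    return counts, leaky


def point_source(point_id):
    """Classify an `id` from ajd_train_test_split.csv by its documented prefix."""
    if point_id.startswith("ajd_"):
        return "ajd"
    if point_id.startswith("pred_"):
        return "prediction"
    raise ValueError(f"unexpected id format: {point_id!r}")


# Synthetic rows in the layout of ajd_train_test_split.csv (values are invented).
sample = io.StringIO(
    "id,pointid,projectid,ajd,prcss_r,group_id,split_group\n"
    "ajd_000001,1,P1,1,,g1,train\n"
    "pred_000002,,,0,7,g1,train\n"
    "ajd_000003,3,P2,1,,g2,test\n"
)
rows = list(csv.DictReader(sample))
counts, leaky = tally_splits(rows)
print(dict(counts), leaky)
```

For the real files, replace the StringIO object with open("ajd_train_test_split.csv", newline="").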

Raw data files

raw_ajd_data.zip: a zip archive containing AJD data.
- jds202205312309_clean.csv; jds202204211420_clean.csv:
  - JD ID: the id assigned by ACE
  - Agency: the agency making the jurisdictional determination
  - Project ID: the project id assigned by ACE
  - District or Region: the ACE district
  - JD Basis: the WOTUS rule used as the basis for the determination
  - PDF Link: a link to a pdf of the determination, if available
  - Finalized Date: date the determination was finalized
  - Closure Method: whether the determination required a field visit or not
  - Waters Name: the name of the water resource evaluated for the determination
  - Resource Types: the short code describing the resource type
  - Resource Type Description: a longer description of the resource type
  - Water of the U.S.: WOTUS decision (Yes or No)
  - Cowardin Code: the Cowardin code
  - Cowardin Category: the Cowardin category
  - Cowardin Description: the description of the Cowardin category
  - Longitude: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
  - Latitude: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
  - State: US state name
  - County: US county name
- jds_wet_dry_season_clean_v2.csv:
  - jdid: the id assigned by ACE
  - agency: the agency making the jurisdictional determination
  - projectid: the project id assigned by ACE
  - districtorregion: the ACE district
  - jdbasis: the WOTUS rule used as the basis for the determination
  - pdflink: a link to a pdf of the determination, if available
  - finalizeddate: date the determination was finalized
  - closuremethod: whether the determination required a field visit or not
  - watname: the name of the water resource evaluated for the determination
  - resourcetypes: the short code describing the resource type
  - resourcetypedescription: a longer description of the resource type
  - wateroftheus: WOTUS decision (Yes or No)
  - cowardincode: the Cowardin code
  - cowardincategory: the Cowardin category
  - cowardindescription: the description of the Cowardin category
  - longitude: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
  - latitude: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
  - state: US state name
  - county: US county name
raw_resourcetype_cowardincode_data.zip: a zip archive containing Cowardin code and resource type data for the AJDs.
- ajds_with_resourceTypes_for_multiTaskLearning.csv:
  - ai_cowardin: a 9-class categorization of Cowardin codes (see Table S1)
  - ai_resourceType: a 9-class categorization of resource types (see Table S2)
  - All other columns are the same as above.
- pointid_resourcetype_crosswalk.csv:
  - pointid: the AJD point id
  - ai_cowardin: a 9-class categorization of Cowardin codes (see Table S1)
  - cowardin_numeric: a numeric encoding of ai_cowardin
  - cowardin_simple: a 4-class categorization of Cowardin codes into wetland, stream, or other. Note that this is not used in the paper.
  - ai_resourcetype: a 9-class categorization of resource types (see Table S2)
  - resource_numeric: a numeric encoding of ai_resourcetype
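
The two column layouts in raw_ajd_data.zip carry the same fields under different header styles ("JD ID" versus jdid). A minimal sketch of one way to reconcile them is to lower-case each header and drop everything that is not a letter or digit. Among the headers listed above, only "Waters Name" does not reduce to its counterpart (watname) this way, so it gets an explicit override. The function names are hypothetical, and the mapping relies only on the column names documented here.

```python
import re

# The one documented header that does not reduce mechanically to its counterpart.
OVERRIDES = {"watersname": "watname"}


def harmonize(header):
    """Map a header such as 'Water of the U.S.' to the compact form 'wateroftheus'."""
    compact = re.sub(r"[^a-z0-9]", "", header.lower())
    return OVERRIDES.get(compact, compact)


def harmonize_row(row):
    """Rename the keys of one csv.DictReader row to the compact style."""
    return {harmonize(key): value for key, value in row.items()}


print(harmonize("District or Region"), harmonize("Waters Name"))
```

Applying harmonize_row to rows read from jds202205312309_clean.csv or jds202204211420_clean.csv should give keys that line up with those of jds_wet_dry_season_clean_v2.csv.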

Processed input data

The files in this category are train_val_data.zip (training and validation data) and test_data.zip (test data). Each of these is a zipped directory containing input layers that are fed into WOTUS-ML. All files are NumPy arrays saved as .npz files. The file naming convention is {pointid}.npz, where pointid is the point id.
In addition, there is a dictionary saved as a pickle file, data_dict.p, containing metadata about the files.
This dictionary is populated automatically by the code that creates the input layers (see 3_src/data.py in the code repository).
The dictionary is reproduced below:

{'naip_ids': [0, 1, 2, 3],
 'nwi_id': [4],
 'nhd_ids': [5, 6, 7, 8, 9],
 'dem_id': [10],
 'ecoregion_id': [11],
 'nlcd_id': [12],
 'prism_ids': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
 'gnatsgo_ids': [23, 24, 25, 26, 27],
 'district_dummies_id': [28],
 'rule_dummies_id': [29, 30, 31],
 'hq_dist_id': [32],
 'state_id': [33],
 'augment': True,
 'normalize': True}

The keys are the names of the input layers, and the values are lists of the indices of the layers in the numpy array. The augment and normalize keys indicate whether the data were augmented and normalized, respectively. Augmentation includes random rotation and flipping. Normalization is done using the mean and standard deviation of each layer in the training data. The values used for normalization are stored in the folder 4_dl_models/wotus/train/layer_mean_sd in the code repository.
More details about the layers are provided below. Also see SM Table S3.
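
To illustrate how the index lists in the dictionary above can be used, the sketch below slices one input stack into named layers. It is not taken from the project code. It assumes that each .npz file holds a single array, that the 34 layers lie along one axis (the first by default, adjustable through channel_axis), and that the array is the first entry in the archive. The function names split_layers and load_point are hypothetical.

```python
import numpy as np

# Index metadata copied from data_dict.p as reproduced above.
DATA_DICT = {
    "naip_ids": [0, 1, 2, 3], "nwi_id": [4], "nhd_ids": [5, 6, 7, 8, 9],
    "dem_id": [10], "ecoregion_id": [11], "nlcd_id": [12],
    "prism_ids": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
    "gnatsgo_ids": [23, 24, 25, 26, 27], "district_dummies_id": [28],
    "rule_dummies_id": [29, 30, 31], "hq_dist_id": [32], "state_id": [33],
    "augment": True, "normalize": True,
}


def split_layers(stack, data_dict=DATA_DICT, channel_axis=0):
    """Return {layer name: sub-array} by indexing the layer axis of one input stack."""
    layers = {}
    for key, indices in data_dict.items():
        if isinstance(indices, list):  # skips the boolean augment/normalize flags
            layers[key] = np.take(stack, indices, axis=channel_axis)
    return layers


def load_point(path, channel_axis=0):
    """Load one {pointid}.npz file and split it into named layers."""
    with np.load(path) as archive:
        # Assumption: the layer stack is the first (or only) array in the archive.
        stack = archive[archive.files[0]]
    return split_layers(stack, channel_axis=channel_axis)
```

In practice the dictionary would be read from data_dict.p with pickle.load rather than copied by hand. Printing the shape of a loaded array is a quick way to confirm which axis holds the layers.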

naip_ids: the red, green, blue, and near-infrared channels from National Agriculture Imagery Program (NAIP) imagery
nwi_id: wetland types from the National Wetlands Inventory (NWI). The mapping from wetland types to numbers is: estuarine and marine deepwater = 1; estuarine and marine wetland = 2; freshwater emergent wetland = 3; freshwater forested/shrub wetland = 4; freshwater pond = 5; lake = 6; riverine = 7; other = 8.
nhd_ids: features from the National Hydrography Dataset (NHD), including Fcode (the water type feature, e.g. perennial stream or intermittent stream); path length (the distance of the NHD flowline); stream order; high flow (the maximum flow value for this water segment); and low flow (the minimum flow value for this water segment).
dem_id: elevation of the point above sea level, in meters, from the USGS 3-D Elevation Program’s digital elevation model (DEM).
ecoregion_id: information about the point’s ecoregion from the US EPA Level IV Ecoregions.
nlcd_id: the point’s land cover class, taken from 20 land cover classes from the National Land Cover Database, including open water, ice/snow, four classes of developed land (open, low, medium, and high), barren, three forest classes (evergreen, deciduous, mixed), two scrub classes (dwarf, shrub), four herbaceous classes (grassland, sedge, moss, lichen), two agricultural classes (pasture/hay, cultivated), and two wetland classes (woody, emergent herbaceous).
prism_ids: 30-year climate normals at the point from the Parameter-elevation Regressions on Independent Slopes Model (PRISM), including long-run averages of minimum, maximum, and mean temperature; mean dew point temperature; minimum and maximum vapor pressure deficit; clear sky and total solar radiation; cloudiness; and average annual total precipitation.
gnatsgo_ids: soil information about the point from the gridded National Soil Survey Geographic Database (gNATSGO), including taxonomic class, hydric rating (whether the map unit is a “hydric soil”), flooding frequency (the annual probability of a flood event), ponding frequency (the number of times ponding occurs per year), and water table depth (the shallowest depth, in centimeters, to a wet soil layer).
district_dummies_id: which ACE district each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.
rule_dummies_id: each WOTUS rule. These are stored in subdirectories for each rule: CWR, NWPR, and Rapanos, and consist of arrays of a single value, with a value of 1 for Rapanos, a value of 2 for the CWR, and a value of 3 for NWPR.
hq_dist_id: the distance, in meters, from the point to the headquarters of the ACE district the point is in.
state_id: which US state, as defined by the Topologically Integrated Geographic Encoding and Referencing System (TIGER)/Line State boundaries, each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.
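
The numeric codes listed above for the NWI and rule layers can be kept as lookup tables. Because the processed arrays are normalized, a stored value has to be mapped back to its original scale before a lookup makes sense. The sketch below assumes a standard z-score, (value - mean) / sd, and assumes that categorical layers were scaled the same way as continuous ones. Neither point is confirmed here. The mean and sd used in the example are invented; the real values are in the layer_mean_sd folder of the code repository.

```python
# Code tables transcribed from the layer descriptions above.
NWI_CLASSES = {
    1: "estuarine and marine deepwater",
    2: "estuarine and marine wetland",
    3: "freshwater emergent wetland",
    4: "freshwater forested/shrub wetland",
    5: "freshwater pond",
    6: "lake",
    7: "riverine",
    8: "other",
}
RULE_CODES = {1: "Rapanos", 2: "CWR", 3: "NWPR"}


def standardize(value, mean, sd):
    """Assumed form of the normalization: a z-score against training-set statistics."""
    return (value - mean) / sd


def destandardize(z, mean, sd):
    """Invert the z-score to recover the value on its original scale."""
    return z * sd + mean


def decode(z, mean, sd, table):
    """Undo the z-score, round to the nearest integer code, and look it up."""
    return table[round(destandardize(z, mean, sd))]


# Hypothetical statistics, for illustration only.
example_mean, example_sd = 4.2, 1.7
z = standardize(3, example_mean, example_sd)
print(decode(z, example_mean, example_sd, NWI_CLASSES))
```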