Training data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates
Data files
Dec 12, 2023 version files, 162.98 GB total:
- ajd_model.pth.tar (135.27 MB)
- ajd_train_test_split.csv (75.54 MB)
- cowardin_code_model.pth.tar (135.43 MB)
- NAIP_tiles.zip (16.11 MB)
- raw_ajd_data.zip (11.54 MB)
- raw_resourcetype_cowardincode_data.zip (5.64 MB)
- README.md (12.47 KB)
- resource_type_model.pth.tar (135.43 MB)
- test_data.zip (17.77 GB)
- train_test_split_naip_only.csv (9.17 MB)
- train_val_data.zip (144.55 GB)
- wotus_model.pth.tar (135.39 MB)
Abstract
We assess which waters the Clean Water Act protects and how Supreme Court and White House rules change this regulation. We train a deep learning model using aerial imagery and geophysical data to predict 150,000 jurisdictional determinations from the Army Corps of Engineers, each deciding regulation for one water resource. Under a 2006 Supreme Court ruling, the Clean Water Act protects two-thirds of US streams and over half of wetlands; under a 2020 White House rule, it protects under half of streams and a fourth of wetlands, implying deregulation of 690,000 stream miles, 35 million wetland acres, and 30% of waters around drinking water sources. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.
https://doi.org/10.5061/dryad.m63xsj47s
This dataset contains data used to train the models in Greenhill et al. (2023). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to the full set of pre-processed inputs for model training via this repository. We are also providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper, in another Dryad repository: https://doi.org/10.5061/dryad.z34tmpgm7. All code written for the project is available at https://doi.org/10.5281/zenodo.10108709.
Description of the data and file structure
The files here include:
- Trained models, saved in PyTorch checkpoint format: wotus_model.pth.tar, resource_type_model.pth.tar, cowardin_code_model.pth.tar, and ajd_model.pth.tar.
- Train/test splits and inputs to their creation:
  - train_test_split_naip_only.csv: the split used for the Waters of the United States (WOTUS) model
  - ajd_train_test_split.csv: the split used for the Approved Jurisdictional Determination (AJD) model
  - NAIP_tiles.zip: a shapefile of the geographic footprints of the imagery tiles, which are used to group overlapping footprints
- Raw data: raw_ajd_data.zip and raw_resourcetype_cowardincode_data.zip.
- Processed input data, including all the layers used to train and evaluate the model. These have been normalized and augmented, and are saved in a compressed .npz format: train_val_data.zip and test_data.zip.
Description of file contents
Each set of files described in the bulleted list above is structured in a similar way and contains similar information. The contents of each category are described in more detail below, as appropriate. Unless otherwise noted, any N/A or null values represent missing values.
Trained models
These are saved PyTorch objects. To load a model, instantiate a ResNet-18 and then load the checkpoint using torch.load. For details, see the PyTorch documentation; for an example, see 4_dl_models/wotus/predict/predict_grid.py in the code repository.
Train test splits and inputs to their creation
- train_test_split_naip_only.csv:
  - pointid: the AJD point id
  - jdid: the id assigned by the Army Corps of Engineers (ACE)
  - projectid: project id assigned by ACE
  - split_group: the split assignment (train, test, or val)
  - ace_district: the ACE district of the pointid
  - rule: the WOTUS rule used to decide the AJD
  - wotus: the WOTUS decision (Yes or No)
- ajd_train_test_split.csv:
  - id: the point id. The format of this field is ajd_XXXXXX if the point is drawn from the AJD dataset and pred_XXXXXX if the point is drawn from the 4 million prediction points
  - pointid: the AJD point id if the point is from the AJD dataset, empty otherwise
  - projectid: the AJD project id if the point is from the AJD dataset, empty otherwise
  - ajd: 1 if the point is from the AJD dataset, 0 if it is from the 4 million prediction points
  - prcss_r: the grid number (1 through 50) if the point is from the 4 million prediction points, empty otherwise
  - group_id: the id of the group of overlapping points
  - split_group: the split assignment (train, test, or val)
- NAIP_tiles.zip: a shapefile containing the geographic footprints of the NAIP tiles
Raw data files
raw_ajd_data.zip: a zip archive containing AJD data.
- jds202205312309_clean.csv; jds202204211420_clean.csv:
  - JD ID: the id assigned by ACE
  - Agency: the agency making the jurisdictional determination
  - Project ID: the project id assigned by ACE
  - District or Region: the ACE district
  - JD Basis: the WOTUS rule used as the basis for the determination
  - PDF Link: a link to a pdf of the determination, if available
  - Finalized Date: date the determination was finalized
  - Closure Method: whether the determination required a field visit or not
  - Waters Name: the name of the water resource evaluated for the determination
  - Resource Types: the short code describing the resource type
  - Resource Type Description: a longer description of the resource type
  - Water of the U.S.: WOTUS decision (Yes or No)
  - Cowardin Code: the Cowardin code
  - Cowardin Category: the Cowardin category
  - Cowardin Description: the description of the Cowardin category
  - Longitude: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
  - Latitude: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
  - State: US state name
  - County: US county name
- jds_wet_dry_season_clean_v2.csv:
  - jdid: the id assigned by ACE
  - agency: the agency making the jurisdictional determination
  - projectid: the project id assigned by ACE
  - districtorregion: the ACE district
  - jdbasis: the WOTUS rule used as the basis for the determination
  - pdflink: a link to a pdf of the determination, if available
  - finalizeddate: date the determination was finalized
  - closuremethod: whether the determination required a field visit or not
  - watname: the name of the water resource evaluated for the determination
  - resourcetypes: the short code describing the resource type
  - resourcetypedescription: a longer description of the resource type
  - wateroftheus: WOTUS decision (Yes or No)
  - cowardincode: the Cowardin code
  - cowardincategory: the Cowardin category
  - cowardindescription: the description of the Cowardin category
  - longitude: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
  - latitude: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
  - state: US state name
  - county: US county name
raw_resourcetype_cowardincode_data.zip: a zip archive containing Cowardin code and resource type data for the AJDs.
- ajds_with_resourceTypes_for_multiTaskLearning.csv:
  - ai_cowardin: a 9-class categorization of Cowardin codes (see Table S1)
  - ai_resourceType: a 9-class categorization of resource types (see Table S2)
  - All other columns same as above.
- pointid_resourcetype_crosswalk.csv:
  - pointid: the AJD pointid
  - ai_cowardin: a 9-class categorization of Cowardin codes (see Table S1)
  - cowardin_numeric: a numeric encoding of ai_cowardin
  - cowardin_simple: a 4-class categorization of Cowardin codes into wetland, stream, or other. Note this is not used in the paper.
  - ai_resourcetype: a 9-class categorization of resource types (see Table S2)
  - resource_numeric: a numeric encoding of ai_resourcetype
Processed input data
The files in this category are train_val_data.zip (training and validation data) and test_data.zip (test data). Each is a zipped directory containing the input layers that are fed into WOTUS-ML. All files are numpy arrays saved in compressed .npz format. The file naming convention is {pointid}.npz, where pointid is the point id.
In addition, there is a dictionary saved as a pickle file, data_dict.p, containing metadata about the files. This dictionary is populated automatically by the code that creates the input layers (see 3_src/data.py in the code repository). The dictionary is reproduced below:
{'naip_ids': [0, 1, 2, 3],
'nwi_id': [4],
'nhd_ids': [5, 6, 7, 8, 9],
'dem_id': [10],
'ecoregion_id': [11],
'nlcd_id': [12],
'prism_ids': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
'gnatsgo_ids': [23, 24, 25, 26, 27],
'district_dummies_id': [28],
'rule_dummies_id': [29, 30, 31],
'hq_dist_id': [32],
'state_id': [33],
'augment': True,
'normalize': True}
The keys are the names of the input layers, and the values are lists of the indices of the layers in the numpy array. The augment and normalize keys indicate whether the data were augmented and normalized, respectively. Augmentation includes random rotation and flipping. Normalization uses the mean and standard deviation of each layer in the training data; the values used for normalization are stored in the folder 4_dl_models/wotus/train/layer_mean_sd in the code repository.
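As an illustrative sketch of how these index lists can slice named layers out of one saved array (the array layout assumed here is channels-first, (C, H, W); see 3_src/data.py in the code repository for the real layout):

```python
import numpy as np

def select_layers(x: np.ndarray, ids: list) -> np.ndarray:
    """Return the channels listed in ids from a (C, H, W) stacked-layer array."""
    return x[ids]

# Hypothetical usage with the dictionary reproduced above:
# arrays = np.load("{pointid}.npz")
# x = arrays[arrays.files[0]]                       # stacked input layers
# naip = select_layers(x, data_dict["naip_ids"])    # R, G, B, NIR channels
# dem = select_layers(x, data_dict["dem_id"])       # elevation layer
```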
More details about the layers are provided below. Also see SM Table S3.
- naip_ids: the Red, Green, Blue, and Near Infrared channels from National Agricultural Imagery Program (NAIP) imagery.
- nwi_id: wetland types from the National Wetlands Inventory (NWI). The mapping from wetland types to numbers is: estuarine and marine deepwater = 1; estuarine and marine wetland = 2; freshwater emergent wetland = 3; freshwater forested/shrub wetland = 4; freshwater pond = 5; lake = 6; riverine = 7; other = 8.
- nhd_ids: features from the National Hydrography Dataset (NHD), including Fcode (the water type feature, e.g. perennial stream or intermittent stream); path length (the distance of the NHD flowline); stream order; high flow (the maximum flow value for this water segment); and low flow (the minimum flow value for this water segment).
- dem_id: elevation of the point above sea level, in meters, from the USGS 3D Elevation Program's digital elevation model (DEM).
- ecoregion_id: information about the point's ecoregion from the US EPA Level IV Ecoregions.
- nlcd_id: the point's land cover class, taken from 20 land cover classes from the National Land Cover Database, including open water, ice/snow, four classes of developed land (open, low, medium, and high), barren, three forest classes (evergreen, deciduous, mixed), two scrub classes (dwarf, shrub), four herbaceous classes (grassland, sedge, moss, lichen), two agricultural classes (pasture/hay, cultivated), and two wetland classes (woody, emergent herbaceous).
- prism_ids: 30-year climate normals at the point from the Parameter-elevation Regressions on Independent Slopes Model (PRISM), including long-run averages of minimum, maximum, and mean temperature; mean dew point temperature; minimum and maximum vapor pressure deficit; clear sky and total solar radiation; cloudiness; and average annual total precipitation.
- gnatsgo_ids: soil information about the point from the gridded National Soil Survey Geographic Database (gNATSGO), including taxonomic class, hydric rating (whether the map unit is a "hydric soil"), flooding frequency (the annual probability of a flood event), ponding frequency (the number of times ponding occurs per year), and water table depth (the shallowest depth, in centimeters, to a wet soil layer).
- district_dummies_id: which ACE district each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.
- rule_dummies_id: each WOTUS rule. These are stored in subdirectories for each rule (CWR, NWPR, and Rapanos) and consist of arrays of a single value: 1 for Rapanos, 2 for the CWR, and 3 for NWPR.
- hq_dist_id: the distance, in meters, from the point to the headquarters of the ACE district the point is in.
- state_id: which US state, as defined by the Topologically Integrated Geographic Encoding and Referencing System (TIGER)/Line state boundaries, each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.