Prediction data from: Machine learning predicts which rivers, streams, and wetlands the Clean Water Act regulates
Data files
Dec 10, 2023 version files 260.32 GB
- ajd_model_predictions.zip (110.69 MB)
- ajd_point_intersections.csv (7.39 MB)
- cowardin_code_model_predictions.zip (434.86 MB)
- fig2A_data.zip (8.85 MB)
- fig2B_data.zip (118.43 MB)
- fig3_data.zip (343.52 MB)
- fig4_data.zip (1.11 GB)
- figS3_data.zip (8.25 GB)
- figS4_data.zip (8.42 GB)
- figS5_data.zip (6.53 MB)
- figS6_data.zip (9.01 MB)
- figS9_data.zip (9.01 MB)
- grid_1.zip (100.01 GB)
- navigable_water_prediction_points.zip (6.25 GB)
- PJD_prediction_points.zip (128.71 GB)
- prediction_point_metadata.csv (128.19 MB)
- prediction_points_to_drop.zip (282.73 KB)
- README.md (23.98 KB)
- resource_type_model_predictions.zip (434.49 MB)
- sackett_metadata.csv (131.88 KB)
- Sackett_prediction_points.zip (2.98 GB)
- table1_data.zip (82.09 MB)
- tableS4_data.zip (9.63 MB)
- tableS6_data.zip (9.63 MB)
- tableS7_data.zip (231.48 MB)
- tableS8_data.zip (1.97 GB)
- text_questions_data.zip (154.28 MB)
- wotus_model_predictions.zip (522.45 MB)
Abstract
We assess which waters the Clean Water Act protects and how Supreme Court and White House rules change this regulation. We train a deep learning model using aerial imagery and geophysical data to predict 150,000 jurisdictional determinations from the Army Corps of Engineers, each deciding regulation for one water resource. Under a 2006 Supreme Court ruling, the Clean Water Act protects two-thirds of US streams and over half of wetlands; under a 2020 White House rule, it protects under half of streams and a fourth of wetlands, implying deregulation of 690,000 stream miles, 35 million wetland acres, and 30% of waters around drinking water sources. Our framework can support permitting, policy design, and use of machine learning in regulatory implementation problems.
https://doi.org/10.5061/dryad.z34tmpgm7
This dataset contains data used to produce the predictions and other results reported in Greenhill et al. (2023). All data are publicly available and can be accessed either through Google Earth Engine or directly from the data providers, as described in Table S3 of the Supplementary Material. In addition, we are providing access to a subset of the data used for prediction, as well as all data needed for reproducing the results of the paper via this repository. We are also providing access to all data used to train the models in another Dryad repository: https://doi.org/10.5061/dryad.m63xsj47s. All code written for the project is available at https://doi.org/10.5281/zenodo.10108709.
Description of the data and file structure
The files here include:
- Model predictions, including for grid points, PJDs, Sackett points, and navigable waters: wotus_model_predictions.zip, resource_type_model_predictions.zip, cowardin_code_model_predictions.zip, ajd_model_predictions.zip
- Input layers used for prediction:
  - grid_1.zip: a random sample of approximately 80,000 of the prediction grid points
  - navigable_water_prediction_points.zip
  - Sackett_prediction_points.zip
  - PJD_prediction_points.zip
- Auxiliary data:
  - text_questions_data.zip: data used for producing in-text statistics
  - ajd_point_intersections.csv: intersections between AJDs and various other geophysical layers such as NWI polygons, NHD flowlines, urban growth areas, etc.
  - prediction_points_to_drop.zip: pickle objects containing IDs of missing and/or corrupted prediction points. These can occur when a layer is missing data for the requested prediction point.
  - prediction_point_metadata.csv, sackett_metadata.csv: files containing metadata about the 4 million grid points and the Sackett prediction points, respectively, including state, ACE district, and HQ distance. This information is used to create ordinal layers. Note that PJD_prediction_points.zip and navigable_water_prediction_points.zip contain analogous files, pjd_metadata.csv and nav_water_metadata, for those sets of prediction points.
- Data for creating displays: table*_data.zip, fig*_data.zip
Description of file contents
Each set of files described in the bulleted list above has contents that are structured in similar ways and contain similar information. The contents for each category are described in more detail below, as appropriate. Unless otherwise noted, any N/A or null values represent missing values.
Model predictions
The files in this category are wotus_model_predictions.zip, resource_type_model_predictions.zip, cowardin_code_model_predictions.zip, and ajd_model_predictions.zip. Each of these is a zipped directory containing the outputs (predictions) of one model.
The directories and the files in each of them are:
- wotus_model_predictions: predictions from WOTUS-ML. There are files for each of the 50 grids (e.g., grid1_predictions_Rapanos.csv); the points on traditional navigable waters (e.g., nav_water_predictions_Rapanos.csv); the 3,000 points near the Sackett property (e.g., Sackett_predictions_Rapanos.csv); the 101,000 preliminary jurisdictional determinations (PJDs); and the training, validation, and test sets of Approved Jurisdictional Determinations (AJDs). See SM section A.3 for details on the AJD and PJD data and SM section A.5 for details on the other prediction points. For each set of prediction points, there are three files, one for each of the three WOTUS rules we analyze. We denote the rules using suffixes: _Rapanos for Rapanos, _CWR for the Clean Water Rule, and _NWPR for the Navigable Waters Protection Rule.
  - Each of the grid, navigable waters, and Sackett predictions files has the same two columns:
    - pointid: the point identifier
    - probability_wotus: the WOTUS-ML model score (a number between 0 and 1)
  - The training, validation, and test set predictions have the following columns:
    - pointid: same as above
    - probability_wotus: same as above
    - predictions: the rounded WOTUS-ML model score (0 or 1)
    - labels: the WOTUS decision from the AJD (0 or 1)
    - preds_batch: the batch number (training and validation predictions only)
    - epoch: the epoch (training and validation predictions only)
  - In addition, the training and validation predictions have the column ajd_predictions, the rounded WOTUS-ML model score (0 or 1).
  - The file all_preds.shp, along with its auxiliary files, is a shapefile containing the combined grid predictions for all rules and grids. The columns of the shapefile’s attribute table are:
    - grid_cell: the grid cell number
    - process_or: the grid number (1 through 50)
    - lon: longitude
    - lat: latitude
    - prediction_id; pointid: the prediction point ID
    - Rapanos_pr, CWR_prob, NWPR_prob: the WOTUS-ML score for Rapanos, CWR, and NWPR, respectively
    - district: the Army Corps of Engineers (ACE) district, using the district abbreviation
- ajd_model_predictions: predictions from AJD-ML. Files include train_preds.csv, val_preds.csv, and test_set_predictions.csv. These contain predictions for the training, validation, and test sets, respectively. Variables in each file include:
  - pointid: the point identifier.
  - probability_ajd: the AJD-ML model score (a number between 0 and 1).
  - ajd_predictions: the rounded AJD-ML model score (0 or 1).
  - ajd_labels: the label (0 or 1).
  - epoch: the epoch for which the other values were calculated. Note that test_set_predictions.csv does not have this column, as prediction on the test set was done only after model training, using the best model.
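As a rough illustration of how the two-column prediction files and the rule suffixes described above might be used, the following sketch maps a file name to its WOTUS rule and computes the share of points whose score is at least 0.5. The sample rows are made up, the helper names are hypothetical, and the 0.5 threshold is an assumption about how scores are rounded; only the column names and suffixes come from the description above.

```python
import csv
import io

# Suffix -> rule name, following the naming convention described above.
RULE_SUFFIXES = {
    "_Rapanos": "Rapanos",
    "_CWR": "Clean Water Rule",
    "_NWPR": "Navigable Waters Protection Rule",
}


def rule_from_filename(filename):
    """Return the rule implied by a prediction file's suffix, or None."""
    stem = filename.rsplit(".", 1)[0]
    for suffix, rule in RULE_SUFFIXES.items():
        if stem.endswith(suffix):
            return rule
    return None


def share_predicted_regulated(csv_file, threshold=0.5):
    """Share of points with probability_wotus >= threshold (assumed cutoff)."""
    scores = [float(row["probability_wotus"]) for row in csv.DictReader(csv_file)]
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up rows in the documented two-column layout (not real predictions).
SAMPLE = "pointid,probability_wotus\n1,0.91\n2,0.12\n3,0.55\n4,0.40\n"

rule = rule_from_filename("grid1_predictions_Rapanos.csv")
share = share_predicted_regulated(io.StringIO(SAMPLE))
print(rule, share)
```

To run this on a real file, replace `io.StringIO(SAMPLE)` with an open file handle for one of the unzipped CSVs.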
Input layers used for prediction
The files in this category are grid_1.zip, navigable_water_prediction_points.zip, Sackett_prediction_points.zip, and PJD_prediction_points.zip. Each of these is a zipped directory containing input layers that are fed into WOTUS-ML to produce predictions. All files are formatted as Nx512x512 arrays centered at the point of interest and saved as tiff files. Some files contain a single layer (N=1), but others contain up to 9 layers (N=9). The files are named according to their prediction ids, which are described in the metadata files (see the “Metadata” section below). For more details on the variables, see Table S2 in the Supplementary Materials. The subfolders are:
- NAIP: 4x512x512 arrays containing the Red, Green, Blue, and Near Infrared channels from National Agriculture Imagery Program (NAIP) imagery.
- NWI: 1x512x512 arrays containing wetland types from the National Wetlands Inventory (NWI). The mapping from wetland types to numbers is: estuarine and marine deepwater = 1; estuarine and marine wetland = 2; freshwater emergent wetland = 3; freshwater forested/shrub wetland = 4; freshwater pond = 5; lake = 6; riverine = 7; other = 8.
- NHD: 5x512x512 arrays containing features from the National Hydrography Dataset (NHD), including Fcode (the water type feature, e.g. perennial stream or intermittent stream); path length (the distance of the NHD flowline); stream order; high flow (the maximum flow value for this water segment); and low flow (the minimum flow value for this water segment).
- DEM: 1x512x512 arrays containing elevation of the point above sea level, in meters, from the USGS 3-D Elevation Program’s digital elevation model (DEM).
- PRISM: 9x512x512 arrays containing 30-year climate normals at the point from the Parameter-elevation Regressions on Independent Slopes Model (PRISM), including long-run averages of minimum, maximum, and mean temperature; mean dew point temperature; minimum and maximum vapor pressure deficit; clear sky and total solar radiation; and cloudiness.
- PPT: 1x512x512 arrays containing average annual total precipitation at the point, from PRISM.
- Ecoregions: 1x512x512 arrays containing information about the point’s ecoregion from the US EPA Level IV Ecoregions.
- gNATSGO: 5x512x512 arrays containing soil information about the point from the gridded National Soil Survey Geographic Database (gNATSGO), including taxonomic class, hydric rating (whether the map unit is a “hydric soil”), flooding frequency (the annual probability of a flood event), ponding frequency (the number of times ponding occurs per year), and water table depth (the shallowest depth, in centimeters, to a wet soil layer).
- NLCD: 1x512x512 arrays containing the point’s land cover class, taken from 20 land cover classes from the National Land Cover Database, including open water, ice/snow, four classes of developed land (open, low, medium, and high), barren, three forest classes (evergreen, deciduous, mixed), two scrub classes (dwarf, shrub), four herbaceous classes (grassland, sedge, moss, lichen), two agricultural classes (pasture/hay, cultivated), and two wetland classes (woody, emergent herbaceous).
- ACE_districts: 1x512x512 arrays corresponding to which ACE district each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.
- states: 1x512x512 arrays corresponding to which US state, as defined by the Topologically Integrated Geographic Encoding and Referencing System (TIGER)/Line State boundaries, each point is located in. These are encoded numerically; see 2_data/inputs/2_process/create_district_and_prediction_ordinal_layers.R in the code repository for details.
- rules: 1x512x512 arrays corresponding to each WOTUS rule. These are stored in subdirectories for each rule (CWR, NWPR, and Rapanos) and consist of arrays of a single value, with a value of 1 for Rapanos, a value of 2 for the CWR, and a value of 3 for NWPR.
- dist_HQ: 1x512x512 arrays containing the distance, in meters, from the point to the headquarters of the ACE district the point is in.
In addition, grid_1.zip, navigable_water_prediction_points.zip, and PJD_prediction_points.zip contain information about points that should be dropped from the metadata because they are missing from the input layers or are corrupted files, which can occur if the requested area is in an ocean, Great Lake, or restricted military area. These files contain numpy arrays of the points to be dropped. For more details, see the prediction code under 4_dl_models in the code repository.
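The numeric encodings and layer counts listed for the input layers can be collected into lookup tables. The sketch below does so in plain Python; the dictionary and function names are hypothetical, and the constant rule layer is built as a nested list purely for illustration (the distributed files are tiff arrays, which would normally be read with a raster library).

```python
# NWI wetland-type codes, as listed in the NWI subfolder description.
NWI_CODES = {
    1: "estuarine and marine deepwater",
    2: "estuarine and marine wetland",
    3: "freshwater emergent wetland",
    4: "freshwater forested/shrub wetland",
    5: "freshwater pond",
    6: "lake",
    7: "riverine",
    8: "other",
}

# Value stored in every cell of a rule layer, as listed for the rules subfolder.
RULE_CODES = {"Rapanos": 1, "CWR": 2, "NWPR": 3}

# Number of layers (N) in each subfolder's Nx512x512 arrays.
LAYER_CHANNELS = {
    "NAIP": 4, "NWI": 1, "NHD": 5, "DEM": 1, "PRISM": 9, "PPT": 1,
    "Ecoregions": 1, "gNATSGO": 5, "NLCD": 1, "ACE_districts": 1,
    "states": 1, "rules": 1, "dist_HQ": 1,
}


def expected_shape(subfolder, size=512):
    """Shape an array from the given subfolder should have."""
    return (LAYER_CHANNELS[subfolder], size, size)


def constant_rule_layer(rule, size=512):
    """Illustrative size x size grid filled with the rule's numeric code."""
    return [[RULE_CODES[rule]] * size for _ in range(size)]


print(expected_shape("PRISM"))
print(NWI_CODES[7])
```

A check such as `array.shape == expected_shape("NHD")` could then flag files that do not match the documented layout.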
Auxiliary data
The contents of the auxiliary files include:
- text_questions_data.zip:
  - DeregulatedPointsRapanosToNWPR.csv: prediction points from the 4 million grid points that are deregulated between Rapanos and NWPR. Column names are the same as the column names in all_preds.shp.
  - prediction_point_metadata.csv: state, district, and distance to headquarters information for the 4 million grid points. Columns are:
    - prediction_id: the point identifier
    - state: the state FIPS code
    - district: the ACE district, using the district abbreviation
    - distHQ: the distance, in meters, from the point to the headquarters of the ACE district the point is in
  - sample_grid_points.csv: the 4 million grid points. Columns are:
    - grid_cell: the grid cell number
    - process_or: the grid number (1 through 50)
    - lon: longitude
    - lat: latitude
    - prediction_id: the prediction point ID
  - testSetWithAJDinfo.csv: the AJD test set with additional information merged in from the AJD database. Key columns are:
    - jdid: the id assigned by ACE
    - agency: the agency making the jurisdictional determination
    - projectid: the project id assigned by ACE
    - districtorregion: the ACE district
    - jdbasis: the WOTUS rule used as the basis for the determination
    - pdflink: a link to a pdf of the determination, if available
    - finalizeddate: the date the determination was finalized
    - closuremethod: whether the determination required a field visit or not
    - watname: the name of the water resource evaluated for the determination
    - resourcetypes: the short code describing the resource type
    - resourcetypedescription: a longer description of the resource type
    - wateroftheus: WOTUS decision (Yes or No)
    - cowardincode: the Cowardin code
    - cowardincategory: the Cowardin category
    - cowardindescription: the description of the Cowardin category
    - longitude: the longitude of the centroid of the water resource (see SM section A.4 for discussion)
    - latitude: the latitude of the centroid of the water resource (see SM section A.4 for discussion)
    - state: US state name
    - county: US county name
    - pointid: the point identifier
    - prob_cnn: the WOTUS-ML score
    - predictions: the rounded WOTUS-ML model score (0 or 1)
    - labels: the WOTUS decision (0 or 1)
    - group_all: indicator for an AJD decided under any rule
    - group_rapanos: indicator for an AJD decided under Rapanos
    - group_nwpr: indicator for an AJD decided under NWPR
    - group_cwr: indicator for an AJD decided under CWR
    - ajd_decision: the WOTUS decision (0 or 1)
    - accuracy: share of AJDs with rounded WOTUS-ML model score (0 or 1) equal to WOTUS decision (0 or 1)
    - sh_above_score_cutoffs: share of validation AJDs with WOTUS-ML score above each cutoff in score_cutoffs_hi
    - accuracy_above_score_cutoffs: accuracy of WOTUS-ML for validation AJDs with WOTUS-ML score above each cutoff in score_cutoffs_hi. Used to graph accuracy curves
    - score_cutoffs_hi: cutoffs from 0.5-1.0 used by accuracy_above_score_cutoffs
    - sh_below_score_cutoffs: share of validation AJDs with WOTUS-ML score below each cutoff in score_cutoffs_lo
    - accuracy_below_score_cutoffs: accuracy of WOTUS-ML for validation AJDs with WOTUS-ML score below each cutoff in score_cutoffs_lo. Used to graph accuracy curves
    - score_cutoffs_lo: cutoffs from 0.0-0.50 used by accuracy_below_score_cutoffs
    - xval: score cutoffs on x axis for accuracy curve
    - yval: share of validation AJDs with at least the accuracy in xval. Used to graph accuracy curve
- ajd_point_intersections.csv:
  - pointid: the AJD point identifier
  - nwi: boolean; true if the point intersects any NWI polygon; false otherwise
  - nwi_wetland_type: NA if nwi == False, otherwise a string describing the wetland type in NWI (estuarine and marine deepwater, estuarine and marine wetland, freshwater emergent wetland, freshwater forested/shrub wetland, freshwater pond, lake, riverine, and other)
  - nhd: boolean; true if the point intersects any NHD polygon; false otherwise
  - nhd_fcode: NA if nhd == False, otherwise the fcode from NHD
  - navigable_water: boolean; true if the point is a navigable water as defined in SM section A.5; false otherwise
  - navigable_water_and_nwi: boolean; true if navigable_water == True and nwi == True; false otherwise
  - iclus_growth: boolean; true if the point is in an area defined by ICLUS to move from undeveloped to semi-developed, semi-developed to developed, or undeveloped to developed. See SM section A.4 for details
Metadata
- sackett_metadata.csv:
  - id: the Sackett prediction id
  - state: the state FIPS code
  - district: the 3-letter ACE district abbreviation
  - distHQ: the distance, in meters, of the point to the ACE district headquarters
- prediction_point_metadata.csv:
  - prediction_id: the prediction point id
  - state: the state FIPS code
  - district: the 3-letter ACE district abbreviation
  - distHQ: the distance, in meters, of the point to the ACE district headquarters
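One way the metadata files might be combined with prediction files is a join on the point identifier, followed by a summary by ACE district. The sketch below assumes that pointid in a prediction file matches prediction_id in prediction_point_metadata.csv; that correspondence, the function name, and all sample values are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def mean_score_by_district(metadata_file, predictions_file):
    """Average probability_wotus per ACE district.

    Assumes predictions' `pointid` equals metadata's `prediction_id`.
    Points without a metadata row are skipped.
    """
    district_of = {
        row["prediction_id"]: row["district"]
        for row in csv.DictReader(metadata_file)
    }
    totals = defaultdict(lambda: [0.0, 0])  # district -> [sum of scores, count]
    for row in csv.DictReader(predictions_file):
        district = district_of.get(row["pointid"])
        if district is None:
            continue
        totals[district][0] += float(row["probability_wotus"])
        totals[district][1] += 1
    return {d: total / count for d, (total, count) in totals.items()}


# Made-up rows in the documented column layouts (values are illustrative).
METADATA = (
    "prediction_id,state,district,distHQ\n"
    "1,16,NWW,120000.0\n"
    "2,16,NWW,95000.0\n"
    "3,06,SPK,40000.0\n"
)
PREDICTIONS = "pointid,probability_wotus\n1,0.8\n2,0.4\n3,0.1\n4,0.9\n"

means = mean_score_by_district(io.StringIO(METADATA), io.StringIO(PREDICTIONS))
print(means)
```

For the full 4 million points, a dataframe library would likely be more practical than the csv module; the dictionary join is used here only to keep the sketch dependency-free.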
Data for displays
The zip files table*_data.zip and fig*_data.zip contain the data necessary to replicate each of the figures and tables in the main text and SM, and do not appear elsewhere in this replication package. In some cases, files are duplicated from elsewhere in this repository, and in some cases the contents of the zip files are identical. Note that there are no zip files corresponding to figures that do not require any data, e.g. figure 1.
- fig2A_data, figS6_data, figS9_data:
  - test_set_predictions.csv: See description of wotus_model_predictions.zip above.
  - AJD_jds_wet_dry_season_clean_v2.csv, jds202205312309.csv: See description of testSetWithAJDinfo.csv above.
  - Note that this figure can also be replicated using testSetWithAJDinfo.csv only. Also note that all three zip files contain the same data.
- fig2B_data:
  - prediction_point_metadata.csv, sample_grid_points.csv, testSetWithAJDinfo.csv: See file descriptions above.
- fig3_data:
  - prediction_point_metadata.csv, sample_grid_points.csv: See file descriptions above.
  - preds.csv: The 4 million predictions, combined into a single file. See the description of all_preds.shp above.
- fig4_data:
  - ID_shapefile_wetlands: a directory containing the shapefile and auxiliary files for Idaho wetlands from NWI.
  - zoom_areas:
    - zoom_areas_naip: .tiff files containing NAIP imagery for each of the zoom areas in figure 4.
    - zoom_areas.shp: a shapefile and auxiliary files containing the polygons defining the zoom areas in figure 4.
  - Sackett_sample_NAIP_tiles.csv: a file describing the geographic information for the Sackett prediction points. Key fields include:
    - prediction_id: the Sackett prediction ID
    - lat: the latitude coordinate of the point
    - lon: the longitude coordinate of the point
- figS3_data:
  - prediction_point_metadata.csv, sample_grid_points.csv, preds.csv: See file descriptions above.
- figS4_data:
  - nhdPlusRegionsCombined.shp: combination of all NHDPlusV2 regions from EPA’s NHD data
    - streamleve: stream level in NHD
  - PRISM_ppt_30yr_normal_4kmM4_annual_asc.tif: tif of PRISM precipitation. Downloaded from https://prism.oregonstate.edu/normals/
  - nlcd_2019_land_cover_l48_20210604.img: NLCD 2019 Land Cover. Downloaded from https://www.mrlc.gov/data?f%5B0%5D=category%3ALand%20Cover&f%5B1%5D=category%3Aland%20cover&f%5B2%5D=region%3Aconus&f%5B3%5D=year%3A2019
  - USGSNAIPImagery.tif: NAIP Imagery
  - NAIPmapping.qgz, NLCDmapping.qgz, PRISMmapping.qgz: QGIS projects used to create their respective maps
- table1_data:
  - sample_grid_points.csv, test_set_predictions_*.csv, testSetWithAJDinfo.csv: See file descriptions above.
  - navigable_comids_wlatlon.txt:
    - comid: COMID (stream segment identifier from NHD)
    - latitude: latitude coordinate
    - longitude: longitude coordinate
    - gnis_name: stream name from the USGS Geographic Names Information System
- tableS4_data, tableS6_data:
  - AJD_jds202205312309_clean.csv, ajd_point_intersections.csv, AJD_jds_wet_dry_season_clean_v2.csv: See file descriptions above.
  - pointid_resourcetype_crosswalk.csv:
    - pointid: the AJD pointid.
    - ai_cowardin: a 9-class categorization of Cowardin codes (see table S1)
    - cowardin_numeric: a numeric encoding of ai_cowardin
    - cowardin_simple: a 4-class categorization of Cowardin codes into wetland, stream, or other. Note this is not used in the paper.
    - ai_resourcetype: a 9-class categorization of resource types (see table S2)
    - resource_numeric: a numeric encoding of ai_resourcetype
- tableS5_data:
  - prediction_point_metadata.csv, sample_grid_points.csv, testSetWithAJDinfo.csv: See file descriptions above.
- tableS7_data:
  - prediction_point_metadata.csv, sample_grid_points.csv, testSetWithAJDinfo.csv: See file descriptions above.
  - nhd_stats_AI_state.csv:
    - comid: COMID
    - long_comid: the COMID’s longitude
    - lat_comid: the COMID’s latitude
    - ftype: the NHD feature type
    - fcode: the NHD feature code
    - intephem: 1 if ephemeral, 0 otherwise
    - streamorder: stream order
    - lengthkm: path length in km
    - STUSPS: FIPS state postal code
  - nhd_stream_miles_by_state.csv:
    - STUSPS: 2-character USPS state code
    - lengthmi: total stream length, in miles
  - nwi_acres_by_state.csv:
    - NAME: state name
    - STUSPS: 2-character USPS state code
    - STATEFP: state FIPS code
    - nwi_all_acres: total NWI acres
    - nwi_wetland_acres: total NWI acres in one of the wetland types
- tableS8_data:
  - PWS_Locations_HUC12_2022Q2.xlsx: list of all public water systems served by water sources within each HUC12
    - HUC_12: HUC12 region
    - PWSID: public water system id of systems served by the huc12
  - WBD_HUC12.shp: shapefile and auxiliary files for the HUC 12 watershed boundary dataset
    - huc12: HUC12 region
  - PredictionPointsByHuc12PWSIDNhdNwiPopulationServed.csv: spatial join of WOTUS-ML prediction points to HUC12 polygons (from WBD_HUC12), the public water systems served by said HUC12 (from PWS_Locations_HUC12_2022Q2), and the population served by each public water system (from sdwis_active_years).
    - prediction_id: the prediction point id
    - pwsid: public water system id
    - population_served: population served by the pwsid
    - dereg: indicator if WOTUS-ML predicts the prediction point is regulated under Rapanos, but not regulated under NWPR
    - Rapanos_prob, CWR_prob, NWPR_prob: the WOTUS-ML score for Rapanos, CWR, and NWPR, respectively
    - Rapanos_prediction, CWR_prediction, NWPR_prediction: the rounded WOTUS-ML score for Rapanos, CWR, and NWPR, respectively
    - nwi: boolean; true if the point intersects any NWI polygon; false otherwise
    - nwi_wetland_type: NA if nwi == False, otherwise a string describing the wetland type in NWI (estuarine and marine deepwater, estuarine and marine wetland, freshwater emergent wetland, freshwater forested/shrub wetland, freshwater pond, lake, riverine, and other)
    - nhd: boolean; true if the point intersects any NHD polygon; false otherwise
    - nhd_fcode: NA if nhd == False, otherwise the fcode from NHD
  - sdwis_active_years.dta: list of public water systems active in the Environmental Protection Agency’s SDWIS database in each year.
    - pwsid: public water system id
    - pws_type_code: public water system type (community water system - CWS; non-transient non-community water system - NTNCWS; transient non-community water system - TNCWS)
    - active: indicator; 1 if this pwsid was active in this year
    - year: calendar year
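The dereg indicator described for PredictionPointsByHuc12PWSIDNhdNwiPopulationServed.csv (regulated under Rapanos but not under NWPR) can be expressed directly in terms of the two scores. The following is a minimal sketch, assuming scores are rounded at 0.5; the function names and the example rows are hypothetical.

```python
def is_deregulated(rapanos_prob, nwpr_prob, threshold=0.5):
    """True when a point is predicted regulated under Rapanos but not under NWPR.

    The 0.5 rounding threshold is an assumption about how the scores are rounded.
    """
    return rapanos_prob >= threshold and nwpr_prob < threshold


def share_deregulated(rows, threshold=0.5):
    """Share of rows flagged as deregulated; rows are dicts with the *_prob columns."""
    if not rows:
        return 0.0
    flagged = sum(
        is_deregulated(row["Rapanos_prob"], row["NWPR_prob"], threshold)
        for row in rows
    )
    return flagged / len(rows)


# Made-up scores using the documented column names.
example_rows = [
    {"Rapanos_prob": 0.9, "NWPR_prob": 0.2},  # regulated -> not regulated
    {"Rapanos_prob": 0.9, "NWPR_prob": 0.8},  # regulated under both
    {"Rapanos_prob": 0.3, "NWPR_prob": 0.1},  # regulated under neither
    {"Rapanos_prob": 0.6, "NWPR_prob": 0.4},  # regulated -> not regulated
]

dereg_share = share_deregulated(example_rows)
print(dereg_share)
```

Because a prediction point can be joined to more than one public water system, population totals built on this flag would need care to avoid counting the same system or point more than once.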
This dataset contains model outputs that were analyzed to produce the main results of the paper.