Oates, Christopher¹; Grieger, Khara¹; Emanuel, Ryan²; Nelson, Natalie¹

Published Mar 27, 2025 on Dryad. https://doi.org/10.5061/dryad.7d7wm3858

In this study, we investigated whether water quality monitoring stations are proportionally distributed across communities of varying social vulnerability. We focus specifically on nutrient monitoring of surface waters in the South Atlantic-Gulf region, a water-rich area with diverse land uses and communities spanning the social vulnerability spectrum. We used 2018-2022 data from the U.S. Geological Survey (USGS) National Water Information System and the U.S. Environmental Protection Agency Storage and Retrieval database to compare station locations to census tract-scale metrics from the U.S. Centers for Disease Control and Prevention Social Vulnerability Index (SVI) and hydrography from the USGS. Statistical analyses revealed a significant disparity in the placement of active monitoring stations, with more stations in lower-vulnerability areas and fewer in highly vulnerable areas. Stations were also clustered in areas of similar SVI values; areas were less likely to be monitored if they were near areas of differing SVI.

Files and variables:

The zipped files contain all of the individual files required to open, project, and manipulate the shapefiles. Every shapefile (.shp) used in this study contains the geometry and attributes of geospatial features (e.g., points, lines, polylines, polygons). Each zipped bundle contains the main .shp file and its companion files: .cpg, .dbf, .prj, .qmd, and .shx.
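As a quick sanity check after downloading, the sketch below lists which of the companion extensions named above are missing from a zipped bundle. It is only an illustration: the function name `missing_components` is hypothetical, and the layer name "merged" inside the archive is an assumption, since this README does not state the internal file names.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Extensions this README lists for each shapefile bundle.
LISTED_EXTENSIONS = {".shp", ".cpg", ".dbf", ".prj", ".qmd", ".shx"}


def missing_components(zip_source, stem):
    """Return the listed extensions that are absent for the layer named `stem`."""
    with zipfile.ZipFile(zip_source) as archive:
        found = {
            PurePosixPath(name).suffix.lower()
            for name in archive.namelist()
            if PurePosixPath(name).stem == stem
        }
    return sorted(LISTED_EXTENSIONS - found)


# Demonstration with an in-memory archive standing in for a real download
# (assumed layer name "merged"; the real archive may use a different stem).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    for ext in (".shp", ".dbf", ".shx", ".prj"):
        archive.writestr(f"merged{ext}", b"")
demo.seek(0)

print(missing_components(demo, "merged"))  # ['.cpg', '.qmd']
```

For a real download, pass the path to the .zip file in place of the in-memory buffer.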

The main .shp file can be opened and analyzed in Python, R, and many other programming languages, as well as in open-source geospatial software such as QGIS, SAGA GIS, GRASS GIS, and GeoDa. These .shp files were the basis of much of this study's analysis.

merged.zip: contains data from both the Centers for Disease Control and Prevention Social Vulnerability Index (SVI) and a series of geospatial processes conducted in both Python and QGIS. This analysis uses the nationwide, census tract-scale 2022 version of the SVI; see https://svi.cdc.gov/map25/data/docs/SVI2022Documentation_ZCTA.pdf for data documentation and information on all column headers. The first set of columns (OBJECTID to FID) are all from the CDC SVI dataset. The remaining columns were created in the analysis (see oates_et_al_nature_water.ipynb).

svi_SAG_full.zip is the SVI data for all 9 states (Alabama, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, and Virginia) in the South Atlantic-Gulf region.

[state abbreviation]_counts.zip (e.g., FLcounts.zip, NCcounts.zip) are shapefiles of study area census tracts for each state in the South Atlantic-Gulf region (Louisiana and Tennessee were excluded since both states have so few monitoring stations inside of the South Atlantic-Gulf region). The shapefiles contain a column named "sts" that represents the number of active monitoring stations in each of that state's census tracts.

The aforementioned files all share the following column and variable information.

The following column headings are all original to the 2022 CDC/ATSDR Social Vulnerability Index. Detailed explanation of what each column denotes can be found at: https://svi.cdc.gov/map25/data/docs/SVI2022Documentation_ZCTA.pdf

OBJECTID: Object ID (not in documentation, database-specific identifier)
ST: State-level FIPS code
STATE: State name
ST_ABBR: State abbreviation
STCNTY: County-level FIPS code
COUNTY: County name
FIPS: Tract-level geographic identification code
LOCATION: Text description of tract, county, state
AREA_SQMI: Tract area in square miles
E_TOTPOP: Total population estimate
M_TOTPOP: Margin of error for total population estimate
E_HU: Estimate of total housing units
M_HU: Margin of error for total housing units
E_HH: Estimate of total households
M_HH: Margin of error for total households
E_POV150: Estimate of persons below 150% of the poverty line
M_POV150: Margin of error for persons below 150% of the poverty line
EP_POV150: Percentage of persons below 150% of the poverty line
MP_POV150: Margin of error for percentage of persons below 150% of the poverty line
E_UNEMP: Estimate of unemployed civilians age 16+
M_UNEMP: Margin of error for unemployed civilians age 16+
EP_UNEMP: Percentage of unemployed civilians age 16+
MP_UNEMP: Margin of error for percentage of unemployed civilians age 16+
E_HBURD: Estimate of cost-burdened housing units (income < $75k spending 30%+ on housing)
M_HBURD: Margin of error for cost-burdened housing units (income < $75k)
EP_HBURD: Percentage of housing cost-burdened households
MP_HBURD: Margin of error for housing cost-burdened households
E_NOHSDP: Estimate of persons age 25+ with no high school diploma
M_NOHSDP: Margin of error for persons with no high school diploma
EP_NOHSDP: Percentage of persons 25+ with no high school diploma
MP_NOHSDP: Margin of error for percentage of persons with no high school diploma
E_UNINSUR: Estimate of uninsured persons in civilian noninstitutionalized population
M_UNINSUR: Margin of error for uninsured estimate
EP_UNINSUR: Percentage of uninsured persons
MP_UNINSUR: Margin of error for percentage of uninsured persons
E_AGE65: Estimate of persons age 65 and older
M_AGE65: Margin of error for persons age 65 and older
EP_AGE65: Percentage of persons aged 65 and older
MP_AGE65: Margin of error for persons aged 65 and older
E_AGE17: Estimate of persons age 17 and younger
M_AGE17: Margin of error for persons age 17 and younger
EP_AGE17: Percentage of persons aged 17 and younger
MP_AGE17: Margin of error for persons aged 17 and younger
E_DISABL: Estimate of persons with a disability in civilian noninstitutionalized population
M_DISABL: Margin of error for persons with a disability
EP_DISABL: Percentage of persons with a disability
MP_DISABL: Margin of error for percentage of persons with a disability
E_SNGPNT: Estimate of single-parent households with children under 18
M_SNGPNT: Margin of error for single-parent households
EP_SNGPNT: Percentage of single-parent households with children
MP_SNGPNT: Margin of error for single-parent households
E_LIMENG: Estimate of persons age 5+ who speak English 'less than well'
M_LIMENG: Margin of error for limited English estimate
EP_LIMENG: Percentage of persons (5+) who speak English 'less than well'
MP_LIMENG: Margin of error for limited English proficiency
E_MINRTY: Estimate of racial/ethnic minority persons (non-White, non-Hispanic)
M_MINRTY: Margin of error for racial/ethnic minority estimate
EP_MINRTY: Percentage of minority population
MP_MINRTY: Margin of error for minority population
E_MUNIT: Estimate of housing units in structures with 10+ units
M_MUNIT: Margin of error for multi-unit housing estimate
EP_MUNIT: Percentage of housing in multi-unit structures
MP_MUNIT: Margin of error for multi-unit housing
E_MOBILE: Estimate of mobile homes
M_MOBILE: Margin of error for mobile homes
EP_MOBILE: Percentage of mobile homes
MP_MOBILE: Margin of error for mobile homes
E_CROWD: Estimate of crowded households (more people than rooms)
M_CROWD: Margin of error for crowded households
EP_CROWD: Percentage of crowded households
MP_CROWD: Margin of error for crowded households
E_NOVEH: Estimate of households with no vehicle available
M_NOVEH: Margin of error for households with no vehicle
EP_NOVEH: Percentage of households with no vehicle
MP_NOVEH: Margin of error for households with no vehicle
E_GROUPQ: Estimate of persons in group quarters
M_GROUPQ: Margin of error for persons in group quarters
EP_GROUPQ: Percentage of persons in group quarters
MP_GROUPQ: Margin of error for persons in group quarters
EPL_POV150: Percentile rank: percent below 150% poverty
EPL_UNEMP: Percentile rank: unemployment rate
EPL_HBURD: Percentile rank: housing cost burden
EPL_NOHSDP: Percentile rank: no high school diploma
EPL_UNINSU: Percentile rank: uninsured population
SPL_THEME1: Summed percentile of socioeconomic indicators
RPL_THEME1: Rank of socioeconomic vulnerability theme
EPL_AGE65: Percentile rank: age 65+ population
EPL_AGE17: Percentile rank: age 17 and younger population
EPL_DISABL: Percentile rank: population with disability
EPL_SNGPNT: Percentile rank: single-parent households
EPL_LIMENG: Percentile rank: limited English proficiency
SPL_THEME2: Summed percentile of household characteristics
RPL_THEME2: Rank of household characteristics theme
EPL_MINRTY: Percentile rank: minority population
SPL_THEME3: Summed percentile of minority status indicators
RPL_THEME3: Rank of racial and ethnic minority status theme
EPL_MUNIT: Percentile rank: housing in multi-unit structures
EPL_MOBILE: Percentile rank: mobile homes
EPL_CROWD: Percentile rank: crowded households
EPL_NOVEH: Percentile rank: households with no vehicle
EPL_GROUPQ: Percentile rank: persons in group quarters
SPL_THEME4: Summed percentile of housing type and transport
RPL_THEME4: Rank of housing and transportation theme
SPL_THEMES: Summed percentile of all themes
RPL_THEMES: Overall vulnerability rank
F_POV150: Flag for high vulnerability (90th percentile) in persons below 150% poverty
F_UNEMP: Flag for high vulnerability in unemployment rate
F_HBURD: Flag for high vulnerability in housing cost burden
F_NOHSDP: Flag for high vulnerability in education (no high school diploma)
F_UNINSUR: Flag for high vulnerability in uninsured population
F_THEME1: Sum of flags for socioeconomic status theme
F_AGE65: Flag for high vulnerability in population aged 65 and older
F_AGE17: Flag for high vulnerability in population aged 17 and younger
F_DISABL: Flag for high vulnerability in population with disability
F_SNGPNT: Flag for high vulnerability in single-parent households
F_LIMENG: Flag for high vulnerability in limited English proficiency
F_THEME2: Sum of flags for household characteristics theme
F_MINRTY: Flag for high vulnerability in minority population
F_THEME3: Sum of flags for racial and ethnic minority status theme
F_MUNIT: Flag for high vulnerability in multi-unit housing
F_MOBILE: Flag for high vulnerability in mobile homes
F_CROWD: Flag for high vulnerability in crowded households
F_NOVEH: Flag for high vulnerability in households with no vehicle
F_GROUPQ: Flag for high vulnerability in group quarters population
F_THEME4: Sum of flags for housing type and transportation theme
F_TOTAL: Total number of flags across all themes
E_DAYPOP: Estimated daytime population from LandScan 2021
E_NOINT: Estimate of households without internet subscription
M_NOINT: Margin of error for households without internet subscription
E_AFAM: Estimate of Black/African American, not Hispanic or Latino population
M_AFAM: Margin of error for Black/African American estimate
E_HISP: Estimate of Hispanic or Latino population
M_HISP: Margin of error for Hispanic or Latino estimate
E_ASIAN: Estimate of Asian, not Hispanic or Latino population
M_ASIAN: Margin of error for Asian population
E_AIAN: Estimate of American Indian/Alaska Native, not Hispanic or Latino
M_AIAN: Margin of error for American Indian/Alaska Native estimate
E_NHPI: Estimate of Native Hawaiian/Other Pacific Islander, not Hispanic or Latino
M_NHPI: Margin of error for NHPI estimate
E_TWOMORE: Estimate of two or more races, not Hispanic or Latino
M_TWOMORE: Margin of error for two or more races estimate
E_OTHERRAC: Estimate of some other race, not Hispanic or Latino
M_OTHERRAC: Margin of error for some other race estimate
EP_NOINT: Percentage of households without internet subscription
MP_NOINT: Margin of error for percentage without internet subscription
EP_AFAM: Percentage of Black/African American, not Hispanic or Latino
MP_AFAM: Margin of error for percentage of Black/African American
EP_HISP: Percentage of Hispanic or Latino population
MP_HISP: Margin of error for percentage of Hispanic or Latino
EP_ASIAN: Percentage of Asian, not Hispanic or Latino
MP_ASIAN: Margin of error for percentage of Asian
EP_AIAN: Percentage of American Indian/Alaska Native, not Hispanic or Latino
MP_AIAN: Margin of error for percentage of American Indian/Alaska Native
EP_NHPI: Percentage of Native Hawaiian/Other Pacific Islander, not Hispanic or Latino
MP_NHPI: Margin of error for percentage of NHPI
EP_TWOMORE: Percentage of two or more races, not Hispanic or Latino
MP_TWOMORE: Margin of error for percentage of two or more races
EP_OTHERRA: Percentage of some other race, not Hispanic or Latino
MP_OTHERRA: Margin of error for percentage of some other race
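The column names above follow a prefix convention (E_, M_, EP_, MP_, EPL_, SPL_, RPL_, F_). The following sketch decodes a column name into that prefix meaning and the variable stem. The helper name `split_svi_column` and the wording of the descriptions are illustrative assumptions drawn from the list above, not part of the CDC documentation.

```python
# Prefix meanings paraphrased from the column list in this README.
PREFIXES = [
    ("EPL_", "percentile rank"),
    ("SPL_", "summed percentile"),
    ("RPL_", "theme rank"),
    ("EP_", "percentage"),
    ("MP_", "margin of error for percentage"),
    ("E_", "estimate"),
    ("M_", "margin of error for estimate"),
    ("F_", "high-vulnerability flag (or flag count for themes)"),
]


def split_svi_column(name):
    """Return (prefix meaning, variable stem); meaning is None if no prefix matches."""
    for prefix, meaning in PREFIXES:
        if name.startswith(prefix):
            return meaning, name[len(prefix):]
    return None, name


for column in ("E_POV150", "MP_POV150", "EPL_POV150", "RPL_THEMES", "COUNTY"):
    print(column, "->", split_svi_column(column))
```

Identifier columns such as COUNTY or FIPS carry no prefix and are returned unchanged.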

The following columns were created at varying stages of our analysis:

Shape_Leng: length of waterways and water body perimeters in degrees
Shape_Area: area of census tracts in square degrees
FID: unique tract identifier
SVI_characterization: SVI decile characterization
flowline_length_km: length of waterways and water body perimeters in kilometers
area_sqkm: area of census tracts in square kilometers
pop_den_sqkm: population density of census tracts in persons per square kilometer
FLLR_tract: flowline length ratio of tract in kilometers per square kilometer
sts: the number of water quality monitoring stations in a tract
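The units listed for pop_den_sqkm and FLLR_tract suggest that both are ratios over tract area. The sketch below shows that arithmetic under the assumption that population density is E_TOTPOP divided by area_sqkm and that the flowline length ratio is flowline_length_km divided by area_sqkm. The function name `tract_ratios` and the input values are hypothetical; the authoritative derivation is in oates_et_al_nature_water.ipynb.

```python
def tract_ratios(total_pop, flowline_length_km, area_sqkm):
    """Compute per-area ratios for one census tract (assumed definitions)."""
    if area_sqkm <= 0:
        raise ValueError("area_sqkm must be positive")
    return {
        # persons per square kilometer
        "pop_den_sqkm": total_pop / area_sqkm,
        # kilometers of flowline per square kilometer of tract
        "FLLR_tract": flowline_length_km / area_sqkm,
    }


# Made-up tract: 4,200 people, 18 km of flowlines, 12 square kilometers.
example = tract_ratios(total_pop=4200, flowline_length_km=18.0, area_sqkm=12.0)
print(example)  # {'pop_den_sqkm': 350.0, 'FLLR_tract': 1.5}
```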

monitored.csv and unmonitored.csv are subset from the merged.zip dataset and include lists of all monitored and unmonitored study area census tracts, respectively. Monitored tracts have at least one active nutrient monitoring station, and unmonitored tracts have 0 stations. Since they are both subsets of merged.zip, they share all of the column and variable information of merged.zip.
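The monitored/unmonitored split described above can be reproduced from the "sts" column alone. This is a minimal sketch using only the standard library; the function name `split_by_monitoring` and the FIPS values in the sample are placeholders, and the real files carry many more columns.

```python
import csv
import io


def split_by_monitoring(rows):
    """Partition tract records into (monitored, unmonitored) using the sts count."""
    monitored, unmonitored = [], []
    for row in rows:
        # sts may be stored as "3" or "3.0" depending on the export.
        n_stations = int(float(row["sts"]))
        (monitored if n_stations >= 1 else unmonitored).append(row)
    return monitored, unmonitored


# Placeholder records standing in for rows of the merged dataset.
sample = io.StringIO("FIPS,sts\n00000000001,3\n00000000002,0\n00000000003,1.0\n")
monitored, unmonitored = split_by_monitoring(csv.DictReader(sample))
print(len(monitored), len(unmonitored))  # 2 1
```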

SAGR_stations_18_22.csv contains all of the water quality monitoring stations in the contiguous United States and Washington D.C. that recorded at least twenty concentration observations from at least twenty sampling activities between January 1st, 2018, and December 31st, 2022; this approximates seasonal data collection (i.e., four observations per year). The column headings are original to the EPA Water Quality Portal (https://www.waterqualitydata.us/) download. See https://www.waterqualitydata.us/portal_userguide/ for additional metadata and column descriptions.
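The station inclusion rule (at least twenty observations from at least twenty sampling activities within 2018-2022) can be sketched as a simple count per station. The record layout of (station, activity, date) tuples, the function name `qualifying_stations`, and the synthetic data are assumptions for illustration; they are not the Water Quality Portal schema or the authors' code.

```python
from collections import defaultdict
from datetime import date

START, END = date(2018, 1, 1), date(2022, 12, 31)
MIN_COUNT = 20  # threshold for both observations and distinct activities


def qualifying_stations(observations):
    """observations: iterable of (station_id, activity_id, sample_date) tuples."""
    n_obs = defaultdict(int)
    activities = defaultdict(set)
    for station_id, activity_id, sample_date in observations:
        if START <= sample_date <= END:
            n_obs[station_id] += 1
            activities[station_id].add(activity_id)
    return sorted(
        s for s in n_obs
        if n_obs[s] >= MIN_COUNT and len(activities[s]) >= MIN_COUNT
    )


# Synthetic records: A meets both thresholds, B has too few observations,
# and C has enough observations but all from a single sampling activity.
obs = [("A", f"act-{i}", date(2018 + i % 5, 1 + i % 12, 1)) for i in range(20)]
obs += [("B", f"act-{i}", date(2019, 1 + i % 12, 1)) for i in range(19)]
obs += [("C", "act-0", date(2020, 6, 1))] * 25

print(qualifying_stations(obs))  # ['A']
```

Twenty observations over five years works out to four per year, which is the seasonal-sampling approximation mentioned above.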

The column and variable definitions and descriptions are as follows:

OrganizationIdentifier: A code used to uniquely identify a specific organization or business.
OrganizationFormalName: The official legal name of the organization.
MonitoringLocationIdentifier: A unique code or name used to identify a sampling location.
MonitoringLocationName: The name given by the organization for the place where they collect data.
MonitoringLocationTypeName: A description of the kind of place being monitored (like a stream, well, etc.).
MonitoringLocationDescriptionText: A written description of the sampling location.
HUCEightDigitCode: An 8-digit code that identifies the watershed or hydrologic unit the site is in.
DrainageAreaMeasure.MeasureValue: The size of the land area that drains to the location, in specific units.
DrainageAreaMeasure.MeasureUnitCode: The unit used to measure the drainage area (like square kilometers or miles).
ContributingDrainageAreaMeasure.MeasureValue: The part of the drainage area that actually contributes flow to the location.
ContributingDrainageAreaMeasure.MeasureUnitCode: The unit used for that contributing drainage area.
LatitudeMeasure: How far north or south the site is from the equator.
LongitudeMeasure: How far east or west the site is from the prime meridian.
SourceMapScaleNumeric: The scale of the map used to determine the coordinates (like 1:24,000).
HorizontalAccuracyMeasure.MeasureValue: How accurate the horizontal (latitude/longitude) location is, in a specific unit.
HorizontalAccuracyMeasure.MeasureUnitCode: The unit used to measure the horizontal accuracy.
HorizontalCollectionMethodName: The method used to collect latitude and longitude (like GPS).
HorizontalCoordinateReferenceSystemDatumName: The coordinate system used to define the location (like NAD83 or WGS84).
VerticalMeasure.MeasureValue: How high or low the site is above sea level, in specific units.
VerticalMeasure.MeasureUnitCode: The unit used to measure the vertical height (like meters or feet).
VerticalAccuracyMeasure.MeasureValue: How accurate the vertical elevation is.
VerticalAccuracyMeasure.MeasureUnitCode: The unit for measuring that vertical accuracy.
VerticalCollectionMethodName: The method used to collect the elevation.
VerticalCoordinateReferenceSystemDatumName: The reference system used for elevation (like NAVD88).
CountryCode: A code representing the country (like "US").
StateCode: A code representing the U.S. state or territory.
CountyCode: A code representing the county.
AquiferName: The name of the underground layer of water (if it’s a well).
***LocalAqfrName: The local name of aquifers
FormationTypeText: The name of the main type of rock or soil where the well is completed.
AquiferTypeName: What kind of aquifer it is — like confined or unconfined.
ConstructionDateText: When the well was built (could just be the year).
WellDepthMeasure.MeasureValue: Total depth of the well from the surface.
WellDepthMeasure.MeasureUnitCode: The unit used to measure that well depth.
WellHoleDepthMeasure.MeasureValue: Depth of the drilled hole at the time of well completion.
WellHoleDepthMeasure.MeasureUnitCode: The unit used to measure that hole depth.
ProviderName: The name of the database that gave the data (like WQX or NWIS).

***LocalAqfrName is not listed at https://www.waterqualitydata.us/portal_userguide/, but it likely refers to the local or colloquial names of aquifers.

state_boundaries.zip contains the state boundaries for Alabama, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, and Virginia. The data are from the U.S. Census Bureau (https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html).

WBDHU2.zip is the USGS-defined South Atlantic-Gulf region's watershed boundary (https://apps.nationalmap.gov/downloader/).

Code/software:

All of the .zip and .csv files must be in the same file directory in order to fully run oates_et_al_nature_water.ipynb (Python Version 3.10.12 and QGIS version 3.32.2 Lima) and/or oates_et_al_nature_water.R (R Version 4.3.1).
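Before running the notebook, it may help to confirm that the data files sit together in one directory. The sketch below checks only the file names stated unambiguously in this README; the per-state count archives are left out because their exact names are not certain from the text. The function name `missing_files` is hypothetical.

```python
import tempfile
from pathlib import Path

# File names taken from this README (state count archives omitted).
EXPECTED = [
    "merged.zip",
    "svi_SAG_full.zip",
    "state_boundaries.zip",
    "WBDHU2.zip",
    "monitored.csv",
    "unmonitored.csv",
    "SAGR_stations_18_22.csv",
]


def missing_files(directory):
    """Return the expected file names that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in EXPECTED if not (directory / name).is_file()]


# Demonstration in a temporary directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "merged.zip").touch()
    print(missing_files(tmp))
```

In practice, call `missing_files` on the directory that holds the downloaded data and the notebook.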

In the Python script (.ipynb), the packages used are geopandas, pandas, numpy, matplotlib.pyplot, matplotlib.patches, matplotlib.cm, matplotlib.colors, matplotlib.lines, pysal, esda, folium, glob, sys, matplotlib, rasterio, libpysal, splot, adjustText, contextily, json, rtree, math, scipy.stats, seaborn, seaborn.objects, statsmodels.api, plotly.graph_objs, plotly.express, pointpats.quadrat_statistics, requests, and io.

In the R script (.R), the packages used are tidyverse, janitor, forcats, lubridate, dplyr, caret, readr, sf, ggplot2, gridExtra, gghalves, ggdist, patchwork, extrafont, and showtext.

Access and sharing information:

All geospatial data used in our analyses are freely available online from U.S. government agencies (U.S. Geological Survey, U.S. Centers for Disease Control & Prevention, and U.S. Environmental Protection Agency).

Dataset: Surface waters in socially vulnerable areas are disproportionately under-monitored for nutrients in the U.S. South Atlantic-Gulf Region
