Surface water sources, sales, and transfers for California community water systems
Data files
May 21, 2026 version files (383.48 KB)
- Data_dictionary.csv (34.98 KB)
- README.md (7.73 KB)
- Supplemental_data_methods.pdf (176.65 KB)
- SWsources_final_dryad_data.csv (164.12 KB)
Jun 01, 2026 version files (382.97 KB)
- Data_dictionary_V2.csv (34.98 KB)
- Final_dryad_data_V2.csv (161.84 KB)
- README.md (7.58 KB)
- Supplemental_data_methods_V2.pdf (178.56 KB)
Abstract
This dataset documents the surface water sources of California community water systems, including sales and transfers from other utilities or entities. The data cover 849 community water systems, plus 3 non-transient non-community water systems and 6 non-public water systems that provide wholesale water to one or more community water systems, together with the surface water sources of all these systems. Surface water sources include springs (regardless of whether the California State Water Resources Control Board classifies a given spring as groundwater or surface water), streams, rivers, canals, and lakes. They also include entities, such as local governments and independent special districts, that are not public water systems or wholesale drinking water providers but do sell water to community water systems.
Dataset DOI: 10.5061/dryad.z08kprrw2
Description of the data and file structure
This dataset documents the surface water sources of California community water systems, including sales and transfers from other utilities or entities. The data cover 849 community water systems, plus 3 non-transient non-community water systems and 6 non-public water systems that provide wholesale water to one or more community water systems, together with the surface water sources of all these systems. Surface water sources include springs (regardless of whether the California State Water Resources Control Board classifies a given spring as groundwater or surface water), streams, rivers, canals, and lakes. They also include entities, such as local governments and independent special districts, that are not public water systems or wholesale drinking water providers but do sell water to community water systems.
Files and variables
File: Data_dictionary_V2.csv
Description: Data dictionary
File: Supplemental_data_methods_V2.pdf
Description: Data methodology description
File: Final_dryad_data_V2.csv
Description: Data csv
Number of variables: 25
Number of rows (not counting column headers): 1096
Missing data codes: NA (see data dictionary for specific usage by variable)
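As a minimal sketch of how the NA missing-data code might be handled when reading the file, the Python snippet below maps every NA cell to None. The helper name read_rows and the two sample rows (including the pwsid and source_id values) are hypothetical and invented for illustration; the data dictionary remains the authority on what NA means for each variable.

```python
import csv
import io


def read_rows(handle):
    """Read CSV rows as dicts, mapping the dataset's NA code to None."""
    reader = csv.DictReader(handle)
    return [
        {key: (None if value == "NA" else value) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical two-row sample; these values are NOT taken from the dataset.
sample = io.StringIO(
    "pwsid,source_id,x2022_percent\n"
    "CA0000001,source_1,40\n"
    "CA0000001,source_2,NA\n"
)
rows = read_rows(sample)
```

For the real file, the same helper could be called on a handle opened with open("Final_dryad_data_V2.csv", newline=""). All values are returned as strings, so numeric columns would still need converting.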
Variables (See data dictionary and data methodology description for more information)
- system_name: Name of water system
- pwsid: Unique identifier for included water systems
- county: Principal county served
- provider_type: Whether water system provides retail or wholesale water or both
- is_wholesaler: Whether water system provides wholesale water
- state_classification: California state classification for water system type
- pop_served: Retail population served
- gw_access: Whether a system has direct groundwater access (not counting consecutive connections)
- gw_compliant_w_dw_standards: Whether a system with direct groundwater access has one or more wells in compliance with regulatory standards
- num_sw_sources: Number of surface water sources for this system
- source_name: Name of water source
- source_id: Unique identifier for sources
- source_type: Water type of source
- num_systems_served: Number of water systems served by this water system including itself
- connection_id: Unique identifier for system-source connection
- purchased: Whether source is secured from another water system or other supplier (e.g. county, irrigation district, USBR, DWR) via sale or contract
- x2020_ccr_percent: Percent of the water system's supply that this source represented according to the 2020 CCR
- x2021_percent: Percent of the water system's supply that this source represented according to the 2021 CCR
- x2020_uwmp_percent: Percent of the water system's supply that this source represented according to the 2020 UWMP
- x2022_percent: Percent of the water system's supply that this source represented according to the 2022 CCR
- x2023_percent: Percent of the water system's supply that this source represented according to the 2023 CCR
- average_source_usage: Average of the 2020-2023 usage columns; for systems without data, the value is imputed assuming equal usage of all surface water sources, plus one additional source if the system has groundwater access
- average_source_method: Whether average_source_usage value is calculated based on data or imputed
- swp_text: Whether source is associated with the State Water Project (SWP)
- cvp_text: Whether source is associated with the Central Valley Project (CVP)
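The average_source_usage rule described in the variable list can be sketched roughly as follows. This is one reading of that description, not the authors' code. It assumes values are percentages out of 100 and that missing years are skipped when averaging. The function name and the labels "calculated" and "imputed" are hypothetical stand-ins for whatever average_source_method actually contains. The supplemental methods document is the authoritative description.

```python
def average_source_usage(percents, num_sw_sources, gw_access):
    """Return (value, method) for one system-source row.

    percents: the yearly usage values for the row, with None for missing.
    num_sw_sources: number of surface water sources for the system.
    gw_access: True if the system has direct groundwater access.
    """
    reported = [p for p in percents if p is not None]
    if reported:
        # Assumption: average only the years that have a reported value.
        return sum(reported) / len(reported), "calculated"
    # No reported usage: split supply equally across all surface water
    # sources, counting groundwater as one extra source when present.
    n_sources = num_sw_sources + (1 if gw_access else 0)
    return 100.0 / n_sources, "imputed"


# Three surface sources plus groundwater access, no reported data.
imputed_example = average_source_usage([None] * 5, 3, True)
# Two reported years out of five.
calculated_example = average_source_usage([40, None, None, 60, None], 2, False)
```

Under these assumptions, a system with three surface water sources and groundwater access would be assigned an equal quarter share per source.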
Code/software
ChatGPT4.0 mini, R version 4.5.2, R Studio version 2025.09.0+387, Microsoft Excel version 16.101
Methods for processing data
The base dataset was compiled by triangulating between two primary sources: 1) water sources extracted from Consumer Confidence Reports and Urban Water Management Plans using ChatGPT4.0 mini; and 2) the 2025 SAFER clearinghouse dataset. Additional variables were then summarized from other sources by water system ID and joined to the base dataset. See included supplemental methods for further details.
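The summarize-then-join step mentioned above might look something like the sketch below. It collapses a many-rows-per-system table to one value per pwsid and attaches that value to every row of the base dataset. The helper names, the "compliant" field, the "has_compliant_well" column, and all sample values are hypothetical; the actual summaries and sources used are described in the supplemental methods.

```python
from collections import defaultdict


def summarize_by_system(records, value_key):
    """Collapse many records per pwsid into one flag: True if any is truthy."""
    summary = defaultdict(bool)
    for rec in records:
        summary[rec["pwsid"]] = summary[rec["pwsid"]] or bool(rec[value_key])
    return dict(summary)


def left_join(base_rows, summary, new_column):
    """Attach the per-system summary to every base row; None when absent."""
    return [{**row, new_column: summary.get(row["pwsid"])} for row in base_rows]


# Hypothetical inputs, invented for illustration only.
base = [
    {"pwsid": "A", "source_id": "s1"},
    {"pwsid": "A", "source_id": "s2"},
    {"pwsid": "B", "source_id": "s3"},
]
wells = [
    {"pwsid": "A", "compliant": False},
    {"pwsid": "A", "compliant": True},
]
joined = left_join(base, summarize_by_system(wells, "compliant"), "has_compliant_well")
```

A left join keeps every system-source row of the base dataset, so systems with no matching records in the secondary source end up with a missing value rather than being dropped.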
Access information
Data was derived from the following sources:
- CA SDWIS (Safe Drinking Water Information System): State Water Resources Control Board Drinking Water Watch dashboard (https://sdwis.waterboards.ca.gov/PDWW/). Query for all active public water systems. Retrieved November 18, 2025.
- Consumer Confidence Reports (CCRs): Annual Consumer Confidence Reports retrieved from California’s Public Drinking Water Watch dashboard (https://sdwis.waterboards.ca.gov/PDWW/). All available 2022 and 2023 reports for the initial system list were webscraped in September 2024. In February 2025, reports for 2020 and 2021 were collected manually, where available, for systems with neither a 2022 nor a 2023 report.
- EPA SDWIS (Safe Drinking Water Information System): U.S. Environmental Protection Agency SDWIS Federal Data Warehouse (SFDW) (https://ordspub.epa.gov/ords/sfdw_rest/f?p=108:9:::NO::P9_REPORT:VIO). Query for all active public water systems regulated by California in quarter 1 2025. Retrieved November 18, 2025.
- EDT library: EDT (Electronic Data Transfer) Library and Water Quality Analyses Data. (https://www.waterboards.ca.gov/drinking_water/certlic/drinkingwater/EDTlibrary.html) SDWIS4.tab (data from January 1, 2023 to August 19, 2025). Retrieved August 26, 2025.
- SAFER clearinghouse: 2025 SAFER clearinghouse dataset received directly from the State Water Resources Control Board Division of Drinking Water on April 24, 2025.
- Urban Water Management Plans (UWMPs): 2020 Urban Water Management Plans retrieved from the public WUEdata portal (https://wuedata.water.ca.gov/uwmp_plans.asp?cmd=2020). Manually collected between December 2024 and January 2025.
Data limitation
Users should be aware of two limitations affecting the accuracy of the dataset. First, ChatGPT (version 4.0mini) was used to extract information on water sources from 2022 and 2023 Consumer Confidence Reports and from 2020 Urban Water Management Plans. This approach was piloted for accuracy (>84%) before it was executed. After implementing these methods, we manually returned to the source documents for many water systems (>50% of the sample) to double-check details where information conflicted as we merged and compiled additional information from the various data sources. Nonetheless, we know that the AI methods are not without errors. Second, the vast majority of data sources used to derive this dataset are self-reported by water systems. The accuracy of these self-reported data can be affected by human error, formatting, and unit discrepancies, among other issues. Moreover, differing levels of detail provided by different water systems, particularly in narrative reports like Consumer Confidence Reports and Urban Water Management Plans, can affect the accuracy and comparability of the information between systems. See supplemental methods for more details on the data compilation methodology and limitations.
Changes after May 21, 2026: The "V2" data and documentation were updated on May 29th to correct one source (source_168) that was incorrectly labeled as not purchased, and to clarify the definitions of two variables (swp_text and cvp_text).
