Data and code from: Geologic history explains freshwater fish species richness across the conterminous USA
Data files
Oct 16, 2025 version files (9.33 MB)
- FinalData.csv (874.84 KB)
- README.md (6.88 KB)
- Supplementary_File_1.html (3.03 MB)
- SupplementaryData.csv (5.41 MB)
Abstract
Aim: Freshwater fishes comprise over 20% of vertebrate biodiversity despite occupying <1% of the Earth’s surface. However, species richness differs substantially among river basins. Fundamentally, richness patterns can be explained by spatial variation in diversification rates, evolutionary time, and habitat capacities, which are in turn shaped by landscape change over geologic timescales. To test how geologic disturbances have influenced the accumulation of freshwater fish biodiversity, we hypothesized species richness would be (1) ordered by regional geologic history, (2) associated with high or intermediate river capture rates, (3) higher in assemblages with older evolutionary origins, and (4) positively associated with stream size.
Time period: 2008-2019.
Location: Conterminous United States (USA).
Major Taxa: Freshwater fishes.
Methods: We analyzed native species richness from a spatially representative survey of 5,321 fish assemblages at 3,609 sites. Geologic history was determined from surrogates of tectonic activity, glaciation, sea levels, and river capture over the last 66 million years, which were paired with previously published evolutionary time estimates. Hypotheses were tested with spatial linear models.
Results: All hypotheses were at least partially supported. (1) Rank-order richness matched hypothesized effects of geologic disturbances on evolutionary time and diversification rates. (2) Richness peaked in lowlands with high putative river capture rates. (3) Richness increased with evolutionary time at broad scales, but this relationship was weak and influenced by non-teleost taxa. (4) Richness largely increased with stream size.
Overall, the tectonically active western USA exhibited lower richness, weaker effects of stream size, and a greater share of young lineages compared to the more geologically stable eastern USA, especially unglaciated lowlands within the Mississippi Basin.
Main conclusions: We demonstrate that deep-time processes leave a persistent mark on fish species richness. Thus, accounting for geologic history can improve assessments of freshwater biodiversity and biological condition in the USA and beyond.
Dataset DOI: 10.5061/dryad.sqv9s4nh3
Description of the data and file structure
We compiled fish assemblage data from the National Rivers and Streams Assessment (NRSA) conducted by the US Environmental Protection Agency, spanning 5,321 sampling events at 3,609 unique sites. These data were joined with information on the native status of each fish species at each site, the evolutionary time of each species, and the geologic history of each site. The final dataset was analyzed using spatial linear models.
Files and variables
File: Supplementary_File_1.html
Description:
This Quarto document contains all the R code necessary to create, visualize, and analyze our dataset. The backbone of our dataset is fish assemblage data from the EPA National Rivers and Streams Assessment (NRSA), which is publicly available through the finsyncR package.
Users can choose to repeat all the necessary steps to re-create the dataset in their R environment by downloading supplementary data (SupplementaryData.csv; see below) and following directions in the first four sections of the document (Load libraries, Fish data preparation, Glaciation and topography data preparation, Final data preparation). Alternatively, users can download the pre-processed final data file (FinalData.csv; see below) and import it directly into their R environment using code in the fifth section of the document (Final dataset). The remaining sections reproduce Figures 2-6 and key model outputs from the main text.
File: SupplementaryData.csv
Description:
This file contains supplementary information that complements fish assemblage data from the EPA National Rivers and Streams Assessment (NRSA). Specifically, we determined the native status and evolutionary time of each species within each sample based on existing data sources. This information was used to generate the final dataset in R, which can be reproduced using code provided in our Quarto document (Supplementary_File_1.html; see above). All missing values are blank cells.
Variables
- SampleID: Unique identifier for each fish assemblage sample from the EPA National Rivers and Streams Assessment (NRSA).
- Species_NRSA: Taxonomic identifier (Genus.species) for fish species within each sample.
- HUC8: 8-digit hydrologic unit code associated with each sample from the USGS National Hydrography Dataset, which was the spatial unit used to create Designation (see below).
- Designation: Whether a given species within a sample was "Native" or "NonNative", according to various sources described in Source_Designation (see below).
- Source_Designation: The data source used to create Designation. Most species were classified using the USGS Nonindigenous Aquatic Species (NAS) database, but NatureServe and other sources were used when needed.
- EvoTime: Evolutionary time of each native species in millions of years, according to various sources described in Source_EvoTime (see below). Non-native species were left as blank cells.
- Source_EvoTime: The data source used to create EvoTime. Most native species were classified using data from Miller and Roman-Palacios (2021), but several other studies were used for specific taxa. Non-native species were left as blank cells.
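As a rough illustration of how the per-species records above relate to per-sample summaries such as Richness_Native, Richness_Total, and EvoMean in FinalData.csv, the following Python sketch reads rows in the SupplementaryData.csv layout and aggregates them by SampleID. It is not the authors' code (their R workflow is in Supplementary_File_1.html). The example rows, the species names, and the function name `summarize_samples` are hypothetical, and the unweighted mean over native species with a non-blank EvoTime is an assumption about how EvoMean is computed.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; they are NOT taken from SupplementaryData.csv.
SAMPLE_CSV = """SampleID,Species_NRSA,HUC8,Designation,Source_Designation,EvoTime,Source_EvoTime
S1,Genus.alpha,01010001,Native,NAS,12.5,Miller and Roman-Palacios (2021)
S1,Genus.beta,01010001,Native,NAS,7.5,Miller and Roman-Palacios (2021)
S1,Genus.gamma,01010001,NonNative,NAS,,
S2,Genus.alpha,02020002,Native,NAS,12.5,Miller and Roman-Palacios (2021)
"""


def summarize_samples(handle):
    """Return {SampleID: (native_richness, total_richness, mean_evo_time)}.

    Assumption: mean_evo_time is the unweighted mean of EvoTime over native
    species with a non-blank value, or None if no such species exists.
    """
    native = defaultdict(set)
    total = defaultdict(set)
    evo = defaultdict(dict)
    # csv.DictReader keeps every field as text, so HUC8 keeps its leading zeros
    # and blank cells arrive as empty strings.
    for row in csv.DictReader(handle):
        sample_id = row["SampleID"]
        species = row["Species_NRSA"]
        total[sample_id].add(species)
        if row["Designation"] == "Native":
            native[sample_id].add(species)
            if row["EvoTime"].strip():  # blank cell means missing
                evo[sample_id][species] = float(row["EvoTime"])
    summary = {}
    for sample_id in total:
        times = list(evo[sample_id].values())
        mean_time = sum(times) / len(times) if times else None
        summary[sample_id] = (len(native[sample_id]), len(total[sample_id]), mean_time)
    return summary


if __name__ == "__main__":
    for sample_id, stats in summarize_samples(io.StringIO(SAMPLE_CSV)).items():
        print(sample_id, stats)
```

To run the same function on the real file, one would pass an open file handle, for example `open("SupplementaryData.csv", newline="")`, in place of the in-memory text. Reading every column as text keeps the leading zeros in HUC8 codes intact.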
File: FinalData.csv
Description:
This file contains the pre-processed dataset that summarizes fish assemblage data from the EPA National Rivers and Streams Assessment (NRSA), and links it to other data sources that describe the topography and geologic history of sites. This file can be directly imported into R, and code provided in our Quarto document can then be used to reproduce all plots and analyses from the main text (Supplementary_File_1.html; see above). All missing values are blank cells.
Variables
- SampleID: Unique identifier for each fish assemblage sample from the EPA National Rivers and Streams Assessment (NRSA).
- SiteNumber: Unique identifier for each NRSA site. Note that some sites were sampled more than once.
- Latitude_dd: Latitude of the site in decimal degrees.
- Longitude_dd: Longitude of the site in decimal degrees.
- NARS_Ecoregion: Ecoregion associated with each site from the EPA National Aquatic Resource Surveys.
- WettedWidth: Average stream wetted width in meters associated with the sample.
- COMID: Stream segment identifier associated with each site from the USGS National Hydrography Dataset.
- HUC2: Regional 2-digit hydrologic unit code associated with each site from the USGS National Hydrography Dataset.
- Richness_Native: Native species richness within the sample, excluding all non-native species. Derived from Designation in the SupplementaryData.csv file (see above).
- Richness_Total: Total species richness within the sample, including non-native species.
- EvoMean: Average evolutionary time of native species within the sample in millions of years. Derived from EvoTime in the SupplementaryData.csv file (see above).
- Elev_site: Elevation of the site in meters, obtained from the elevatr package.
- Relief: Topographic relief in meters (max-min elevation) within a 1-km buffer around the site, estimated using raster data from the elevatr package.
- Ice: Whether the site was "Glaciated" or "Unglaciated" during the maximum extent of glaciation.
- Tectonic: Whether the site was tectonically "Active" or "Stable" over the last 66 million years.
- SeaLevel: Whether the site was "Flooded" or "Unflooded" by high sea levels over the last 66 million years.
- Topography: Whether the site was a "Lowland" (Elev_site < 250 m) or "Highland" (Elev_site > 250 m).
- History: Combined geologic history of the site based on Tectonic, Ice, Topography, and SeaLevel.
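The variable definitions above imply a few simple consistency checks that can be run on FinalData.csv after import. The Python sketch below is one possible way to express them and is not part of the authors' workflow. The function names `classify_topography` and `check_row` are hypothetical, and the example row is made up. Because the definitions above do not say how a site at exactly 250 m is classified, the sketch returns None for that case rather than guessing.

```python
# Categorical levels as listed in the variable descriptions above.
ALLOWED = {
    "Ice": {"Glaciated", "Unglaciated"},
    "Tectonic": {"Active", "Stable"},
    "SeaLevel": {"Flooded", "Unflooded"},
    "Topography": {"Lowland", "Highland"},
}


def classify_topography(elev_site_m, threshold_m=250.0):
    """Classify a site from its elevation in meters (Elev_site)."""
    if elev_site_m < threshold_m:
        return "Lowland"
    if elev_site_m > threshold_m:
        return "Highland"
    return None  # exactly 250 m is not covered by the stated definition


def check_row(row):
    """Return a list of problems found in one row (a dict of text fields)."""
    problems = []
    for column, allowed in ALLOWED.items():
        value = row.get(column, "").strip()
        if value and value not in allowed:  # blank cells are missing values
            problems.append(f"{column}: unexpected value {value!r}")
    native = row.get("Richness_Native", "").strip()
    total = row.get("Richness_Total", "").strip()
    if native and total and float(native) > float(total):
        problems.append("Richness_Native exceeds Richness_Total")
    elev = row.get("Elev_site", "").strip()
    topo = row.get("Topography", "").strip()
    if elev and topo:
        expected = classify_topography(float(elev))
        if expected is not None and expected != topo:
            problems.append(f"Topography {topo!r} does not match Elev_site {elev}")
    return problems


if __name__ == "__main__":
    # Hypothetical row for illustration only; not taken from FinalData.csv.
    example = {
        "Ice": "Unglaciated", "Tectonic": "Stable", "SeaLevel": "Flooded",
        "Topography": "Lowland", "Richness_Native": "12",
        "Richness_Total": "14", "Elev_site": "85.2",
    }
    print(check_row(example))
```

Rows read with `csv.DictReader` from the real file can be passed to `check_row` directly, since that reader also yields each row as a dict of text fields.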
Code/software
All analyses were conducted in R version 4.4.0 using the following twelve R packages, which are also listed in our Quarto document (Supplementary_File_1.html):
tidyverse, finsyncR, elevatr, sf, terra, tigris, parallel, spmodel, emmeans, ggridges, FishLife, archetypes.
Access information
Data was derived from the following sources, which are all publicly available:
