Planform change and Fundulus SNP data for small watersheds in South Mississippi and Louisiana
Data files
Apr 08, 2024 version files 399.08 MB
Abstract
Fluvial geomorphic processes and the resulting patterns of landform morphogenesis affect the distribution and connectivity of habitat patches for aquatic organisms. Human alterations to fluvial geomorphic processes may affect local habitat quality and stability, and affect connectivity of habitat patches by altering the distribution, supply, and movement of landform-generating materials. This dataset examines 17 watersheds in south Mississippi and southeastern Louisiana and was used in preparation of a manuscript addressing the hypothesis that elevated planform movement, indicative of advanced fluvial erosion, would cause fragmentation among populations of a headwater specialist (Blackspotted Topminnow Fundulus olivaceus). The dataset includes numerous spatial features derived from the NHD+ dataset used in planform measurements, spatial features digitized from NAPP and NAIP aerial imagery measuring planform characteristics and dynamics, additional metrics of each watershed, and a population genetics dataset of single nucleotide polymorphisms (SNPs) for multiple individuals at multiple sites per watershed. Associated code to recreate all analyses in the manuscript is provided.
README: Planform Change and Fundulus olivaceus SNP Data for Small Watersheds in South Mississippi and Louisiana
https://doi.org/10.5061/dryad.nzs7h44xg
Name: Loren Stearman
Institution: University of Southern Mississippi
Email: Loren.Stearman@usm.edu
Name: Jake Schaefer
Institution: University of Southern Mississippi
Email: Jake.Schaefer@usm.edu
Dataset Overview
This dataset contains the data and code required to replicate analyses in Stearman and Schaefer (in review), testing the hypothesis that the rate of geomorphic activity in a river, as inferred from planform dynamics, affects the rate of gene flow among populations of a headwater specialist, Blackspotted Topminnow (Fundulus olivaceus). Data cover 17 watersheds in southern Mississippi and southeastern Louisiana. Geospatial data include planform metrics (channel widths and centerlines) for multiple 1-km subsample reaches per watershed, for streams >100km2 in drainage area, and for two time periods (mid-to-late 1990s and 2019-2020), as well as auxiliary files necessary for analyses. Additional watershed metrics are included for use in analyses. Genetic data include a dataset of single nucleotide polymorphisms (SNPs) and various processed results from this dataset to facilitate analyses. Analysis of planform metrics found that while watersheds were generally similar in the direction of planform change over time (narrowing and becoming more sinuous, with measurable displacement), the rate of change varied considerably, suggesting differing degrees of watershed responsiveness to human perturbations (channelization, mining, etc.) and natural perturbations (a large flood event in the 1980s). Metrics of genetic differentiation and heterozygosity showed significant relationships with metrics of watershed geomorphic change. Analysis of population structure revealed that while geographic structuring was a strong driver of genetic structure, more variation existed in more geomorphically active watersheds, and several localities appear to have ancestral origin from an adjacent watershed. Examination of patterns of heterozygosity found evidence in three of five cases consistent with population bottlenecks in the host watershed and probable recolonization (founder effect) from the presumed donor watershed.
Dates of Data Collection
A. NAPP Aerial Imagery: 1995-1998
B. NAIP Aerial Imagery: 2019-2020
C. Extraction of planform metrics: 2019-2023
D. Extraction of other watershed-level metrics: 2021-2023
E. Fish tissue collections: 2020 - 2022
F. Genetic sequencing data: 2021-2022
Data Spatial Scope
Data were collected in 17 small to medium-sized watersheds in southwest Mississippi and southern Louisiana. Latitudes and longitudes for tissue collection localities are available in the shapefile spatial/Fundulus_Sites_6509.shp. Catchment polygons for watersheds are available in the shapefile spatial/Catchments_6509. Planform metrics were restricted to randomly sampled reaches for streams >100km2 in drainage area.
Funding
Fish tissue collections and sequencing, and geospatial data extraction and analysis, were supported by a grant from the National Science Foundation (DEB-1556778) and by the U.S. Army Corps of Engineers (contract W912HZ21C0064).
Ethics Approval
Tissue collections were conducted under IACUC protocol 15102701.1, granted by the University of Southern Mississippi. Collection activities were performed under collections permit numbers 031191 and 0311201, granted by the Mississippi Department of Wildlife, Fisheries, and Parks, and WDP-22-090, granted by the Louisiana Department of Wildlife and Fisheries.
Sharing/Access information
Sharing/Access
This work is licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license.
Files analytical_script.R, structure_pre_processor.R, and vcf_funcs.R are released under a GNU GPLv3 license.
Related Data Sources
USGS Earth Explorer. https://earthexplorer.usgs.gov/
Mississippi Automated Resource Information System (MARIS). https://maris.mississippi.edu/
Stearman, L. W., and J. F. Schaefer. In Review. Altered metapopulation dynamics in a headwater specialist in geomorphically dynamic watersheds. Nature Ecology and Evolution.
Data Sources
Channel planform metrics were derived from aerial imagery from the National Aerial Photography Program (NAPP, 1995 - 1998) and the National Agriculture Imagery Program (NAIP, 2019 - 2020). Genetics data were derived from fish tissue samples collected from 2020 - 2022 by the authors. Hydrographic features and associated attributes used in determining values in some of the data were derived from the National Hydrography Dataset Plus version 2 (NHD+ v2).
Recommended Citation
Stearman, Loren W., and Jake F. Schaefer. In Review. Altered metapopulation dynamics in a headwater specialist in geomorphically dynamic watersheds. Freshwater Biology.
References (in this ReadMe)
Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633–2635.
Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics 27:2156–2158.
Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple Genotyping-by-Sequencing (GBS) approach for high diversity species. PLOS ONE 6:1–10.
Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software structure: a simulation study. Molecular Ecology 14:2611–2620.
Glaubitz, J. C., T. M. Casstevens, F. Lu, J. Harriman, R. J. Elshire, Q. Sun, and E. S. Buckler. 2014. TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline. PLOS ONE 9:e90346.
Johnson, L. K., C. T. Brown, and A. Whitehead. 2019. Draft genome assemblies of killifish from the Fundulus genus with ONT and Illumina sequencing platforms. Zenodo.
Langmead, B., and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357–359.
Ross, S. T., W. M. Brenneman, W. T. Slack, M. T. O’Connell, and T. L. Peterson. 2001. Inland Fishes of Mississippi. University Press of Mississippi, Jackson, MS.
Description of the data and file structure
Multiple files in this repository require Linux to generate. Users are provided with the appropriate scripts to do so if they wish; however, users are also provided with the file outputs so that obtaining access to a machine with a Linux operating system is not a requirement to replicate our analyses. Missing data are coded as either 9 or NA. In file names below, X refers to the associated Structure run (Runs 1 - 4) and Y refers to the associated geographic group within that run. Files with common descriptions, structure, or methods are grouped below.
Files and Folders
012/012_RX_Y.csv
These files contain 012 (genotype) matrices at different spatial scales, determined either by Structure analyses or by individual watershed. The run identifier (X) is 1-4 for Structure runs 1-4 or WS for individual watersheds, and group abbreviations (Y) typically match watershed abbreviations in the manuscript and throughout the dataset. For example, Run 3 Bayou Pierre + Coles Creek is 012/012_R3_BP_CC.csv.
afd/X_Y_afd_pops.csv
These files contain pairwise Allele Frequency Differences for either the entire genetics dataset (afd/All_afd_pops.csv) or for within individual watersheds (e.g., afd/WS_BP_afd_pops.csv for Bayou Pierre). Note that watersheds are not just a subset of the larger file; vcf filtering processes at finer spatial scales recover more loci (fewer rare alleles) and thus are higher-resolution.
fst/X_Y_fst_pops.csv
These files contain pairwise F[ST] values for either the entire genetics dataset (fst/All_fst_pops.csv for populations, fst/All_fst_sys.csv for comparisons across systems) or for within individual watersheds (e.g., fst/WS_BP_fst_pops.csv for Bayou Pierre). Note that watersheds are not just a subset of the larger file; vcf filtering processes at finer spatial scales recover more loci (fewer rare alleles) and thus are higher-resolution.
het/X_Y_het_ind.csv
These files contain individual heterozygosities calculated at either the entire dataset scale (het/All_het_ind.csv) or within individual watersheds (e.g., het/WS_BP_het_ind.csv for Bayou Pierre). Note that watersheds are not just a subset of the larger file; vcf filtering at finer spatial scales recovers more loci (fewer rare alleles) and thus is higher-resolution.
spatial/Catchments_6509
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains polygons for the 17 study watersheds, derived from the NHD+ v2 dataset. Spatial reference is EPSG 6509.
spatial/Channel_Centers_95_98_6509
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel centerline features for randomly sampled study reaches in each watershed, derived from 1995 - 1998 NAPP imagery. Spatial reference is EPSG 6509.
spatial/Channel_Centers_19_20_6509
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel centerline features for randomly sampled study reaches in each watershed, derived from 2019 - 2020 NAIP imagery. Spatial reference is EPSG 6509.
spatial/Channel_Widths_95_98_6509
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel width features for randomly sampled study reaches in each watershed, derived from 1995 - 1998 NAPP imagery. Spatial reference is EPSG 6509.
spatial/Channel_Widths_19_20_6509
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel width features for randomly sampled study reaches in each watershed, derived from 2019 - 2020 NAIP imagery. Spatial reference is EPSG 6509.
spatial/Fundulus_Sites_6509
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains point feature spatial locations of collections of Fundulus olivaceus for tissue samples. Spatial reference is EPSG 6509.
structure/structure_RX_Y.str
These files contain filtered genetic data derived from the input vcf file (stearmanallproductionvcf20221222.vcf.gz) at different spatial scales determined by sequential previous runs of Structure analysis. Run numbers (X) are 1-4, and group abbreviations typically match watershed abbreviations in the manuscript and throughout the dataset. For example, Run 3 Bayou Pierre + Coles Creek is structure/structure_R3_BP_CC.str.
structure_post/RX_Y.zip
These zip archives contain the results from Structure analysis on files in the directory structure/. Run numbers (X) are 1-4, and group abbreviations typically match watershed abbreviations in the manuscript and throughout the dataset. For example, Run 3 Bayou Pierre + Coles Creek is structure_post/R3_BP_CC.zip.
summaries/output_RX_Y.txt, summaries/summary_RX_Y.csv
These files contain summary results following Structure file creation and include metrics such as heterozygosity, read depth, and percent missing data for individuals. Run numbers (X) are 1-4, and group abbreviations typically match watershed abbreviations in the manuscript and throughout the dataset. For example, the Run 3 Bayou Pierre + Coles Creek files are summaries/output_R3_BPCC.txt and summaries/summary_R3_BPCC.csv.
fid_Master_Index.csv
This file provides a master index for spatial data files to merge watershed and river kilometer to spatial objects.
stearmanallproductionvcf20221222.vcf.gz
This gzip-compressed file contains genomic data recovered from Genotyping-by-Sequencing for individual Fundulus olivaceus specimens, stored in variant call format (vcf). The data in this file have not been filtered, but have been demultiplexed from the original reads and reassembled into a format that can be analyzed with the R package "vcfR".
Structure_k_assignment.csv
This file contains the K assignment for each group in each run of Structure, as determined by sequential Structure analyses. The file is called to facilitate grouping and analyses in various scripts.
vcf_run_groups.csv
This file contains various metadata information and run groups for each individual fish sequenced.
watershed_characteristics.csv
This file contains various watershed level metrics used in analyses of watershed and planform metrics.
Methodology
012/012_RX_Y.csv, afd/X_Y_afd_pops.csv, fst/X_Y_fst_pops.csv, het/X_Y_het_ind.csv
All genetics metric matrices (012 matrices, afd matrices, fst matrices, and individual heterozygosities) were generated during post-processing of the vcf file (stearmanallproductionvcf20221222.vcf.gz) via the script structure_pre_processor.R. This script filtered individuals first by those which were in a particular run level and K group, then applied quality filters across loci, and then removed individuals with >30% missing data. 012 matrices were generated using the function vcf012, fst matrices were generated using the function calc_fst, afd matrices were generated using the function AFD (following the methods of Berner 2019), and heterozygosities were calculated as the proportion of 012 matrix values equal to 1 (heterozygous condition). These files are an output of structure_pre_processor.R, and are provided for users who do not run a Linux operating system (required by structure_pre_processor.R and "vcfR").
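The filtering and summary steps described above can be illustrated with a minimal Python sketch. This is not the authors' R code (vcf012, calc_fst, and AFD live in vcf_funcs.R); the function names below are hypothetical, missing genotypes are represented as None, and two choices are assumptions: heterozygosity is computed over called (non-missing) genotypes only, and AFD is taken as the mean absolute difference in allele frequency across biallelic loci.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1, 2, or None for a missing call (NA)


def missing_fraction(row: List[Genotype]) -> float:
    """Proportion of loci with no genotype call for one individual."""
    return sum(g is None for g in row) / len(row)


def filter_individuals(matrix: Dict[str, List[Genotype]],
                       max_missing: float = 0.30) -> Dict[str, List[Genotype]]:
    """Drop individuals with more than max_missing missing data."""
    return {sample: row for sample, row in matrix.items()
            if missing_fraction(row) <= max_missing}


def heterozygosity(row: List[Genotype]) -> Optional[float]:
    """Proportion of called genotypes equal to 1 (assumption: NA excluded)."""
    called = [g for g in row if g is not None]
    if not called:
        return None
    return sum(g == 1 for g in called) / len(called)


def allele_frequency(rows: List[List[Genotype]], locus: int) -> Optional[float]:
    """Frequency of the allele counted by the 012 coding at one locus."""
    called = [r[locus] for r in rows if r[locus] is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def afd(pop_a: List[List[Genotype]], pop_b: List[List[Genotype]]) -> Optional[float]:
    """Mean absolute allele frequency difference across shared loci."""
    diffs = []
    for locus in range(len(pop_a[0])):
        pa = allele_frequency(pop_a, locus)
        pb = allele_frequency(pop_b, locus)
        if pa is not None and pb is not None:
            diffs.append(abs(pa - pb))
    return sum(diffs) / len(diffs) if diffs else None


# Hypothetical toy data: three individuals, four loci.
toy = {
    "ind_a": [0, 1, None, 1],
    "ind_b": [None, None, 1, 0],
    "ind_c": [0, 1, 2, 1],
}
kept = filter_individuals(toy)  # ind_b has 50% missing data and is dropped
het = {sample: heterozygosity(row) for sample, row in kept.items()}
```

The provided CSV outputs remain the reference values; this sketch only shows the arithmetic on a toy matrix.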
structure/structure_RX_Y.str, summaries/output_RX_Y.txt, summaries/summary_RX_Y.csv
Structure input files and summary files were generated during post-processing of the vcf file (stearmanallproductionvcf20221222.vcf.gz) via the script structure_pre_processor.R. This script filtered individuals first by those which were in a particular run level and K group, then applied quality filters across loci, and then removed individuals with >30% missing data. Structure files were generated with the function vcf_structure. Summary statistics were calculated directly in the script structure_pre_processor.R. These files are an output of structure_pre_processor.R, and are provided for users who do not run a Linux operating system (required by structure_pre_processor.R and "vcfR").
structure_post/RX_Y.zip
Structure output files were generated during runs of the program Structure. These utilized the Structure input files.
spatial/Catchments_6509
These files were created by identifying all NHD+ line features upstream of the mouth of each watershed, then using these features to select all appropriate catchment features, and then dissolving these catchment features. Watershed mouth was defined as the confluence with a major geographic and ecological break (i.e., the Mississippi River, Lake Pontchartrain, or the Gulf of Mexico). The exceptions are the Amite and Comite Rivers, which were split at their confluence (a large wetland complex), and the Homochitto and Buffalo Rivers, which historically shared a complex distributary system near the Mississippi River and were split by their current courses.
spatial/Channel_Centers_95_98_100km_6509, spatial/Channel_Centers_19_20_100km_6509, spatial/Channel_Widths_95_98_100km_6509, spatial/Channel_Widths_19_20_100km_6509
Aerial imagery from NAPP (1995 - 1998) and NAIP (2019 - 2020) was used to extract planform metrics. We identified all streams > 100km2 in drainage area in our study watersheds, and aligned and merged each flowline from the upstream-most point to the watershed mouth. We then used a stratified random sampling regime to sample three 1-km reaches per 10km of stream. Within each reach, we digitized unvegetated channel centerline features and cross sections perpendicular to the centerline at 200m intervals. Randomly sampled reach locations are shared between NAPP and NAIP imagery.
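The stratified random reach selection can be sketched in Python as follows. This is an illustration only, not the procedure used in QGIS: the function name is hypothetical, reaches are represented by the river kilometer at which each 1-km segment starts, and the handling of a final stratum shorter than 10 km (sample up to three of whatever segments exist) is an assumption.

```python
import random
from typing import List, Optional


def sample_reaches(stream_length_km: float,
                   stratum_km: int = 10,
                   reaches_per_stratum: int = 3,
                   seed: Optional[int] = None) -> List[int]:
    """Pick 1-km reaches at random within consecutive strata of a stream.

    Returns the starting river kilometer of each selected reach.
    """
    rng = random.Random(seed)
    n_segments = int(stream_length_km)  # whole 1-km segments only
    selected: List[int] = []
    for start in range(0, n_segments, stratum_km):
        # Candidate 1-km segments inside this stratum.
        segments = list(range(start, min(start + stratum_km, n_segments)))
        k = min(reaches_per_stratum, len(segments))
        # Sample without replacement so no reach is chosen twice.
        selected.extend(sorted(rng.sample(segments, k)))
    return selected


# Hypothetical 25-km mainstem: strata 0-9, 10-19, and a partial 20-24.
reaches = sample_reaches(25, seed=1)
```

Because the same reach locations are shared between the NAPP and NAIP imagery, a selection like this would be drawn once and reused for both time periods.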
spatial/Fundulus_Sites_6509
Sample sites were selected based on a combination of historical records and exploratory sampling. Site latitude and longitude were recorded during sampling events and digitized into shapefile format.
spatial/Streams_6509
Stream features were created by identifying all NHD+ line features upstream of the mouth of each study watershed.
structure/structure_RX_Y.str
Structure files were created first by filtering raw genetic data and then exporting to structure format using the R package "vcfR".
structure_post/RX_Y/RX_Y.zip
Structure output files were created by running structure on structure input files. Zip directories contain ten replicates for each level of K run for structure, for each geographic subsetting level.
summaries/output_RX_Y.txt
Structure file creation summary files (pre-analysis) were automatically generated following structure file creation.
summaries/summary_RX_Y.csv
Structure result summary files (pre-analysis) were generated following structure file creation. Percent missing data, mean heterozygosity, and mean read depth were calculated following filtering of raw genotype data.
fid_Master_Index.csv
This file was created via spatial merges of the 1km stream segments used in random reach selection with the NHD+ PlusFlowlineVAA table (COMID, Strahler Stream Order, TDA, and DDA), a 100m buffer around the 0.2km river kilometer markers generated with QChainage on merged flowlines (RKM), and the spatial/Catchments_6509 feature (Name and Basin).
stearmanallproductionvcf20221222.vcf.gz
This file was created as a product of genetic tissue sequencing. DNA was extracted from tissues using DNeasy blood and tissue kits (Qiagen), and sequenced by Genotyping-by-Sequencing (GBS, Elshire et al. 2011) to obtain Single Nucleotide Polymorphisms (SNPs). The EcoT22I restriction enzyme was used to process DNA prior to PCR amplification. Individuals were sequenced on an Illumina HiSeq platform. Fragments were aligned to a reference Fundulus olivaceus genome (Johnson et al. 2019) using Bowtie 2 (Langmead and Salzberg 2012). We genotyped reads with TASSEL (Bradbury et al. 2007), the results of which were exported to the variant call format (vcf) file stored in this gzip-compressed archive.
Structure_k_assignment.csv
This file was created during sequential Structure analyses. Optimal K was determined using the Evanno method (Evanno et al. 2005).
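For readers unfamiliar with the Evanno method, the statistic can be sketched in Python as below. This is a minimal illustration and not the workflow used to produce this file: it uses the common simplification of taking the absolute second-order difference of the mean log likelihoods, divided by the standard deviation across replicates at K, and the log-likelihood values are invented for the example.

```python
from statistics import mean, stdev
from typing import Dict, List


def delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Evanno-style delta K from replicate ln P(D|K) values.

    lnp maps each K to its list of replicate log likelihoods. Delta K is
    defined only for K values that have both K-1 and K+1 available.
    """
    result: Dict[int, float] = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp:
            continue
        sd = stdev(lnp[k])
        if sd == 0:
            continue  # delta K is undefined when replicates are identical
        second_diff = abs(mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1]))
        result[k] = second_diff / sd
    return result


def best_k(lnp: Dict[int, List[float]]) -> int:
    """K with the largest delta K."""
    scores = delta_k(lnp)
    return max(scores, key=scores.get)


# Invented replicate log likelihoods (two replicates per K for brevity;
# the archives in structure_post/ hold ten replicates per K).
example = {
    1: [-1000.0, -1002.0],
    2: [-800.0, -802.0],
    3: [-790.0, -792.0],
    4: [-785.0, -787.0],
}
chosen = best_k(example)
```

Note that delta K cannot be evaluated at the smallest or largest K tested, which is a known limitation of the method.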
vcf_run_groups.csv
This file was created during sequential Structure analyses. Individuals were assigned to a group based on Q scores from Structure post-processing. Individual assignment followed the maximum Q score in any group. Watersheds where individuals were captured (WS), original field sample ID (GenID), and the population locality (pop) were merged from original data sheets (GenID) or spatial operations.
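The maximum-Q assignment rule described above amounts to the following, shown as a minimal Python sketch. The function name, sample IDs, and Q values are hypothetical, and breaking ties in favor of the first cluster is an assumption; the authors' post-processing was done in R.

```python
from typing import Dict, List


def assign_groups(q_scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Assign each individual to the cluster with its highest Q score.

    Clusters are numbered from 1, matching Structure's convention.
    Ties go to the lowest-numbered cluster (an assumption).
    """
    assignments: Dict[str, int] = {}
    for sample, q in q_scores.items():
        best_index = max(range(len(q)), key=lambda i: q[i])
        assignments[sample] = best_index + 1
    return assignments


# Hypothetical Q scores for K = 2.
q_example = {
    "sample_001": [0.93, 0.07],
    "sample_002": [0.21, 0.79],
    "sample_003": [0.50, 0.50],
}
groups = assign_groups(q_example)
```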
watershed_characteristics.csv
Major watershed characteristics were first extracted from the NHD+ and spatial/Catchments_6509 layers (Name, Basin, AreaKM2). Watershed shape was calculated by dividing the total area by the square of the length of the longest flowpath in each watershed. Watershed slope was calculated as the mean of the NHD+ elevslope table values for streams > 100km2 (presumably those undergoing the most dramatic geomorphic change). Percent forest and urban land use were extracted from the National Land Cover Database 2019 dataset. Site mean distances (SiteMD) were calculated using the R package "riverdist" and the layers spatial/Streams_6509 and spatial/Fundulus_Sites_6509.
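As a worked illustration of the shape calculation (area divided by the square of the longest flowpath length), here is a minimal Python sketch. The function name and the input values are hypothetical; both inputs must be in consistent units (km2 and km here) so that the result is dimensionless.

```python
def watershed_shape(area_km2: float, longest_flowpath_km: float) -> float:
    """Watershed shape as area divided by squared longest flowpath length."""
    if longest_flowpath_km <= 0:
        raise ValueError("longest flowpath length must be positive")
    return area_km2 / longest_flowpath_km ** 2


# Hypothetical watershed: 500 km2 drained along a 50-km longest flowpath.
shape = watershed_shape(500.0, 50.0)
```

For these invented values the result is 500 / 2500 = 0.2.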
File Details
All spatial files are collections of *.cpg, *.dbf, *.prj, *.qmd, *.shp, and *.shx files.
012/012_RX_Y.csv
Number of variables: variable, matches number of loci at the geographic filtering scale.
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable list: (numeric) genotypes in an 012 format
Values:
- 0: Homozygous for common allele
- 1: Heterozygous
- 2: Homozygous for rare allele
Data type: numeric
Missing data value: NA
afd/X_Y_afd_pops.csv
Number of variables: variable, matches number of sample localities at the geographic filtering scale.
Number of rows: variable, matches number of sample localities at the geographic filtering scale.
Variable list: (numeric) Allele frequency difference (afd) values
Data type: numeric
Missing data value: NA
fst/X_Y_fst_pops.csv
Number of variables: variable, matches number of sample localities at the geographic filtering scale.
Number of rows: variable, matches number of sample localities at the geographic filtering scale.
Variable list: (numeric) Fixation index (F[ST]) values
Data type: numeric
Missing data value: NA
het/X_Y_het_ind.csv
Number of variables: 2
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable list:
- sample: (alphanumeric) The sample ID for the individual sequenced.
- Het: (numeric) The mean heterozygosity of all loci filtered at the geographic filtering scale.
Data type: alphanumeric, numeric
Missing data value: NA
spatial/Catchments_6509
Projection: EPSG 6509
Units: Meters
Extent: 578703.7234256960218772,95091.0202181703352835 : 850164.2521547474898398,302126.0561933108838275
Geometry: Polygon (Multipolygon)
Variable count: 3
Feature count: 17
Fields:
- Name: (character) A two-character unique identifier for the watershed.
- Basin: (character) A three-character unique identifier for the biogeographic region/basin as defined by Ross et al. (2001)
- Area: (numeric) The area (meters^2) of the watershed.
Data type: character, numeric
Missing data value: NA
spatial/Channel_Centers_95_98_6509
Projection: EPSG 6509
Units: Meters
Extent: 588736.4942081926856190,96529.2829136248328723 : 839064.0288938571466133,292907.0399084257078357
Geometry: Line (MultiLineString)
Variable count: 4
Feature count: 567
Fields:
- fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
- RKM: (numeric) The river kilometer, from the downstream base of the watershed.
- LengthM: (numeric) The length of the channel centerline feature (m).
- Year: (numeric) The aerial imagery year of the observation.
Data type: numeric
Missing data value: NA
spatial/Channel_Centers_19_20_6509
Projection: EPSG 6509
Units: Meters
Extent: 588736.0058691384037957,96531.7282540517044254 : 839071.2701355047756806,292897.4984815907664597
Geometry: Line (MultiLineString)
Variable count: 4
Feature count: 567
Fields:
- fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
- RKM: (numeric) The river kilometer, from the downstream base of the watershed.
- LengthM: (numeric) The length of the channel centerline feature (m).
- Year: (numeric) The aerial imagery year of the observation.
Data type: numeric
Missing data value: NA
spatial/Channel_Widths_95_98_6509
Projection: EPSG 6509
Units: Meters
Extent: 588899.9480761414160952,96524.7174289985414362 : 839010.5488149614538997,292899.4530549644259736
Geometry: Line (MultiLineString)
Variable count: 6
Feature count: 2834
Fields:
- TID: (numeric) A unique identifier for each cross-channel transect.
- fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
- KM: (numeric) The river kilometer of the transect, from the downstream base of the watershed.
- Year: (numeric) The aerial imagery year of the observation.
- LengthM: (numeric) The length of the transect (m) spanning the unvegetated channel.
- RKM: (numeric) The river kilometer, from the downstream base of the watershed.
Data type: numeric
Missing data value: NA
spatial/Channel_Widths_19_20_6509
Projection: EPSG 6509
Units: Meters
Extent: 588886.5448370444355533,96527.9845805226650555 : 839011.0730742360465229,292902.8447326631867327
Geometry: Line (MultiLineString)
Variable count: 6
Feature count: 2830
Fields:
- TID: (numeric) A unique identifier for each cross-channel transect.
- fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
- KM: (numeric) The river kilometer of the transect, from the downstream base of the watershed.
- Year: (numeric) The aerial imagery year of the observation.
- LengthM: (numeric) The length of the transect (m) spanning the unvegetated channel.
- RKM: (numeric) The river kilometer, from the downstream base of the watershed.
Data type: numeric
Missing data value: NA
spatial/Fundulus_Sites_6509
Projection: EPSG 6509
Units: Meters
Extent: 587449.5703422533115372,105346.8814515384001425 : 838477.7530437646200880,292878.6534213094273582
Geometry: Point (Point)
Variable count: 8
Feature count: 88
Fields:
- pop: (alphanumeric) A unique site identifier.
- Lat: (numeric) The latitude in decimal degrees, NAD83.
- Lon: (numeric) The longitude in decimal degrees, NAD83.
- Stream: (character) The name of the stream where the site is located.
- Locale: (character) A text string specifying the location of the site relative to local landmarks.
- System: (character) The watershed/system containing the site.
- GenID: (character) A string specifying all GenID values (sample collection identifiers) associated with the site.
- nFolivaceu: (numeric) The number of Fundulus olivaceus collected at the locality.
Data type: alphanumeric, character, numeric
Missing data value: NA
spatial/Streams_6509
Projection: EPSG 6509
Units: Meters
Extent: 580346.8646270645549521,95384.0065049122058554 : 848940.3462441070005298,301512.6286104589817114
Geometry: Line (MultiLineStringZM)
Variable count: 7
Feature count: 12580
Fields:
- COMID: (numeric) The NHD+ V2 common identifier for the stream segment.
- SSO: (numeric) The Strahler Stream Order for the stream segment.
- TDA: (numeric) The total drainage area (km^2) upstream of the downstream end of the stream segment.
- DDA: (numeric) The adjusted drainage area (km^2) upstream of the downstream end of the stream segment.
- Name: (character) A two-character unique identifier for the watershed containing the stream segment.
- Basin: (character) A three-character unique identifier for the biogeographic region/basin containing the stream segment, as defined in Ross et al. (2001).
Data type: character, numeric
Missing data value: NA
structure/structure_RX_Y.str
Number of variables: variable, matches number of loci at the geographic filtering scale.
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable values:
- Description: genotypes in structure format (two digit, each specifying one allele at a biallelic locus)
- Values: (numeric) 1, 2, 3, 4
Data type: numeric
Missing data value: 9
structure_post/RX_Y/RX_Y
These files contain detailed and default outputs from structure analysis used in selection of K (number of populations). File structure varies between runs and is set by structure. However, files typically contain the command line argument used to run the analysis, basic run parameters, a table of inferred proportional membership at the level of K selected, allele frequency divergence estimates among populations, a table of inferred ancestry by individual, and an extensive list of allele frequency estimates at each locus.
summaries/output_RX_Y.txt
These files contain automated outputs from VCFtools following output of a VCF file. Files are structured as blocks of text and are retained for end users. Files list number of individuals and sites kept after filtering.
summaries/summary_RX_Y.csv
Number of variables: 4
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable list:
- sample: (alphanumeric) The sample ID for the individual sequenced.
- missing: (numeric) The proportion (0-1) of missing loci for the individual.
- heterozygosity: (numeric) The mean heterozygosity for the individual.
- meandepth: (numeric) The mean read depth for the individual across all loci at the geographic filtering scale.
Data type: alphanumeric, numeric
Missing data value: NA
fid_Master_Index.csv
Number of variables: 8
Number of rows: 2844
Variable list:
- fid: (numeric) A unique numeric identifier code assigned to watersheds during GIS processing.
- COMID: (numeric) The NHD+ V2 common identifier for the stream segment.
- TDA: (numeric) The total drainage area (km^2) for the stream segment.
- DDA: (numeric) The adjusted drainage area (km^2) for the stream segment.
- RKM: (numeric) The river kilometer for the measurements taken during GIS analyses.
- Name: (character) A two-character identifier for the watershed.
- Basin: (character) A three-character identifier for the biogeographic division/basin as defined in Ross et al. (2001).
Data type: numeric, character
Missing data value: NA
stearmanallproductionvcf20221222.vcf.gz
Number of metadata rows: 10
Number of header rows: 11
Number of variables: 730
Number of rows: 158087
Variable list:
- CHROM: (numeric) The chromosome for the locus.
- POS: (numeric) The position of the locus.
- ID: (alphanumeric) A unique identifier for the locus.
- REF: (character, A, C, G, T) The reference allele(s) for the locus.
- ALT: (character, A, C, G, T) Alternate allele(s) for the locus.
- QUAL: (numeric) Phred-scaled quality score variant call.
- FILTER: (character) PASS if the locus passed filtration; otherwise an indicator of quality of variant call.
- INFO: (character) Additional information
- FORMAT: (character) A character string specifying the format of the calls.
- Variables 10-721 are read calls for each individual.
Data type: alphanumeric, character, numeric
Missing data value: blank (default from structure and vcftools)
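Users who want to confirm the layout described above (metadata rows, a header row with nine fixed fields, then per-individual call columns) without R can inspect the file with the Python standard library. The sketch below is an illustration only; the function name and the two-sample example text are hypothetical, and the real file is read by passing a gzip text handle instead of the in-memory example.

```python
import gzip
import io
from typing import Dict, TextIO


def summarize_vcf(handle: TextIO) -> Dict[str, object]:
    """Count metadata lines and records, and list sample columns, in a VCF."""
    meta_lines = 0
    columns = []
    n_records = 0
    for line in handle:
        if line.startswith("##"):
            meta_lines += 1  # metadata rows
        elif line.startswith("#CHROM"):
            columns = line.rstrip("\n").lstrip("#").split("\t")  # header row
        elif line.strip():
            n_records += 1  # one row per locus
    return {
        "meta_lines": meta_lines,
        "n_columns": len(columns),
        "samples": columns[9:],  # columns after the nine fixed VCF fields
        "n_records": n_records,
    }


# Hypothetical two-sample VCF text used to exercise the function.
example_vcf = (
    "##fileformat=VCFv4.0\n"
    "##source=hypothetical\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n"
    "scaffold1\t101\tS1_101\tA\tG\t.\tPASS\t.\tGT\t0/1\t./.\n"
)
summary = summarize_vcf(io.StringIO(example_vcf))

# For the real file (not run here):
# with gzip.open("stearmanallproductionvcf20221222.vcf.gz", "rt") as fh:
#     print(summarize_vcf(fh))
```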
Structure_k_assignment.csv
Number of variables: 2
Number of rows: 31
Variable list:
- Run_Group: (character) The geographic grouping for a given level of structure analysis.
- K: (numeric) The recovered number of populations for the Run_Group at a given level of structure analysis.
Data type: character, numeric
Missing data value: NA
vcf_run_groups.csv
Number of variables: 8
Number of rows: 722
Variable list:
- sample: (alphanumeric) The sample ID for the individual sequenced.
- R1: (character) The run group to which an individual assigned during structure analysis for the first level of analyses.
- R2: (character) The run group to which an individual assigned during structure analysis for the second level of analyses.
- R3: (character) The run group to which an individual assigned during structure analysis for the third level of analyses.
- R4: (character) The run group to which an individual assigned during structure analysis for the fourth level of analyses.
- WS: (character) A two-character identifier for the watershed in which an individual was collected.
- GenID: (alphanumeric) A unique sample identifier linked to both spatial location and date of sample.
- pop: (alphanumeric) A unique site identifier.
Data type: alphanumeric, character, numeric
Missing data value: NA
watershed_characteristics.csv
Number of variables: 8
Number of rows: 18
Variable list:
- Name: (character) A two-character unique identifier for the watershed.
- Basin: (character) A three-character unique identifier for the biogeographic region/basin as defined in Ross et al. (2001)
- AreaKM2: (numeric) The drainage area in square kilometers for the entire watershed.
- Shape: (numeric) A shape metric (0 = perfectly fan shaped, 1 = perfectly linear) for the watershed.
- Slope: (numeric) The slope (m/m) of the watershed.
- PerFor: (numeric) The proportion of forested land cover in the watershed.
- PerUrb: (numeric) The proportion of urban land cover in the watershed.
- SiteMD: (numeric) The mean distance (m) of each site to every other site in the watershed.
Data type: character, numeric
Missing data value: NA
Code/Software
analysis_script.R
This file conducts the majority of analyses used in the manuscript. It does not process vcf files, and does not create 012 matrices, afd files, fst files, heterozygosity files, structure input files, or the vcf summary files.
Language and Environment
R Environment for Statistical Computing
Version
R 4.1.2
Dependencies
- BiodiversityR
- car
- corrplot
- DescTools
- emmeans
- gridExtra
- mgcv
- multcomp
- multcompView
- plotrix
- pophelper
- remotes
- sf
- vegan
structure_pre_processor.R
This file processes the raw vcf file (stearmanallproductionvcf20221222.vcf.gz) to create 012 matrices, afd files, fst files, heterozygosity files, structure input files, and the vcf summary files. This file must run in a Linux environment.
Language and Environment
R Environment for Statistical Computing
Version
R 4.1.2
Dependencies
- adegenet
- DescTools
- tidyverse
- vcfR
vcf_funcs.R
This file contains functions required by the file structure_pre_processor.R
Language and Environment
R Environment for Statistical Computing
Version
R 4.1.2
Dependencies
- DescTools
- tidyverse
- vcfR
Methods
Fluvial geomorphic measurements were collected using a remote-sensing approach with QGIS (open-source) geospatial software and National Aerial Photography Program (NAPP, 1995-1998) and National Agriculture Imagery Program (NAIP, 2019-2020) aerial imagery. River mainstems (>100km2 drainage area) were subsampled in 1-km reaches with a stratified random approach (3km sampled per 10km of stream). Planform metrics used to estimate planform characteristics or dynamics (i.e., unvegetated channel width and centerlines) were digitized for each sampled reach in each time period (1990s and 2019-2020) and exported as shapefiles. Associated watershed metrics pertinent to gene flow were extracted directly or calculated from the NHD+ dataset (e.g., watershed size) or the National Land Cover Database 2019 dataset. Blackspotted Topminnows were sampled from multiple localities per watershed by standard fish sampling techniques. Fin clips were taken in the field, and DNA was extracted and sequenced. Raw sequence data were aligned to a reference genome, SNPs were identified, and data were filtered to produce a final SNP dataset for analysis. As filtering utilized the R package “vcfR”, which requires Linux, we additionally provide both the code and the outputs of those portions of the data analysis workflow to facilitate universal use of the dataset.