Planform change and Fundulus SNP data for small watersheds in South Mississippi and Louisiana

Stearman, Loren¹; Schaefer, Jake¹

Research facility: University of Southern Mississippi

Published Apr 08, 2024 on Dryad. https://doi.org/10.5061/dryad.nzs7h44xg

Data files

Apr 08, 2024 version files (399.08 MB total)

20240405.zip (399.05 MB)
README.md (33.74 KB)

Abstract

Fluvial geomorphic processes and the resulting patterns of landform morphogenesis affect the distribution and connectivity of habitat patches for aquatic organisms. Human alterations to fluvial geomorphic processes may affect local habitat quality and stability, and affect connectivity of habitat patches by altering the distribution, supply, and movement of landform-generating materials. This dataset examines 17 watersheds in south Mississippi and southeastern Louisiana and was used in preparation of a manuscript addressing the hypothesis that elevated planform movement, indicative of advanced fluvial erosion, would cause fragmentation among populations of a headwater specialist (Blackspotted Topminnow Fundulus olivaceus). The dataset includes numerous spatial features derived from the NHD+ dataset used in planform measurements, spatial features digitized from NAPP and NAIP aerial imagery measuring planform characteristics and dynamics, additional metrics of each watershed, and a population genetics dataset of single nucleotide polymorphisms (SNPs) for multiple individuals at multiple sites per watershed. Associated code to recreate all analyses in the manuscript is provided.

Principal Investigator Contact Information

Name: Loren Stearman
Institution: University of Southern Mississippi
Email: Loren.Stearman@usm.edu

Alternate Contact Information

Name: Jake Schaefer
Institution: University of Southern Mississippi
Email: Jake.Schaefer@usm.edu

Dataset Overview

This dataset contains the data and code required to replicate analyses in Stearman and Schaefer (in review), testing the hypothesis that the rate of geomorphic activity in a river, as inferred from planform dynamics, affects the rate of gene flow among populations of a headwater specialist, Blackspotted Topminnow (Fundulus olivaceus). Data cover 17 watersheds in southern Mississippi and southeastern Louisiana. Geospatial data include planform metrics (channel widths and centerlines) for multiple 1-km subsample reaches per watershed, restricted to streams >100 km2 in drainage area, at two time periods (1995-1998 and 2019-2020), as well as auxiliary files necessary for analyses. Additional watershed metrics are included for use in analyses. Genetic data include a dataset of single nucleotide polymorphisms (SNPs) and various processed results from this dataset to facilitate analyses. Analysis of planform metrics found that while watersheds were generally similar in the direction of planform change over time (narrowing and becoming more sinuous, with measurable displacement), the rate of change varied considerably, suggesting differing degrees of watershed responsiveness to human perturbations (channelization, mining, etc.) and natural perturbations (a large flood event in the 1980s). Metrics of genetic differentiation and heterozygosity showed significant relationships with metrics of watershed geomorphic change. Analysis of population structure revealed that while geographic structuring was a strong driver of genetic structure, more variation existed in more geomorphically active watersheds, and several localities appear to have ancestral origin from an adjacent watershed. Examination of patterns of heterozygosity found evidence in three of five cases consistent with population bottlenecks in the host watershed, and probable recolonization (founder effect) from the presumed donor watershed.

Dates of Data Collection

A. NAPP Aerial Imagery: 1995-1998
B. NAIP Aerial Imagery: 2019-2020
C. Extraction of planform metrics: 2019-2023
D. Extraction of other watershed-level metrics: 2021-2023
E. Fish tissue collections: 2020-2022
F. Genetic sequencing data: 2021-2022

Data Spatial Scope

Data were collected in 17 small to medium-sized watersheds in southwest Mississippi and southern Louisiana. Latitudes and longitudes for tissue collection localities are available in the shapefile spatial/Fundulus_Sites_6509.shp. Catchment polygons for watersheds are available in the shapefile spatial/Catchments_6509. Planform metrics were restricted to randomly sampled reaches for streams >100km2 in drainage area.

Funding

Fish tissue collections and sequencing, and geospatial data extraction and analysis, were supported by a grant from the National Science Foundation (DEB-1556778) and by the U.S. Army Corps of Engineers (contract W912HZ21C0064).

Ethics Approval

Tissue collections were conducted under IACUC protocol 15102701.1, granted by the University of Southern Mississippi. Collection activities were performed under collection permit numbers 031191 and 0311201, granted by the Mississippi Department of Wildlife, Fisheries, and Parks, and WDP-22-090, granted by the Louisiana Department of Wildlife and Fisheries.

Sharing/Access information

Sharing/Access

This work is licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license.
Files analytical_script.R, structure_pre_processor.R, and vcf_funcs.R are licensed under the GNU GPLv3 license.

Related Data Sources

USGS Earth Explorer. https://earthexplorer.usgs.gov/
Mississippi Automated Resource Information System (MARIS). https://maris.mississippi.edu/
Stearman, L. W., and J. F. Schaefer. In Review. Altered metapopulation dynamics in a headwater specialist in geomorphically dynamic watersheds. Nature Ecology and Evolution.

Data Sources

Channel planform metrics were derived from aerial imagery from the National Aerial Photography Program (NAPP, 1995 - 1998) and the National Agriculture Imagery Program (NAIP, 2019 - 2020). Genetics data were derived from fish tissue samples collected from 2020 - 2022 by the authors. Hydrographic features and associated attributes used in determining values in some of the data were derived from the National Hydrography Dataset Plus version 2 (NHD+v2).

Recommended Citation

Stearman, Loren W., and Jake F. Schaefer. In Review. Altered metapopulation dynamics in a headwater specialist in geomorphically dynamic watersheds. Freshwater Biology.

References (in this ReadMe)

Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633–2635.

Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics 27:2156–2158.

Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple Genotype-by-Sequencing (GBS) approach for high diversity species. PLOS ONE 6:1–10.

Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software structure: a simulation study. Molecular Ecology 14:2611–2620.

Glaubitz, J. C., T. M. Casstevens, F. Lu, J. Harriman, R. J. Elshire, Q. Sun, and E. S. Buckler. 2014. TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline. PLOS ONE 9:e90346.

Johnson, L. K., C. T. Brown, and A. Whitehead. 2019. Draft genome assemblies of killifish from the Fundulus genus with ONT and Illumina sequencing platforms. Zenodo.

Langmead, B., and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357–359.

Ross, S. T., W. M. Brenneman, W. T. Slack, M. T. O’Connell, and T. L. Peterson. 2001. Inland Fishes of Mississippi. University Press of Mississippi, Jackson, MS.

Description of the data and file structure

Multiple files in this repository require Linux to generate. Users are provided with the appropriate scripts to do so if they wish; however, users are also provided with the file outputs so that obtaining access to a machine with a Linux operating system is not a requirement to replicate our analyses. Missing data are coded as either 9 or NA. In file names below, X refers to the associated Structure run (Runs 1 - 4) and Y refers to the associated geographic group within that run. Files with common descriptions, structure, or methods are grouped below.

Files and Folders

012/012_RX_Y.csv

These files contain 012 (genotype) matrices at different spatial scales, determined either by Structure analyses or for watersheds individually. Run identifiers (X) are 1-4 for Structure runs 1-4, or WS for individual watersheds. Group abbreviations (Y) typically match watershed abbreviations in the manuscript and throughout the dataset. For example, Run 3 Bayou Pierre + Coles Creek is 012/012_R3_BP_CC.csv.

afd/X_Y_afd_pops.csv

These files contain pairwise Allele Frequency Differences for either the entire genetics dataset (afd/All_afd_pops.csv) or for within individual watersheds (e.g., afd/WS_BP_afd_pops.csv for Bayou Pierre). Note that watersheds are not just a subset of the larger file; vcf filtering processes at finer spatial scales recover more loci (fewer rare alleles) and thus are higher-resolution.

fst/X_Y_fst_pops.csv

These files contain pairwise F[ST] values for either the entire genetics dataset (fst/All_fst_pops.csv for populations, fst/All_fst_sys.csv for comparisons across systems) or for within individual watersheds (e.g., fst/WS_BP_fst_pops.csv for Bayou Pierre). Note that watersheds are not just a subset of the larger file; vcf filtering processes at finer spatial scales recover more loci (fewer rare alleles) and thus are higher-resolution.

het/X_Y_het_ind.csv

These files contain individual heterozygosities calculated at either the entire dataset scale (het/All_het_ind.csv) or within individual watersheds (e.g., het/WS_BP_het_ind.csv for Bayou Pierre). Note that watersheds are not just a subset of the larger file; vcf filtering at finer spatial scales recovers more loci (fewer rare alleles) and thus is higher-resolution.

spatial/Catchments_6509

This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains polygons for the 17 study watersheds, derived from the NHD+ v2 dataset. Spatial reference is EPSG 6509.

spatial/Channel_Centers_95_98_6509

This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel centerline features for randomly sampled study reaches in each watershed, derived from 1995 - 1998 NAPP imagery. Spatial reference is EPSG 6509.

spatial/Channel_Centers_19_20_6509

This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel centerline features for randomly sampled study reaches in each watershed, derived from 2019 - 2020 NAIP imagery. Spatial reference is EPSG 6509.

spatial/Channel_Widths_95_98_6509

This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel width features for randomly sampled study reaches in each watershed, derived from 1995 - 1998 NAPP imagery. Spatial reference is EPSG 6509.

spatial/Channel_Widths_19_20_6509

This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains unvegetated channel width features for randomly sampled study reaches in each watershed, derived from 2019 - 2020 NAIP imagery. Spatial reference is EPSG 6509.

spatial/Fundulus_Sites_6509

This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains point feature spatial locations of collections of Fundulus olivaceus for tissue samples. Spatial reference is EPSG 6509.

structure/structure_RX_Y.str

These files contain filtered genetic data derived from the input vcf file (stearmanallproductionvcf20221222.vcf.gz) at different spatial scales determined by sequential previous runs of Structure analysis. Run numbers (X) are 1-4, and group abbreviations typically match watershed abbreviations in the manuscript and throughout the dataset. For example, Run 3 Bayou Pierre + Coles Creek is structure/structure_R3_BP_CC.str.

structure_post/RX_Y.zip

These zip archives contain the results from Structure analysis on files in the directory structure/. Run numbers (X) are 1-4, and group abbreviations typically match watershed abbreviations in the manuscript and throughout the dataset. For example, Run 3 Bayou Pierre + Coles Creek is structure_post/R3_BP_CC.zip.

summaries/output_RX_Y.txt, summaries/summary_RX_Y.csv

These files contain summary results following Structure file creation and include metrics such as heterozygosity, read depth, and percent missing data for individuals. Run numbers (X) are 1-4, and group abbreviations typically match watershed abbreviations in the manuscript and throughout the dataset. For example, the Run 3 Bayou Pierre + Coles Creek files are summaries/output_R3_BPCC.txt and summaries/summary_R3_BPCC.csv.

fid_Master_Index.csv

This file provides a master index for spatial data files to merge watershed and river kilometer to spatial objects.

stearmanallproductionvcf20221222.vcf.gz

This gzip-compressed file contains genomic data in variant call format (vcf) for individual samples, recovered from Genotype-by-Sequencing of Fundulus olivaceus specimens. The data in this file have not been filtered, but have been demultiplexed from the original reads and reassembled into a format that can be analyzed with the R package "vcfR".

Structure_k_assignment.csv

This file contains the K assignment for each group in each run of Structure, as determined by sequential Structure analyses. The file is called to facilitate grouping and analyses in various scripts.

vcf_run_groups.csv

This file contains various metadata information and run groups for each individual fish sequenced.

watershed_characteristics.csv

This file contains various watershed level metrics used in analyses of watershed and planform metrics.

Methodology

012/012_RX_Y.csv, afd/X_Y_afd_pops.csv, fst/X_Y_fst_pops.csv, het/X_Y_het_ind.csv

All genetics metric matrices (012 matrices, afd matrices, fst matrices, and individual heterozygosities) were generated during post-processing of the vcf file (stearmanallproductionvcf20221222.vcf.gz) via the script structure_pre_processor.R. This script filtered individuals first by those which were in a particular run level and K group, then applied quality filters across loci, and then removed individuals with >30% missing data. 012 matrices were generated using the function vcf012, fst matrices were generated using the function calc_fst, afd matrices were generated using the function AFD (following the methods of Berner 2019), and heterozygosities were calculated as the proportion of 012 matrix values equal to 1 (heterozygous condition). These files are an output of structure_pre_processor.R, and are provided for users who do not run a Linux operating system (required by structure_pre_processor.R and "vcfR").
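The missing-data filter and heterozygosity calculation described above can be illustrated with a minimal Python sketch. This is not the authors' code (their implementation is in structure_pre_processor.R and vcf_funcs.R); the function names and sample IDs are hypothetical, and the choice to exclude missing genotypes from the heterozygosity denominator is an assumption.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1, 2, or None for a missing call (NA)


def missing_fraction(genotypes: List[Genotype]) -> float:
    """Proportion of loci without a genotype call for one individual."""
    return sum(g is None for g in genotypes) / len(genotypes)


def heterozygosity(genotypes: List[Genotype]) -> float:
    """Proportion of called loci coded 1 (heterozygous).

    Assumption: missing loci are dropped from the denominator.
    """
    called = [g for g in genotypes if g is not None]
    return sum(g == 1 for g in called) / len(called)


def filter_and_summarise(
    matrix: Dict[str, List[Genotype]], max_missing: float = 0.30
) -> Dict[str, float]:
    """Drop individuals with more than max_missing missing data,
    then return heterozygosity for the individuals that remain."""
    return {
        sample: heterozygosity(row)
        for sample, row in matrix.items()
        if missing_fraction(row) <= max_missing
    }


# Hypothetical 012 rows for two individuals at five loci.
example = {
    "ind_A": [0, 1, 1, 2, None],        # 20% missing, kept
    "ind_B": [None, None, 1, 0, 2],     # 40% missing, removed
}
result = filter_and_summarise(example)
```

Here `result` retains only `ind_A`, whose two heterozygous calls out of four called loci give a heterozygosity of 0.5.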

structure/structure_RX_Y.str, summaries/output_RX_Y.txt, summaries/summary_RX_Y.csv

Structure input files and summary files were generated during post-processing of the vcf file (stearmanallproductionvcf20221222.vcf.gz) via the script structure_pre_processor.R. This script filtered individuals first by those which were in a particular run level and K group, then applied quality filters across loci, and then removed individuals with >30% missing data. Structure files were generated with the function vcf_structure. Summary statistics were calculated directly in the script structure_pre_processor.R. These files are an output of structure_pre_processor.R, and are provided for users who do not run a Linux operating system (required by structure_pre_processor.R and "vcfR").

structure_post/RX_Y.zip

Structure output files were generated during runs of the program Structure. These utilized the Structure input files.

spatial/Catchments_6509

These files were created by identifying all NHD+ line features upstream of the mouth of each watershed, then using these features to select all appropriate catchment features, and then dissolving these catchment features. Watershed mouth was defined as the confluence with a major geographic and ecological break (i.e., the Mississippi River, Lake Pontchartrain, or the Gulf of Mexico). The exceptions are the Amite and Comite Rivers, which were split at their confluence (a large wetland complex), and the Homochitto and Buffalo Rivers, which historically shared a complex distributary system near the Mississippi River (split by current courses).

spatial/Channel_Centers_95_98_100km_6509, spatial/Channel_Centers_19_20_100km_6509, spatial/Channel_Widths_95_98_100km_6509, spatial/Channel_Widths_19_20_100km_6509

Aerial imagery from NAPP (1995 - 1998) and NAIP (2019 - 2020) was used to extract planform metrics. We identified all streams > 100 km2 in drainage area in our study watersheds, and aligned and merged each flowline from the upstream-most point to the watershed mouth. We then used a stratified random sampling regime to sample three 1-km reaches per 10 km of stream. Within each reach, we digitized unvegetated channel centerline features, and digitized cross sections perpendicular to the centerline at 200-m intervals. Randomly sampled locations are shared between NAPP and NAIP imagery.
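As a rough illustration of the stratified random sampling regime, the Python sketch below draws three 1-km reaches from each 10-km stratum of a merged flowline. It is not the authors' GIS workflow. The function names are hypothetical, and several details are assumptions: reaches start on whole river-kilometer marks, a short final stratum is sampled like a full one, and transects are placed at 0, 200, 400, 600, and 800 m along each reach.

```python
import random
from typing import List


def sample_reach_starts(
    stream_length_km: int,
    per_stratum: int = 3,
    stratum_km: int = 10,
    seed: int = 0,
) -> List[int]:
    """Return starting river kilometers of randomly sampled 1-km reaches.

    The stream is cut into consecutive strata of stratum_km; within each
    stratum, per_stratum distinct whole-kilometer starts are drawn, so the
    sampled 1-km reaches never overlap.
    """
    rng = random.Random(seed)
    starts: List[int] = []
    for stratum_start in range(0, stream_length_km, stratum_km):
        stratum_end = min(stratum_start + stratum_km, stream_length_km)
        candidates = list(range(stratum_start, stratum_end))
        k = min(per_stratum, len(candidates))
        starts.extend(sorted(rng.sample(candidates, k)))
    return starts


def transect_offsets_m(reach_length_m: int = 1000, spacing_m: int = 200) -> List[int]:
    """Distances (m) along a reach at which cross sections are placed."""
    return list(range(0, reach_length_m, spacing_m))


# A hypothetical 25-km stream: strata 0-10, 10-20, and 20-25 km.
reach_starts = sample_reach_starts(25)
offsets = transect_offsets_m()
```

Because the same seed reproduces the same reach starts, the sampled locations can be reused for both imagery periods, which mirrors the shared NAPP and NAIP locations described above.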

spatial/Fundulus_Sites_6509

Sample sites were selected based on a combination of historical records and exploratory sampling. Site latitude and longitude were recorded during sampling events and digitized into shapefile format.

spatial/Streams_6509

Stream features were created by identifying all NHD+ line features upstream of the mouth of each study watershed.

structure/structure_RX_Y.str

Structure files were created first by filtering raw genetic data and then exporting to structure format using the R package "vcfR".

structure_post/RX_Y/RX_Y.zip

Structure output files were created by running structure on structure input files. Zip directories contain ten replicates for each level of K run for structure, for each geographic subsetting level.

summaries/output_RX_Y.txt

Structure file creation summary files (pre-analysis) were automatically generated following structure file creation.

summaries/summary_RX_Y.csv

Structure result summary files (pre-analysis) were generated following structure file creation. Percent missing data, mean heterozygosity, and mean read depth were calculated following filtering of raw genotype data.

fid_Master_Index.csv

This file was created via spatial mergers of the 1-km stream segments used in random reach selection with the NHD+ PlusFlowlineVAA table (COMID, Strahler Stream Order, TDA, and DDA), a 100-m buffer around the 0.2-km river kilometer markers generated with QChainage on merged flowlines (RKM), and the spatial/Catchments_6509 feature (Name and Basin).

stearmanallproductionvcf20221222.vcf.gz

This file was created as a product of genetic tissue sequencing. DNA was extracted from tissues using DNeasy Blood and Tissue kits (Qiagen), and sequenced by Genotype-By-Sequencing (GBS, Elshire et al. 2011) to obtain Single Nucleotide Polymorphisms (SNPs). The EcoT22I restriction enzyme was used to digest DNA prior to PCR amplification. Individuals were sequenced on an Illumina HiSeq platform. Fragments were aligned to a reference Fundulus olivaceus genome (Johnson et al. 2019) using Bowtie 2 (Langmead and Salzberg 2012). We genotyped reads with TASSEL (Bradbury et al. 2007), the results of which were exported to the variant call format (vcf) file stored in this gzip archive.

Structure_k_assignment.csv

This file was created during sequential Structure analyses. Optimal K was determined using the Evanno method (Evanno et al. 2005).
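For readers unfamiliar with the Evanno method, the sketch below computes ΔK from replicate log-likelihoods, here taken as the absolute second-order difference of the mean L(K) divided by the standard deviation of L(K) across replicates. This is an illustrative Python version, not the tooling used by the authors. The log-likelihood values are invented for the example, and only three replicates per K are shown for brevity.

```python
from statistics import mean, stdev
from typing import Dict, List


def evanno_delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Compute Delta K for every K that has both K-1 and K+1 available.

    lnp maps each K to the replicate log-likelihoods L(K).
    A K whose replicates are all identical (zero standard deviation)
    would raise ZeroDivisionError; that case is not handled here.
    """
    means = {k: mean(values) for k, values in lnp.items()}
    delta: Dict[int, float] = {}
    for k in sorted(lnp):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            delta[k] = second_diff / stdev(lnp[k])
    return delta


def best_k(lnp: Dict[int, List[float]]) -> int:
    """Return the K with the largest Delta K."""
    delta = evanno_delta_k(lnp)
    return max(delta, key=delta.get)


# Invented replicate log-likelihoods for K = 1..4.
example_lnp = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-800.0, -801.0, -799.0],
    3: [-790.0, -795.0, -785.0],
    4: [-788.0, -792.0, -784.0],
}
delta_k = evanno_delta_k(example_lnp)
chosen_k = best_k(example_lnp)
```

Note that ΔK is undefined for the smallest and largest K tested, so the method cannot select either endpoint.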

vcf_run_groups.csv

This file was created during sequential Structure analyses. Individuals were assigned to a group based on Q scores from Structure post-processing. Individual assignment followed the maximum Q score in any group. Watersheds where individuals were captured (WS), original field sample ID (GenID), and the population locality (pop) were merged from original data sheets (GenID) or spatial operations.
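The maximum-Q assignment rule can be sketched in a few lines of Python. This is a hypothetical illustration rather than the authors' script: the sample IDs, group labels, and the tie-breaking behavior (first group wins) are assumptions.

```python
from typing import Dict, List


def assign_by_max_q(
    q_scores: Dict[str, List[float]], group_labels: List[str]
) -> Dict[str, str]:
    """Assign each individual to the group with its highest Q score.

    q_scores maps a sample ID to its membership coefficients, one per
    inferred cluster, in the same order as group_labels.
    Ties resolve to the first group with the maximum value.
    """
    assignments: Dict[str, str] = {}
    for sample, q in q_scores.items():
        best_index = max(range(len(q)), key=q.__getitem__)
        assignments[sample] = group_labels[best_index]
    return assignments


# Invented Q scores for two individuals and K = 2.
example_q = {"ind_A": [0.90, 0.10], "ind_B": [0.35, 0.65]}
groups = assign_by_max_q(example_q, ["group_1", "group_2"])
```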

watershed_characteristics.csv

Major watershed characteristics were first extracted from the NHD+ and spatial/Catchments_6509 layers (Name, Basin, AreaKM2). Watershed shape was calculated by dividing the total area by the square of the length of the longest flowpath in each watershed. Watershed slope was calculated as the mean of the NHD+ elevslope table values for streams > 100 km2 (presumably those undergoing the most dramatic geomorphic change). Percent forest and urban land use were extracted from the National Land Cover Database 2019 dataset. Site mean distances (SiteMD) were calculated using the R package "riverdist" and the layers spatial/Streams_6509 and spatial/Fundulus_Sites_6509.
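The shape calculation stated above (total area divided by the square of the longest flowpath length) is shown below as a short Python sketch. The function name and the input values are hypothetical, and the sketch assumes area and length are expressed in matching units (km2 and km). It implements only the ratio as described in this paragraph and applies no further rescaling.

```python
def watershed_shape(area_km2: float, longest_flowpath_km: float) -> float:
    """Watershed area divided by the squared longest flowpath length.

    Both inputs must use matching units (here km2 and km), which makes
    the result dimensionless.
    """
    if longest_flowpath_km <= 0:
        raise ValueError("longest_flowpath_km must be positive")
    return area_km2 / longest_flowpath_km ** 2


# Invented example: a 500 km2 watershed with a 50 km longest flowpath.
shape_example = watershed_shape(500.0, 50.0)
```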

File Details

All spatial files are collections of *.cpg, *.dbf, *.prj, *.qmd, *.shp, and *.shx files.

012/012_RX_Y.csv

Number of variables: variable, matches number of loci at the geographic filtering scale.
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable list: (numeric) genotypes in an 012 format
Values:

0: Homozygous for common allele
1: Heterozygous
2: Homozygous for rare allele

Data type: numeric
Missing data value: NA

afd/X_Y_afd_pops.csv

Number of variables: variable, matches number of sample localities at the geographic filtering scale.
Number of rows: variable, matches number of sample localities at the geographic filtering scale.
Variable list: (numeric) Allele frequency difference (afd) values
Data type: numeric
Missing data value: NA
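
As an aid to interpreting these values, the following Python sketch computes an allele frequency difference between two populations from 012 genotype rows. For biallelic loci, AFD at a locus reduces to the absolute difference in the frequency of one allele between the two populations. This is an illustration only: the authors' AFD function lives in vcf_funcs.R and may handle missing data or averaging across loci differently. The function names and example genotypes are hypothetical.

```python
from typing import List, Optional

Genotype = Optional[int]  # 0, 1, 2 copies of the rare allele, or None (NA)


def allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the rare allele at one locus in one population.

    Missing calls are ignored. A locus with no calls at all would raise
    ZeroDivisionError; that case is not handled in this sketch.
    """
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called))


def afd(pop_a: List[List[Genotype]], pop_b: List[List[Genotype]]) -> float:
    """Mean per-locus absolute allele frequency difference.

    Each population is a list of individuals; each individual is a list
    of 012 genotypes with loci in the same order in both populations.
    """
    n_loci = len(pop_a[0])
    diffs = []
    for j in range(n_loci):
        freq_a = allele_frequency([ind[j] for ind in pop_a])
        freq_b = allele_frequency([ind[j] for ind in pop_b])
        diffs.append(abs(freq_a - freq_b))
    return sum(diffs) / n_loci


# Invented genotypes: two individuals per population, two loci.
pop_1 = [[0, 2], [0, 2]]
pop_2 = [[2, 2], [2, None]]
afd_example = afd(pop_1, pop_2)
```

In this example the populations are fixed for different alleles at the first locus (difference of 1) and identical at the second (difference of 0), giving a mean of 0.5.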

fst/X_Y_fst_pops.csv

Number of variables: variable, matches number of sample localities at the geographic filtering scale.
Number of rows: variable, matches number of sample localities at the geographic filtering scale.
Variable list: (numeric) Fixation index (F[ST]) values
Data type: numeric
Missing data value: NA

het/X_Y_het_ind.csv

Number of variables: 2
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable list:

sample: (alphanumeric) The sample ID for the individual sequenced.
Het: (numeric) The mean heterozygosity of all loci filtered at the geographic filtering scale.

Data type: alphanumeric, numeric
Missing data value: NA

spatial/Catchments_6509

Projection: EPSG 6509
Units: Meters
Extent: 578703.7234256960218772,95091.0202181703352835 : 850164.2521547474898398,302126.0561933108838275
Geometry: Polygon (Multipolygon)
Variable count: 3
Feature count: 17
Fields:

Name: (character) A two-character unique identifier for the watershed.
Basin: (character) A three-character unique identifier for the biogeographic region/basin as defined by Ross et al. (2001)
Area: (numeric) The area (meters^2) of the watershed.
Data type: character, numeric
Missing data value: NA

spatial/Channel_Centers_95_98_6509

Projection: EPSG 6509
Units: Meters
Extent: 588736.4942081926856190,96529.2829136248328723 : 839064.0288938571466133,292907.0399084257078357
Geometry: Line (MultiLineString)
Variable count: 4
Feature count: 567
Fields:

fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
RKM: (numeric) The river kilometer, from the downstream base of the watershed.
LengthM: (numeric) The length of the channel centerline feature (m).
Year: (numeric) The aerial imagery year of the observation.
Data type: numeric
Missing data value: NA

spatial/Channel_Centers_19_20_6509

Projection: EPSG 6509
Units: Meters
Extent: 588736.0058691384037957,96531.7282540517044254 : 839071.2701355047756806,292897.4984815907664597
Geometry: Line (MultiLineString)
Variable count: 4
Feature count: 567
Fields:

fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
RKM: (numeric) The river kilometer, from the downstream base of the watershed.
LengthM: (numeric) The length of the channel centerline feature (m).
Year: (numeric) The aerial imagery year of the observation.
Data type: numeric
Missing data value: NA

spatial/Channel_Widths_95_98_6509

Projection: EPSG 6509
Units: Meters
Extent: 588899.9480761414160952,96524.7174289985414362 : 839010.5488149614538997,292899.4530549644259736
Geometry: Line (MultiLineString)
Variable count: 6
Feature count: 2834
Fields:

TID: (numeric) A unique identifier for each cross-channel transect.
fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
KM: (numeric) The river kilometer of the transect, from the downstream base of the watershed.
Year: (numeric) The aerial imagery year of the observation.
LengthM: (numeric) The length of the transect (m) spanning the unvegetated channel.
RKM: (numeric) The river kilometer, from the downstream base of the watershed.
Data type: numeric
Missing data value: NA

spatial/Channel_Widths_19_20_6509

Projection: EPSG 6509
Units: Meters
Extent: 588886.5448370444355533,96527.9845805226650555 : 839011.0730742360465229,292902.8447326631867327
Geometry: Line (MultiLineString)
Variable count: 6
Feature count: 2830
Fields:

TID: (numeric) A unique identifier for each cross-channel transect.
fid: (numeric) A unique identifier for the watershed, generated during GIS calculations.
KM: (numeric) The river kilometer of the transect, from the downstream base of the watershed.
Year: (numeric) The aerial imagery year of the observation.
LengthM: (numeric) The length of the transect (m) spanning the unvegetated channel.
RKM: (numeric) The river kilometer, from the downstream base of the watershed.
Data type: numeric
Missing data value: NA

spatial/Fundulus_Sites_6509

Projection: EPSG 6509
Units: Meters
Extent: 587449.5703422533115372,105346.8814515384001425 : 838477.7530437646200880,292878.6534213094273582
Geometry: Point (Point)
Variable count: 8
Feature count: 88
Fields:

pop: (alphanumeric) A unique site identifier.
Lat: (numeric) The latitude in decimal degrees, NAD83.
Lon: (numeric) The longitude in decimal degrees, NAD83.
Stream: (character) The name of the stream where the site is located.
Locale: (character) A text string specifying the location of the site relative to local landmarks.
System: (character) the watershed/system containing the site.
GenID: (character) A string specifying all GenID values (sample collection identifiers) associated with the site.
nFolivaceu: (numeric) The number of Fundulus olivaceus collected at the locality.
Data type: alphanumeric, character, numeric
Missing data value: NA

spatial/Streams_6509

Projection: EPSG 6509
Units: Meters
Extent: 580346.8646270645549521,95384.0065049122058554 : 848940.3462441070005298,301512.6286104589817114
Geometry: Line (MultiLineStringZM)
Variable count: 7
Feature count: 12580
Fields:

COMID: (numeric) The NHD+ V2 common identifier for the stream segment.
SSO: (numeric) The Strahler Stream Order for the stream segment.
TDA: (numeric) The total drainage area (km^2) upstream of the downstream end of the stream segment.
DDA: (numeric) The adjusted drainage area (km^2) upstream of the downstream end of the stream segment.
Name: (character) A two-character unique identifier for the watershed containing the stream segment.
Basin: (character) A three-character unique identifier for the biogeographic region/basin containing the stream segment, as defined in Ross et al. (2001).
Data type: character, numeric
Missing data value: NA

structure/structure_RX_Y.str

Number of variables: variable, matches number of loci at the geographic filtering scale.
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable values:

Description: genotypes in structure format (two digit, each specifying one allele at a biallelic locus)
Values: (numeric) 1, 2, 3, 4
Data type: numeric
Missing data value: 9

structure_post/RX_Y/RX_Y

These files contain detailed and default outputs from structure analysis used in selection of K (number of populations). File structure varies between runs and is set by structure. However, files typically contain the command line argument used to run the analysis, basic run parameters, a table of inferred proportional membership at the level of K selected, allele frequency divergence estimates among populations, a table of inferred ancestry by individual, and an extensive list of allele frequency estimates at each locus.

summaries/output_RX_Y.txt

These files contain automated outputs from VCFtools following output of a VCF file. Files are structured as blocks of text and are retained for end users. Files list number of individuals and sites kept after filtering.

summaries/summary_RX_Y.csv

Number of variables: 4
Number of rows: variable, matches number of individuals at the geographic filtering scale.
Variable list:

sample: (alphanumeric) The sample ID for the individual sequenced.
missing: (numeric) The proportion (0-1) of missing loci for the individual.
heterozygosity: (numeric) The mean heterozygosity for the individual.
meandepth: (numeric) The mean read depth for the individual across all loci at the geographic filtering scale.

Data type: alphanumeric, numeric
Missing data value: NA

fid_Master_Index.csv

Number of variables: 8
Number of rows: 2844
Variable list:

fid: (numeric) A unique numeric identifier code assigned to watersheds during GIS processing.
COMID: (numeric) The NHD+ V2 common identifier for the stream segment.
TDA: (numeric) The total drainage area (km^2) for the stream segment.
DDA: (numeric) The adjusted drainage area (km^2) for the stream segment.
RKM: (numeric) The river kilometer for the measurements taken during GIS analyses.
Name: (character) A two-character identifier for the watershed.
Basin: (character) A three-character identifier for the biogeographic division/basin as defined in Ross et al. (2001).

Data type: numeric, character
Missing data value: NA

stearmanallproductionvcf20221222.vcf.gz

Number of metadata rows: 10
Number of header rows: 11
Number of variables: 730
Number of rows: 158087
Variable list:

CHROM: (numeric) The chromosome for the locus.
POS: (numeric) The position of the locus.
ID: (alphanumeric) A unique identifier for the locus.
REF: (character, A, C, G, T) The reference allele(s) for the locus.
ALT: (character, A, C, G, T) Alternate allele(s) for the locus.
QUAL: (numeric) Phred-scaled quality score for the variant call.
FILTER: (character) PASS if the locus passed filtration; otherwise an indicator of quality of variant call.
INFO: (character) Additional information
FORMAT: (character) A character string specifying the format of the calls.
Variables 10-730 are read calls for each individual (721 sample columns).
Data type: alphanumeric, character, numeric
Missing data value: blank (default from structure and vcftools)
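
To show how the columns listed above fit together, here is a minimal Python sketch that splits one VCF data line into its nine fixed fields and the per-individual calls, then converts each GT call to a count of alternate alleles. It is not the authors' workflow, which uses the R package "vcfR". The example line, sample values, and function names are invented. Note also that this count is relative to the reference allele, whereas the 012 matrices in this dataset are described relative to the common allele.

```python
from typing import Dict, List, Optional, Tuple

FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line: str) -> Tuple[Dict[str, str], List[str]]:
    """Split a tab-delimited VCF data line into fixed fields and GT calls."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_FIELDS, fields[:9]))
    # FORMAT names the colon-separated subfields of every sample column.
    gt_index = record["FORMAT"].split(":").index("GT")
    calls = [sample.split(":")[gt_index] for sample in fields[9:]]
    return record, calls


def gt_to_alt_count(gt: str) -> Optional[int]:
    """Convert a diploid GT string such as 0/1 to the number of
    non-reference alleles; return None when any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(allele != "0" for allele in alleles)


# Invented data line with four individuals.
example_line = (
    "1\t1234\tS1_1234\tA\tG\t.\tPASS\t.\tGT:AD:DP\t"
    "0/0:5,0:5\t0/1:3,2:5\t./.:0,0:0\t1/1:0,4:4"
)
record, calls = parse_vcf_line(example_line)
alt_counts = [gt_to_alt_count(gt) for gt in calls]
```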

Structure_k_assignment.csv

Number of variables: 2
Number of rows: 31
Variable list:

Run_Group: The geographic grouping for a given level of structure analysis.
K: The recovered number of populations for the Run_Group at a given level of structure analysis.

Data type: character, numeric
Missing data value: NA

vcf_run_groups.csv

Number of variables: 8
Number of rows: 722
Variable list:

sample: (alphanumeric) The sample ID for the individual sequenced.
R1: (character) The run group to which an individual was assigned during structure analysis for the first level of analyses.
R2: (character) The run group to which an individual was assigned during structure analysis for the second level of analyses.
R3: (character) The run group to which an individual was assigned during structure analysis for the third level of analyses.
R4: (character) The run group to which an individual was assigned during structure analysis for the fourth level of analyses.
WS: (character) A two-character identifier for the watershed in which an individual was collected.
GenID: (alphanumeric) A unique sample identifier linked to both spatial location and date of sample.
pop: (alphanumeric) A unique site identifier.

Data type: alphanumeric, character, numeric
Missing data value: NA

watershed_characteristics.csv

Number of variables: 8
Number of rows: 18
Variable list:

Name: (character) A two-character unique identifier for the watershed.
Basin: (character) A three-character unique identifier for the biogeographic region/basin as defined in Ross et al. (2001).
AreaKM2: (numeric) The drainage area in square kilometers for the entire watershed.
Shape: (numeric) A shape metric (0 = perfectly fan shaped, 1 = perfectly linear) for the watershed.
Slope: (numeric) The slope (m/m) of the watershed.
PerFor: (numeric) The proportion of forested land cover in the watershed.
PerUrb: (numeric) The proportion of urban land cover in the watershed.
SiteMD: (numeric) The mean distance (m) from each site to every other site in the watershed.

Data type: character, numeric
Missing data value: NA
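Because missing values are coded as NA, the numeric columns need explicit conversion when the file is read outside R. The following Python sketch shows one way to do this, assuming a comma-separated file whose header row uses the variable names above. The function name and all demo values are hypothetical.

```python
import csv
import io

# Columns described as numeric in the variable list above.
NUMERIC_COLUMNS = ["AreaKM2", "Shape", "Slope", "PerFor", "PerUrb", "SiteMD"]


def read_watersheds(handle):
    """Read watershed rows, converting numeric columns to float and NA to None."""
    rows = []
    for row in csv.DictReader(handle):
        for col in NUMERIC_COLUMNS:
            value = row[col]
            row[col] = None if value in ("", "NA") else float(value)
        rows.append(row)
    return rows


# In-memory stand-in for watershed_characteristics.csv; all values are made up.
demo = io.StringIO(
    "Name,Basin,AreaKM2,Shape,Slope,PerFor,PerUrb,SiteMD\n"
    "XX,AAA,250.5,0.4,0.001,0.6,0.05,12000\n"
    "YY,BBB,NA,0.7,0.002,0.5,0.10,NA\n"
)
for watershed in read_watersheds(demo):
    print(watershed["Name"], watershed["AreaKM2"], watershed["SiteMD"])
```

Keeping NA as `None` rather than zero avoids silently biasing any summary computed across watersheds.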

Code/Software

analysis_script.R

This file conducts the majority of analyses used in the manuscript. It does not process vcf files, and does not create 012 matrices, afd files, fst files, heterozygosity files, structure input files, or the vcf summary files.

Language and Environment

R Environment for Statistical Computing

Version

R 4.1.2

Dependencies

BiodiversityR
car
corrplot
DescTools
emmeans
gridExtra
mgcv
multcomp
multcompView
plotrix
pophelper
remotes
sf
vegan

structure_pre_processor.R

This file processes the raw vcf file (stearmanallproductionvcf20221222.vcf.gz) to create 012 matrices, afd files, fst files, heterozygosity files, structure input files, and the vcf summary files. This file must be run in a Linux environment.

Language and Environment

R Environment for Statistical Computing

Version

R 4.1.2

Dependencies

adegenet
DescTools
tidyverse
vcfR

vcf_funcs.R

This file contains functions required by the file structure_pre_processor.R.

Language and Environment

R Environment for Statistical Computing

Version

R 4.1.2

Dependencies

DescTools
tidyverse
vcfR
