Bayou Darter next-gen sequence data and analyses
Data files
Mar 21, 2025 version files (7.85 MB)
- BD_Dryad.zip (7.84 MB)
- README.md (17.19 KB)
Abstract
The Bayou Darter (Nothonotus rubrus) is a small benthic fish endemic to Bayou Pierre, Mississippi. The entire range of the species is threatened by geomorphic change due to accelerated erosion, channel incision, and channel evolution processes stemming from the migration of the Mississippi River and human alterations to the landscape. Bayou Darters have disappeared from much of mainstem Bayou Pierre but have expanded their range upstream into formerly unoccupied tributaries, in concert with the upstream progression of channel evolution processes. In this dataset, we document an effort to understand whether upstream dispersal into tributaries is affecting gene flow through fragmentation of populations. We used Next-Gen sequencing approaches to develop high-resolution molecular datasets to assess population structure, patterns of heterozygosity, and effective population size throughout the contemporary distribution of the species.
https://doi.org/10.5061/dryad.h70rxwdtm
Principal Investigator Contact Information
Name: Loren Stearman
Institution: University of Southern Mississippi
Email: Loren.Stearman@gmail.com
Alternate Contact Information
Name: Jake Schaefer
Institution: University of Southern Mississippi
Email: Jake.Schaefer@usm.edu
Name: Brian Kreiser
Institution: University of Southern Mississippi
Email: Brian.Kreiser@usm.edu
Dataset Overview
This dataset contains the data and code required to replicate analyses in Stearman et al. (in review) examining spatial patterns of population genetic structure in the Bayou Darter (Nothonotus rubrus). The data cover 19 sites in the Bayou Pierre watershed, Mississippi. Genetic data include a dataset of single nucleotide polymorphisms (SNPs) and various processed results from this dataset to facilitate analyses for end users. Additional geospatial data are included for analysis of isolation by distance and to facilitate data visualization. Analysis of population structure found evidence for three genetically distinct groups, one of which is isolated to the extreme headwaters of the system, and another of which is rare but widespread throughout the lower watershed. Analysis of heterozygosity found low heterozygosity in the White Oak Creek system, but otherwise no appreciable differences throughout the watershed. Analysis of effective population size suggested higher effective population size in upstream than in downstream areas.
Dates of Data Collection
A. Fish tissue collections: 2021-2022
B. Genetic sequencing data: 2021-2023
C. Geospatial features: 2021
Data Spatial Scope
Data were collected at 19 localities in mainstem and upper Bayou Pierre, White Oak Creek, Tallahalla Creek, Foster Creek, and Turkey Creek, Mississippi (all tributaries to Bayou Pierre). Latitudes and longitudes for tissue collection localities are available in the shapefile spatial/sites_6509.shp.
Funding
This research was funded by a grant from the U.S. Fish and Wildlife Service (F21AC02855). Analysis and writing were partially supported by a grant from the U.S. Army Corps of Engineers (contract W912HZ21C0064).
Ethics Approval
Tissue collections were conducted under IACUC protocol 15102701.1, granted by the University of Southern Mississippi. Collection activities were performed under collection permit numbers 031191 and 0311201, granted by the Mississippi Department of Wildlife, Fisheries, and Parks.
Sharing/Access information
This work is licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license.
Files analysis_funcs.R, analysis_script.R, vcf_filter_script.R, and vcf_funcs.R are licensed under the GNU GPLv3.
Related Data Sources
Stearman, L. W., B. R. Kreiser, and J. F. Schaefer. In Review. Unexpected population genetic structure in Bayou Darter (Nothonotus rubrus) in the geomorphically dynamic Bayou Pierre, Mississippi. Endangered Species Research.
Data Sources
Genetic data were derived from fish tissue samples collected by the authors from 2021 to 2022. Hydrographic features and associated attributes used in determining values in some of the data were derived from the National Hydrography Dataset Plus, version 2 (NHD+ v2).
Recommended Citation
Stearman, Loren W., Brian R. Kreiser, and Jake F. Schaefer. Bayou Darter Next-Gen sequencing data and analyses.
References (in this ReadMe)
Bradbury, P. J., Z. Zhang, D. E. Kroon, T. M. Casstevens, Y. Ramdoss, and E. S. Buckler. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633–2635.
Danecek, P., A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, and 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics 27:2156–2158.
Elshire, R. J., J. C. Glaubitz, Q. Sun, J. A. Poland, K. Kawamoto, E. S. Buckler, and S. E. Mitchell. 2011. A robust, simple Genotyping-by-Sequencing (GBS) approach for high diversity species. PLOS ONE 6:1–10.
Evanno, G., S. Regnaut, and J. Goudet. 2005. Detecting the number of clusters of individuals using the software structure: a simulation study. Molecular Ecology 14:2611–2620.
Glaubitz, J. C., T. M. Casstevens, F. Lu, J. Harriman, R. J. Elshire, Q. Sun, and E. S. Buckler. 2014. TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline. PLOS ONE 9:e90346.
Langmead, B., and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9:357–359.
Description of the data and file structure
Files are downloaded in the zip file BD_Dryad.zip to preserve the directory structure for analyses.
Some files in this repository can only be generated on Linux. The scripts to do so are provided, along with their outputs, so access to a machine with a Linux operating system is not required to replicate our analyses. Missing data codes vary by file (NA, 9, 0000, or blank) and are listed for each file under File structure. Files with common descriptions, structure, or methods are grouped below.
File and folder descriptions
spatial/Bayou_Pierre_6509_TDAgt_47
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains line features for streams in Bayou Pierre, derived from the NHD+ v2 dataset. Spatial reference is EPSG 6509.
spatial/sites_6509
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains point features for sample localities. Spatial reference is EPSG 6509.
spatial/sites_6509_adj
This set of files (.cpg, .dbf, .qmd, .shp, .shx) contains point features for sample localities, adjusted to facilitate visualization when plotting data. Spatial reference is EPSG 6509.
structure/r2_114_845/BD.zip
This zip file contains a subdirectory (BD/) with files representing outputs from STRUCTURE analysis. File nomenclature is */BD/BD_K_ii_f, where K indicates number of groups used in STRUCTURE analysis, and ii indicates the replicate run (1 - 10). Example: */BD/BD_2_05_f is the 5th replicate run at K = 2.
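The naming scheme above can be unpacked programmatically when collating replicate runs. The following is a minimal Python sketch, offered only as an illustration (the archive's own scripts are written in R); the function name `parse_structure_name` is hypothetical, and the pattern assumes every output file follows the `BD_K_ii_f` form exactly.

```python
import re

# Pattern for names such as "BD_2_05_f": prefix, K, replicate number, suffix.
# Assumption: no other prefixes or suffixes occur in the BD/ subdirectory.
NAME_RE = re.compile(r"^BD_(\d+)_(\d+)_f$")


def parse_structure_name(name):
    """Return (K, replicate) for a STRUCTURE output file name, or None."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_structure_name("BD_2_05_f"))  # (2, 5)
```

Names that do not match the pattern return `None`, so unrelated files in the same directory can be skipped rather than raising an error.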
base_meta.csv
This file contains the basic metadata for individuals analyzed.
bayoudarterproductionvcf20230305.vcf.gz
This file contains the results of sequencing and SNP calling of Bayou Darter tissues. This is the main data file. Data in this file have not yet been filtered via vcf scripts.
BD_114_1969_Regional_Ne.genepop
This file contains a filtered subset of individuals and loci from bayoudarterproductionvcf20230305.vcf.gz, for use in effective population size (Ne) analyses. Individuals are grouped into geographically proximate pools (n = 5, see base_meta.csv for groupings) as no sample site had sufficient individuals on its own for Ne analyses.
BD_114_845.structure
This file contains a filtered subset of individuals and loci from bayoudarterproductionvcf20230305.vcf.gz, for use in STRUCTURE and associated analyses.
BD012_114_845.csv
This file contains a genotype matrix (1 = heterozygous) for loci and individuals filtered for STRUCTURE and other population genetics analyses.
pop_order.csv
This file contains information necessary to order populations in data visualization.
Methods
spatial/Bayou_Pierre_6509_TDAgt_47
These files were created by selecting streams within the Bayou Pierre watershed and exporting them from the NHD+ v2 dataset. Stream line features were then merged with the value-added attributes table to add Strahler Stream Order, total drainage area (km^2), and adjusted drainage area (km^2). The line features were then filtered to include only streams with a total drainage area > 47 km^2.
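The drainage-area filter described above amounts to a simple threshold on the TDAKM2 field. As a rough Python illustration (not the authors' GIS workflow), the sketch below applies that threshold to plain dictionaries standing in for stream segments; the COMID and TDAKM2 values are made up, and `keep_large_streams` is a hypothetical helper name.

```python
THRESHOLD_KM2 = 47.0  # total drainage area cutoff stated in the text


def keep_large_streams(segments, threshold=THRESHOLD_KM2):
    """Keep segments whose total drainage area (TDAKM2) exceeds the threshold.

    Segments with a missing TDAKM2 value (None) are dropped.
    """
    return [
        seg for seg in segments
        if seg["TDAKM2"] is not None and seg["TDAKM2"] > threshold
    ]


# Hypothetical records; real values come from the shapefile attribute table.
segments = [
    {"COMID": 1, "TDAKM2": 12.5},
    {"COMID": 2, "TDAKM2": 47.0},
    {"COMID": 3, "TDAKM2": 310.2},
]
print([seg["COMID"] for seg in keep_large_streams(segments)])  # [3]
```

The comparison is strict, so a segment with exactly 47 km^2 is excluded, matching the "> 47 km^2" wording.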
spatial/sites_6509
Sample sites were selected based on historical records for Bayou Darters. Site locations, latitude, and longitude were recorded during sampling events and digitized into shapefile format.
spatial/sites_6509_adj
These points were extracted from spatial/sites_6509 and manually adjusted to facilitate visualization in data analyses.
structure/r2_114_845/BD.zip
STRUCTURE output files were generated during runs of the program STRUCTURE. These required the input file structure/BD_114_845.structure.
base_meta.csv
This file was created following completion of sampling efforts. Individuals were assigned a sample number in the field with the format BYDXX-YY, where XX indicated the numerical order of the sample and YY indicated the numerical order of the individual in that sample (e.g., BYD02-05 was the fifth fish at the second sample). The first portion of this alphanumeric identifier (BYDXX) was extracted as the sample event ID, called GenID. Some sample sites required multiple visits and thus have more than one GenID value assigned to them. Sites (referred to as Pop in files to facilitate analyses) were given a different alphanumeric code, BDZZ, where ZZ indicated the site order from east to west. As some analyses required larger sample sizes than were possible with field tissue collections, we grouped sites into geographic groupings manually.
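The identifier scheme above can be split mechanically. The Python sketch below is an illustration only (the archive's scripts are in R); it assumes every sample ID has the exact form BYDXX-YY, and the function name `split_sample_id` is hypothetical.

```python
def split_sample_id(sample):
    """Split a field sample ID of the form 'BYDXX-YY'.

    Returns (GenID, individual number), e.g. 'BYD02-05' -> ('BYD02', 5).
    """
    gen_id, sep, individual = sample.partition("-")
    if not sep or not individual.isdigit():
        raise ValueError(f"unexpected sample ID: {sample!r}")
    return gen_id, int(individual)


gen_id, individual = split_sample_id("BYD02-05")
print(gen_id, individual)  # BYD02 5
```

For IDs of this form, the returned GenID equals the first five characters of the sample ID, which is how the GenID column is described in base_meta.csv.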
bayoudarterproductionvcf20230305.vcf.gz
This file was created as a product of genetic tissue sequencing. DNA was extracted from tissues using DNeasy blood and tissue kits (Qiagen) and sequenced by Genotyping-by-Sequencing (GBS; Elshire et al. 2011) to obtain single nucleotide polymorphisms (SNPs). The EcoT22I restriction enzyme was used to digest DNA prior to PCR amplification. Individuals were sequenced on an Illumina HiSeq platform. Fragments were aligned to a reference Etheostoma spectabile genome using Bowtie 2 (Langmead and Salzberg 2012). We genotyped reads with TASSEL (Bradbury et al. 2007), the results of which were exported to variant call format (vcf) files stored in this archive.
BD_114_1969_Regional_Ne.genepop
This file was created during the vcf filtering process. Individuals were filtered with a set of constraints similar to those for structure/BD_114_845.structure and BD012_114_845.csv, except that restrictions for minimum allele frequency and minimum distance between loci were removed.
BD_114_845.structure
STRUCTURE input files were generated following the vcf filtering process with the function vcf_structure.
BD012_114_845.csv
This file was created during the vcf filtering process. Individuals were filtered with the same constraints as for structure/BD_114_845.structure. Genotype assignment and construction of the 012 matrix were done with the function vcf012.
pop_order.csv
This file was created by manually assigning order values to sites/populations during analysis of the data.
File structure
spatial/Bayou_Pierre_6509_TDAgt_47
Projection: EPSG 6509
Units: Meters
Extent: 618036.4216703347628936,242701.0549141092051286 : 693348.0119855772936717,294369.4974200578872114
Geometry: Line (MultiLineStringZM)
Variable count: 5
Feature count: 335
Fields:
- COMID: (integer) the integer unique identifier used by NHD+ for the segment
- GNIS_NAME: (character) The Geographic Names Information System name for the segment
- SSO: (numeric) the Strahler Stream Order for the segment
- TDAKM2: (numeric) the total drainage area in square kilometers
- ADAKM2: (numeric) the adjusted drainage area in square kilometers
Data types: integer, character, numeric
Missing data value: NA
spatial/sites_6509
Projection: EPSG 6509
Units: Meters
Extent: 657410.8875029255868867,247586.2972333538637031 : 688359.1390572980744764,288323.4269915075274184
Geometry: Point (Point)
Variable count: 5
Feature count: 19
Fields:
- Lat: (numeric) The latitude in decimal degrees
- Lon: (numeric) The longitude in decimal degrees
- Stream: (character) The name of the stream
- Locale: (character) Locality information for the site
- Pop: (character) The assigned population number
Data types: character, numeric
Missing data value: NA
spatial/sites_6509_adj
Projection: EPSG 6509
Units: Meters
Extent: 657410.8875029255868867,245968.0946750311995856 : 692714.7546822886215523,288323.4269915075274184
Geometry: Point (Point)
Variable count: 1
Feature count: 19
Fields:
- Pop: (character) The assigned population number
Data types: character
Missing data value: NA
structure/r2_114_845/BD.zip
Files in this zip directory contain the command line argument used to run the analysis, basic run parameters, a table of inferred proportional membership at the level of K selected, allele frequency divergence estimates among populations, a table of inferred ancestry by individual, and an extensive list of allele frequency estimates at each locus.
base_meta.csv
Number of variables: 4
Number of rows: 114
Variable list:
- sample: (character) The formal individual sequence ID, matches the values in the vcf file
- GenID: (character) The GenID (individual field sample event) identifier, matches the first five characters of the sample ID
- Pop: (character) The site ID, matches site IDs in spatial/sites_6509
- Group: (character) The geographically proximate groups individuals were pooled into for Ne analyses due to insufficient sample size at any site for Ne analysis
Data type: character
Missing data value: NA
bayoudarterproductionvcf20230305.vcf.gz
Number of metadata rows: 10
Number of header rows: 11
Number of variables: 135
Number of rows: 13075
Variable list:
- CHROM: (numeric) The chromosome for the locus.
- POS: (numeric) The position of the locus.
- ID: (alphanumeric) A unique identifier for the locus.
- REF: (character, A, C, G, T) The reference allele(s) for the locus.
- ALT: (character, A, C, G, T) Alternate allele(s) for the locus.
- QUAL: (numeric) Phred-scaled quality score for the variant call.
- FILTER: (character) PASS if the locus passed filtration; otherwise an indicator of quality of variant call.
- INFO: (character) Additional information
- FORMAT: (character) A character string specifying the format of the calls.
- Variables 10-135: calls for each individual, formatted as specified in FORMAT.
Data type: alphanumeric, character, numeric
Missing data value: blank (default from structure and vcftools)
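Because the first nine variables are fixed VCF fields and the remaining variables are individuals, the individual IDs can be read directly from the `#CHROM` header line. The Python sketch below illustrates this with a tiny in-memory stand-in for the real file; the sample IDs, locus ID, and file-format line in the demo are hypothetical, and `sample_names` is an illustrative helper rather than part of the archive.

```python
import gzip
import io

FIXED_COLUMNS = 9  # CHROM through FORMAT


def sample_names(handle):
    """Return the individual IDs from the '#CHROM' header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            return line.rstrip("\n").split("\t")[FIXED_COLUMNS:]
    raise ValueError("no #CHROM header line found")


# Tiny in-memory stand-in for the real file (all values are hypothetical).
demo = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tBYD01-01\tBYD01-02\n"
    "1\t100\tS1_100\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n"
)
print(sample_names(demo))  # ['BYD01-01', 'BYD01-02']

# With the archive file, the same function could be applied as follows:
# with gzip.open("bayoudarterproductionvcf20230305.vcf.gz", "rt") as fh:
#     print(len(sample_names(fh)))
```

If variables 10-135 hold the individuals as listed above, the real file should yield 126 names.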
BD_114_845.structure
Number of variables: 845
Number of rows: 114
Variable values:
- Description: genotypes in structure format (two digits, each specifying one allele at a biallelic locus)
- Values: (numeric) 1, 2, 3, 4
Data type: numeric
Missing data value: 9
BD_114_1969_Regional_Ne.genepop
Row 1: (character) title row
Rows 2-1970: (character) locus names
Row 1971: (character) group 1 identifier
Rows 1972 - 1990: (character) genotypes for individuals in group 1
Row 1991: (character) group 2 identifier
Rows 1992 - 2019: (character) genotypes for individuals in group 2
Row 2020: (character) group 3 identifier
Rows 2021 - 2041: (character) genotypes for individuals in group 3
Row 2042: (character) group 4 identifier
Rows 2043 - 2063: (character) genotypes for individuals in group 4
Row 2064: (character) group 5 identifier
Rows 2065 - 2089: (character) genotypes for individuals in group 5
Data type: character
Genotype format: 0101/0102/0201/0202/0000, where 01 and 02 are alleles, and 0000 indicates missing data
Missing data value: 0000
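The four-digit genotype codes above can be decoded by splitting each code into two two-digit alleles. The following Python sketch is illustrative only (it is not the authors' code, and the function names are hypothetical); it treats 0000 as missing, as stated above.

```python
def decode_genotype(code):
    """Split a genepop code such as '0102' into its two alleles.

    Returns a tuple like ('01', '02'), or None for the missing code '0000'.
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"unexpected genotype code: {code!r}")
    if code == "0000":
        return None
    return code[:2], code[2:]


def is_heterozygous(code):
    """Return True/False for a called genotype, or None if missing."""
    alleles = decode_genotype(code)
    if alleles is None:
        return None
    return alleles[0] != alleles[1]


for code in ["0101", "0102", "0201", "0202", "0000"]:
    print(code, decode_genotype(code), is_heterozygous(code))
```

Both 0102 and 0201 decode as heterozygous, since only the order of the two alleles differs.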
BD012_114_845.csv
Number of variables: 845
Number of rows: 114
Variable list: (numeric) genotypes in an 012 format
Values:
- 0: Homozygous for common allele
- 1: Heterozygous
- 2: Homozygous for rare allele
Data type: numeric
Missing data value: NA
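With the 012 coding, observed heterozygosity for an individual or a locus is the share of 1s among non-missing calls. The Python sketch below illustrates this on a made-up row; it is a simplified illustration and not the procedure used in analysis_script.R, and it assumes values arrive as strings, as they would when read from a CSV.

```python
def observed_heterozygosity(genotypes):
    """Proportion of heterozygous (1) calls among non-missing genotypes.

    Values are strings ('0', '1', '2', or 'NA'). Returns None if every
    value is missing.
    """
    called = [g for g in genotypes if g != "NA"]
    if not called:
        return None
    return sum(1 for g in called if g == "1") / len(called)


row = ["0", "1", "NA", "2", "1"]  # hypothetical genotypes, not real data
print(observed_heterozygosity(row))  # 0.5
```

Missing values are excluded from the denominator, so the example row has two heterozygous calls out of four called genotypes.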
pop_order.csv
Number of variables: 2
Number of rows: 19
Variable list:
- Pop: (character) The site or population in the analysis
- Order: (integer) The order of plotting for data visualization
Data type: character, integer
Missing data value: NA
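The Order column can be used to arrange sites before plotting. A minimal Python sketch follows, using only the two column names listed above (Pop and Order); the site codes and order values in the demo are hypothetical, and `ordered_pops` is an illustrative helper name.

```python
import csv
import io


def ordered_pops(handle):
    """Return site codes sorted by the integer Order column."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: int(row["Order"]))
    return [row["Pop"] for row in rows]


# In-memory stand-in for pop_order.csv (values are hypothetical).
demo = io.StringIO("Pop,Order\nBD03,2\nBD01,3\nBD02,1\n")
print(ordered_pops(demo))  # ['BD02', 'BD03', 'BD01']
```

Converting Order to an integer before sorting avoids the string ordering problem in which "10" would sort before "2".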
Code/Software
analysis_funcs.R
This file contains custom tools needed by the file analysis_script.R.
Language and Environment
R Environment for Statistical Computing
Version
R 4.1.2
Dependencies
- None
analysis_script.R
This file conducts the majority of analyses used in the manuscript. It does not create 012 matrices, structure files, or genepop files.
Language and Environment
R Environment for Statistical Computing
Version
R 4.1.2
Dependencies
- ade4
- gridExtra
- hierfstat
- plotrix
- pophelper
- riverdist
- sf
- vegan
vcf_filter_script.R
This script processes the raw vcf file (bayoudarterproductionvcf20230305.vcf.gz) to create structure input files, 012 matrices, and genepop files. This script must be run in a Linux environment.
Language and Environment
R Environment for Statistical Computing
Version
R 4.1.2
Dependencies
- adegenet
- DescTools
- tidyverse
- vcfR
vcf_funcs.R
This file contains functions required by the file vcf_filter_script.R.
Language and Environment
R Environment for Statistical Computing
Version
R 4.1.2
Dependencies
- DescTools
- tidyverse
- vcfR
We sampled Nothonotus rubrus from 19 sites representing the known contemporary distribution of the species. We targeted a minimum of 5 individuals per site. Tissue samples were collected as fin clips (~0.1 g) from individual N. rubrus. Tissue samples were stored in 95% EtOH on ice, and individuals were released alive at their locality of capture. We extracted DNA using DNeasy blood and tissue kits (Qiagen), constructed libraries with the EcoT22I restriction enzyme, and sequenced them on an Illumina HiSeq platform. We used Genotyping-by-Sequencing to identify SNPs and vcftools (Danecek et al. 2011) to filter SNPs and individuals for quality control. We analyzed data using multiple approaches, including STRUCTURE, Principal Coordinates Analysis, NeEstimator V2.1, and measurements of heterozygosity and pairwise Fst calculated from filtered SNP data.