Complications of estimating hatchery introgression in the face of rapid divergence: A case study in brook trout (Salvelinus fontinalis)
Data files
Nov 22, 2024 version files 559.14 MB
-
Erdman_PineRiver_DRYAD_SampleData_v2.xlsx
97.34 KB
-
Erdman_PineRiver_Filtered_OriginalIDS.vcf
461.84 MB
-
Erdman_PineRiver_Filtered_RelabeledIDS.gen
97.19 MB
-
README.md
4.45 KB
Abstract
Fish stocking has been utilized for over a century to offset extirpations or declines in abundance of many native species. These historical declines and hatchery contributions have led to uncertainty surrounding whether many contemporary populations are native, introgressed with hatchery sources, or entirely of hatchery origin. Such uncertainty is problematic for the conservation of native biodiversity as it hampers management agencies’ ability to prioritize the conservation of indigenous locally adapted populations. Fortunately, genetic and genomic tools have allowed researchers to investigate these questions, often through the use of clustering or assignment approaches that are predicated on identifiable and consistent divergence between native populations and hatchery sources. Here, we apply these methods to genomic data from 643 brook trout (Salvelinus fontinalis) originating from 13 wild populations and an exogenous hatchery strain to investigate the extent of historical extirpations, hatchery contributions, and processes affecting population structure in a small area of the previously unglaciated Driftless Area of Wisconsin, USA. Results from these analyses suggest that wild populations in this region are genetically distinct even at small spatial scales, lack hydrologically associated population structure, rarely exchange gene flow, and have small effective population sizes. Further, wild populations are substantially diverged from known hatchery strains and show minimal evidence of introgression in clustering analyses. However, we demonstrate through empirically informed simulations that distinct wild populations may potentially be hatchery-founded and have since diverged through rapid genetic drift. Collectively, the lack of hydrological population structure and potential for rapid drift in the Driftless Area suggest that many native populations may have been historically extirpated and refounded by stocking events, and that commonly used genomic clustering methods and their associated model selection criteria may result in underestimation of hatchery introgression in the face of rapid drift.
README: Complications of estimating hatchery introgression in the face of rapid divergence: A case study in brook trout (Salvelinus fontinalis)
https://doi.org/10.5061/dryad.rbnzs7hk8
These datasets contain sample and RAD sequence information for 672 brook trout samples from Wisconsin.
Description of the data and file structure
File Descriptions
Erdman_PineRiver_Filtered_OriginalIDs.vcf includes filtered genotypes for all loci and individuals passing filtering criteria (see methods for full parameter specifications. This files uses original, unique sample identifiers than can be cross referenced with demographic and collection information in Erdman_PineRiver_DRYAD_SampleData.xlsx.
*Erdman_PineRiver_Filtered_RelabeledIDs.gen *includes filtered genotypes for all loci and individuals passing filtering criteria (see methods for full parameter specifications. Note that these genotypes are the same as the VCF file listed above. However, this files renames samples to include the waterbody they originated from followed by a unique integer. This file is provided merely as a convenience for labeling plots in downstream analysis. As before, unique identifiers than can be cross referenced with original sample identifiers, demographic, and collection information in Erdman_PineRiver_DRYAD_SampleData.xlsx.
Erdman_PineRiver_DRYAD_SampleData.xlsx includes demographic and collection information for the samples included in this study. Columns are as follows:
Sample ID = Original sample identifier used in VCF genotype file
GenepopID = Modified sample identifier used in Genepop genotype file. NA indicates that this individual did not pass thresholds for missing data during the filtering process and is thus not present in genotype files.
Library = Library preparation the sample was included in (each library consisted of one 96-well plate)
IndexSequence = Index sequence used in demultiplexing libraries
Owner = Owner of tissue sample (MCGL = Molecular Conservation Genetics Laboratory, University of Wisconsin-Stevens Point)
ProjectID = Unique code assigned to this project
SampleType = Type of tissue used for DNA extraction and downstream processes
Container = Vessel the tissue sample is housed in
Species = Scientific name of organism
CommonName1 = Common name of organism
Collectors = Individuals that collected the tissue samples
Agency/Hatchery = Agencies or facilities that collected the tissue samples
DateCollected = Date of tissue collection. Dates are reported in month/day/year format, when available. Otherwise they are reported as season and year or simply year if more specific information was not recorded at the time of sampling.
Year = Year of tissue collection
Sex = Sex of individual. NA indicates that these data were not collected/reported.
Physical Mark = Whether the individual possessed a physical mark (fin clip) at time of capture. NA indicates that these data were not collected/reported.
MarkType = The type of mark individuals possessed, if any. NA indicates that these data were not collected/reported.
TotalLength = Logical whether reported length represents total length.
ForkLength = Logical whether reported length represents fork length.
StandardLength = Logical whether reported length represents standard length.
Length(mm) = Length of individual in millimeters. NA indicates that these data were not collected/reported.
WaterbodyName = Name of waterbody where individual was collected.
County = Name of county where individual was collected. NA indicates that these data were not collected/reported.
State = Name of state where individual was collected.
WaterbodyID(WBIC) = Unique waterbody identification code (WBIC) assigned to where an individual was collected. NA indicates that this is a fish hatchery and thus does not have a WBIC assigned to it.
Latitude = Latitude corresponding to waterbody where individual was collected. Units are in degrees north. NA indicates that this is a fish hatchery and coordinates are thus not representative of a wild population.
Longitude = Longitude corresponding to waterbody where individual was collected. Units are in degrees west. NA indicates that this is a fish hatchery and coordinates are thus not representative of a wild population.
Comments = Any additional comments on sample collected. NA indicates a lack of additional comments.
Methods
Sampling
Fin clips were collected from 13 wild brook trout populations in Wisconsin and the St. Croix Falls hatchery strain with a target sample size of 50 fish per population. All wild samples were collected via backpack electrofishing from 2015-2019, St. Croix Falls hatchery strain samples were collected from broodstock maintained at the St. Croix Falls State Fish Hatchery in 2019, and all fin clips were preserved in 95% ethanol prior to DNA extraction. Our primary objective for these collections was to examine putative fine scale population structure of seven populations originating from the Pine River watershed in the Driftless Area of Wisconsin. However, given the diverse stocking sources used in the Wisconsin, we also included four additional populations (West Branch of Mill Creek, Lowery Creek, South Fork of the Hay River, and Lawrence Creek) that have historically or currently serve as wild brood or translocation sources. In addition, we also sampled two additional populations near the Pine River watershed in the Driftless Area.
Library preparation
Total genomic DNA was extracted using the Promega Wizard Genomic DNA purification kit (Promega Corporation, Madison, Wisconsin). DNA concentrations were estimated using Quant-iT dsDNA PicoGreen assays (ThermoFisher Scientific, Waltham, Massachusetts) and a BioTek Synergy HTX Plate Reader (Agilent Technologies, Santa Clara, California). Approximately 100ng of each sample was digested in 2µl reactions with restriction enzyme SbfI (New England Biolabs, Ipswich, Massachusetts). Barcodes were then ligated to digested samples and sheared to fragment lengths of 300-500bp using 12-14 30-second cycles of a Q500 sonicator (Qsonica, Newtown, Connecticut). These fragments were subsequently bound to Dynabeads M-280 Streptavidin magnetic beads (ThermoFisher Scientific, Waltham, Massachusetts) and washed to remove non-target sequences. We then size-selected these fragments using AMPure XP beads (Beckman Coulter, Brea, California) targeting an insert size of 250bp and enriched our libraries following the NEBNext Ultra DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, Massachusetts) using plate-specific i7 barcodes and 12 cycles of PCR. Enriched and size-selected libraries were visually confirmed on a 2% agarose E-Gel (ThermoFisher Scientific, Waltham, Massachusetts) before a final AMPure XP bead purification. Final libraries were quantified using Quant-iT dsDNA PicoGreen assays, pooled to equimolar concentrations, and sequenced on an Illumina Novaseq 6000 using 1.4 lanes of 2x150bp S4 chemistry (Novogene, Sacramento, California).
Sequencing, genotyping, and filtering
Raw reads were processed using stacks v2.3 (Catchen et al. 2011, 2013). Briefly, sequences were demultiplexed by i5 and i7 barcodes, filtered for SbfI restriction sites, and trimmed to lengths of 140bp (i.e., -e SbfI -c -q -r -t 140 --filter_illumina --bestrad). Individual-specific reads passing these filters were aligned with ustacks and parameters previously shown to avoid under- and over-merging loci from salmonid RAD datasets (i.e., --disable-gapped -m 3 -M 5 -H --max_locus_stacks 4 --model_type bounded --bound_high 0.05). A catalog of loci was then built using all resultant stacks and the cstacks module (-n 3 -p 6 --disable_gapped). Individual-specific stacks were then matched against the locus catalog using the sstacks module (--disable_gapped). Paired end reads were then imported and all data were transposed by locus using tsv2bam. Paired end contigs were assembled and SNPs were called with the gstacks module. Resultant individual genotypes were exported in variant call format for loci genotyped in at least 30% of individuals using the populations module (-r 0.3). We further used vcftools v0.1.16 to filter this dataset by sequentially removing loci with minor allele counts less than three, loci that were not successfully genotyped in ≥70% of individuals in each population, and individuals that were not genotyped at ≥50% of loci (--mac 3, --exclude-positions, and --remove, respectively; Danecek et al. 2011). Paralogous loci, resultant from a recent whole genome duplication event in salmonids, were identified and removed using HDPlot with absolute read deviations greater than 8 and observed heterozygosity less than 0.5. Finally, as some of our downstream analyses assume independence of loci, we further filtered this dataset by retaining only one SNP per contig and chose to retain the SNP with the highest minor allele frequency in these instances.