Data from: Conservation genetics of the endangered California Freshwater Shrimp (Syncaris pacifica): Watershed and stream networks define gene pool boundaries
Data files
Version of Mar 30, 2026; total size 34.98 MB
- CFS_Arleq.arp (19.46 MB)
- dryad_sample_metadata.xlsx (16.04 KB)
- fasta.fa (15.32 MB)
- populations.stru (179.98 KB)
- README.md (3.12 KB)
Abstract
Understanding genetic structure and diversity among remnant populations of rare species can inform conservation and recovery actions. We used a population genetic framework to delineate gene pool boundaries and estimate gene flow and effective population sizes for the endangered California Freshwater Shrimp Syncaris pacifica. Tissues of 101 individuals were collected from 11 sites in five watersheds, using non-lethal tissue sampling. Single Nucleotide Polymorphism markers were developed de novo using ddRAD-seq methods, resulting in 433 unlinked loci scored with high confidence and low missing data. We found evidence for strong genetic structure across the species range. Two hierarchical levels of significant differentiation were observed: (i) five clusters (regional gene pools, FST = 0.38 – 0.75) isolated by low gene flow were associated with watershed limits and (ii) modest local structure among tributaries within a watershed that are not connected through direct downstream flow (local gene pools, FST = 0.06 – 0.10). Sampling sites connected with direct upstream-to-downstream water flow were not differentiated. Our analyses suggest that regional watersheds are isolated from one another, with very limited (possibly no) gene flow over recent generations. This isolation is paired with small effective population sizes across regional gene pools (Ne = 62.4 – 147.1). Genetic diversity was variable across sites and watersheds (He = 0.09 – 0.22). Those with the highest diversity may have been refugia and are now potential sources of genetic diversity for other populations. These findings highlight which portions of the species range may be most vulnerable to future habitat fragmentation and provide management consideration for maintaining local effective population sizes and genetic connectivity.
ddRADseq data files output by the Stacks pipeline
ddRADseq data were collected in 2019-2021 following a modified version of the Peterson et al. (2012) protocol using SbfI and MseI enzymes. Data processing was completed using STACKS (Catchen et al. 2013).
fasta.fa consists of short reads from ddRADseq libraries. The file contains full sequences in FASTA format, as output by the STACKS pipeline.
populations.stru contains SNP genotypes called from the ddRADseq short reads, in STRUCTURE format as output by the STACKS pipeline. Only one SNP per short read was retained to accommodate STRUCTURE's assumption of unlinked loci.
CFS_Arleq.arp was generated by converting the FASTA file (fasta.fa) into ARLEQUIN input format.
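
As a quick integrity check after download, the FASTA file can be summarized with a few lines of Python. The sketch below is a generic FASTA reader, not part of the original workflow. It makes no assumption about how STACKS structures the record headers, and it skips any line starting with `#` on the assumption that such lines are comments. The function names and the toy records are hypothetical.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for raw in handle:
        line = raw.strip()
        # Assumption: blank lines and '#' lines carry no sequence data.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(handle):
    """Return (number of records, {sequence length: count})."""
    lengths = Counter(len(seq) for _, seq in read_fasta(handle))
    return sum(lengths.values()), dict(lengths)


# Toy input standing in for fasta.fa (hypothetical records).
toy = io.StringIO(">rec1\nACGT\nAC\n>rec2\nGGTT\n")
n_records, length_counts = summarize_fasta(toy)
print(n_records, length_counts)
```

To run it on the deposited file, replace the toy handle with `open("fasta.fa")`.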
Sample information
Excel spreadsheet containing metadata for each sample included in the ddRADseq dataset:
dryad_sample_metadata.xlsx
Data Dictionary: dryad_sample_metadata.xlsx
| Column name | Description | Units / Format | Notes |
|---|---|---|---|
| sample_name | Unique identifier assigned to each individual sample | Text | Matches IDs used in sequence files |
| Collection Site | Name of the stream or locality where the sample was collected | Text | Field-based site name |
| Watershed | Watershed to which the collection site belongs | Text | Used to define population structure |
| title | Short descriptive name for the sequencing sample | Text | May match sequence submission naming |
| library_strategy | Sequencing approach used to generate the data | Text | e.g., ddRAD sequencing |
| library_source | Type of biological material used for sequencing | Text | e.g., genomic DNA |
| library_selection | Method used to select DNA fragments for sequencing | Text | e.g., restriction enzyme digestion |
| library_layout | Whether sequencing was single-end or paired-end | Text | e.g., SINGLE or PAIRED |
| platform | Sequencing platform used | Text | e.g., Illumina |
| instrument_model | Specific sequencing instrument used | Text | e.g., HiSeq, NovaSeq |
| design_description | Brief description of sample preparation and experimental design | Text | May include protocol details |
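
The `Collection Site` and `Watershed` columns can be used to check sample counts per group against the sampling design. The sketch below assumes the spreadsheet rows have already been loaded as a list of dictionaries keyed by column name (for example via `pandas.read_excel(...).to_dict("records")`, which is one option and not something the dataset prescribes). The function name and the toy rows are hypothetical.

```python
from collections import Counter


def samples_per_group(records, column):
    """Count samples for each distinct value of `column`."""
    return Counter(row[column] for row in records)


# Toy rows using the column names from the data dictionary;
# the values are placeholders, not real sample records.
toy_records = [
    {"sample_name": "S1", "Collection Site": "Site A", "Watershed": "W1"},
    {"sample_name": "S2", "Collection Site": "Site A", "Watershed": "W1"},
    {"sample_name": "S3", "Collection Site": "Site B", "Watershed": "W1"},
    {"sample_name": "S4", "Collection Site": "Site C", "Watershed": "W2"},
]

per_site = samples_per_group(toy_records, "Collection Site")
per_watershed = samples_per_group(toy_records, "Watershed")
print(dict(per_site))
print(dict(per_watershed))
```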
More details
For detailed information about data collection, pipeline processing, and analysis, please see the main article.
A total of 101 samples were collected with a D-frame net between October 2019 and July 2021 across the known species distribution range, from 11 stream segments in all five watersheds (see Fig. 1). Two or three abdominal appendages were non-invasively taken from 6-11 individuals per sampling site, preserved in cold 95% ethanol on site, and stored in a -20°C freezer before laboratory processing. DNA was extracted using a modified protocol with the DNeasy blood and tissue extraction kit (Qiagen, Valencia, CA). The DNA concentration of extractions was measured using a Qubit 3.0 fluorometer (Thermo Fisher Scientific).
We sent DNA extractions (10 ng/µL) to the Cornell Genomics Facility for double-digest RAD-seq library preparation following Peterson et al. (2012), using a combination of the SbfI high-fidelity and MspI restriction enzymes and 5-7 bp barcodes. Single-end sequencing was performed on an Illumina NextSeq 500 platform, with a total read length of 150 bp.
Raw sequences were adapter-trimmed, de-multiplexed, filtered, and processed using programs wrapped in the Stacks pipeline (Catchen et al. 2013). To minimize bias in the final data set, guidelines (Paris et al. 2017; Rochette and Catchen 2017) suggest running the pipeline several times with different parameter combinations. Throughout this process, we sought to maximize accuracy (e.g., by reducing presumed paralogous loci and low-confidence genetic data) while maintaining a high number of loci, sampling sites, and individuals per sampling site. We set the minimum number of raw reads required to form a stack to 3 (parameter m in ustacks), and tested different values for the maximum number of mismatches allowed between alleles of a (heterozygous) locus (parameter M in ustacks) and for the number of mismatches allowed between sample loci when building the catalog (parameter n in cstacks). Parameter values are generally selected to yield a high number of polymorphic loci found in 80% or more of the individuals within a sampling site (the r80 criterion, set with parameter r in populations). We retained only loci present in 80% of the samples and only alleles with an overall minor allele frequency of 0.02 or higher (--min-maf parameter in populations).
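
The two retention criteria in the last sentence can be illustrated on a toy genotype table. This is a simplified sketch, not the STACKS implementation: it assumes biallelic SNPs, represents each genotype as a pair of alleles (or `None` when missing), and applies both thresholds across all samples pooled. All function names and data are hypothetical.

```python
from collections import Counter


def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype at one locus."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes (biallelic assumed)."""
    counts = Counter(a for g in genotypes if g is not None for a in g)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # monomorphic or entirely missing
    return min(counts.values()) / total


def filter_loci(matrix, min_call_rate=0.8, min_maf=0.02):
    """Keep loci genotyped in >= min_call_rate of samples with MAF >= min_maf."""
    return [
        locus
        for locus, genotypes in matrix.items()
        if call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
    ]


# Toy data: five samples at three loci (hypothetical).
toy = {
    "locus_1": [("A", "A"), ("A", "G"), ("G", "G"), ("A", "A"), ("A", "G")],
    "locus_2": [("C", "C"), None, None, ("C", "T"), ("C", "C")],  # 60% called
    "locus_3": [("T", "T")] * 5,  # monomorphic
}

kept = filter_loci(toy)
print(kept)
```

Here `locus_2` fails the 80% presence threshold and `locus_3` fails the minor allele frequency threshold, so only `locus_1` is retained.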
