Data from: Conservation genetics of the endangered California Freshwater Shrimp (Syncaris pacifica): Watershed and stream networks define gene pool boundaries

Ada, Abdul M.1 ; Vandergast, Amy G.2 ; Fisher, Robert N.2 ; Fong, Darren3 ; Bohonak, Andrew J.4

Published Mar 30, 2026 on Dryad. https://doi.org/10.5061/dryad.41ns1rnnp

Data files

Mar 30, 2026 version files 34.98 MB

File	Size
CFS_Arleq.arp	19.46 MB
dryad_sample_metadata.xlsx	16.04 KB
fasta.fa	15.32 MB
populations.stru	179.98 KB
README.md	3.12 KB

Abstract

Understanding genetic structure and diversity among remnant populations of rare species can inform conservation and recovery actions. We used a population genetic framework to delineate gene pool boundaries and estimate gene flow and effective population sizes for the endangered California Freshwater Shrimp Syncaris pacifica. Tissues of 101 individuals were collected from 11 sites in five watersheds, using non-lethal tissue sampling. Single Nucleotide Polymorphism markers were developed de novo using ddRAD-seq methods, resulting in 433 unlinked loci scored with high confidence and low missing data. We found evidence for strong genetic structure across the species range. Two hierarchical levels of significant differentiation were observed: (i) five clusters (regional gene pools, F_ST = 0.38 – 0.75) isolated by low gene flow were associated with watershed limits and (ii) modest local structure among tributaries within a watershed that are not connected through direct downstream flow (local gene pools, F_ST = 0.06 – 0.10). Sampling sites connected with direct upstream-to-downstream water flow were not differentiated. Our analyses suggest that regional watersheds are isolated from one another, with very limited (possibly no) gene flow over recent generations. This isolation is paired with small effective population sizes across regional gene pools (N_e = 62.4 – 147.1). Genetic diversity was variable across sites and watersheds (H_e = 0.09 – 0.22). Those with the highest diversity may have been refugia and are now potential sources of genetic diversity for other populations. These findings highlight which portions of the species range may be most vulnerable to future habitat fragmentation and provide management considerations for maintaining local effective population sizes and genetic connectivity.

ddRADseq data files output by the Stacks pipeline

ddRADseq data were collected in 2019 – 2021 following a modified version of the Peterson et al. (2012) protocol using SbfI and MseI enzymes. Data processing was completed using Stacks (Catchen et al. 2013).

fasta.fa consists of short reads from ddRADseq libraries. The file contains the full sequences in FASTA format, as output by the Stacks pipeline.
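As a quick sanity check on a FASTA file such as fasta.fa, the records can be counted and their lengths tallied. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the demo records are made up, and no assumption is made about how Stacks formats the header lines.

```python
import io
from collections import Counter
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # header text without the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequences may wrap across several lines
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(handle: TextIO) -> Tuple[int, Counter]:
    """Return the record count and a tally of sequence lengths."""
    lengths = Counter(len(seq) for _, seq in read_fasta(handle))
    return sum(lengths.values()), lengths


# Made-up records for illustration only; these are not from the dataset.
demo = io.StringIO(">rec1\nACGTAC\nGT\n>rec2\nGGCC\n")
n_records, length_tally = summarize_fasta(demo)
print(n_records, dict(length_tally))
```

To run it on the real file, pass an open file handle instead of the in-memory demo, for example `summarize_fasta(open("fasta.fa"))`.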

populations.stru contains SNP genotypes derived from the ddRADseq short reads, in STRUCTURE format, as output by the Stacks pipeline. Only one SNP per short read was retained to accommodate STRUCTURE's assumption that markers are unlinked.
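The one-SNP-per-read rule can be illustrated with a short sketch. It assumes that the SNP with the lowest position in each locus is the one kept. The source does not say whether the first SNP or a random SNP was retained, and the function name and the (locus, position) input layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Snp = Tuple[str, int]  # (locus identifier, SNP position within the locus)


def keep_first_snp_per_locus(snps: Iterable[Snp]) -> List[Snp]:
    """Keep a single SNP per locus: the one with the lowest position.

    Assumption: "first SNP" is used here for illustration; the dataset
    may have been thinned with a different rule.
    """
    first: Dict[str, int] = {}
    for locus, position in snps:
        if locus not in first or position < first[locus]:
            first[locus] = position
    return sorted(first.items())


# Made-up SNP calls for illustration only.
calls = [("locus_2", 40), ("locus_1", 17), ("locus_1", 5), ("locus_2", 12)]
print(keep_first_snp_per_locus(calls))
```

Keeping one SNP per locus removes the tight physical linkage between SNPs on the same short read, which is the property STRUCTURE relies on.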

CFS_Arleq.arp was generated by converting the FASTA file (fasta.fa) into ARLEQUIN input format.

Sample information

Excel spreadsheet containing metadata for each sample included in the ddRADseq dataset:

dryad_sample_metadata.xlsx

Data Dictionary: dryad_sample_metadata.xlsx

Column name	Description	Units / Format	Notes
sample_name	Unique identifier assigned to each individual sample	Text	Matches IDs used in sequence files
Collection Site	Name of the stream or locality where the sample was collected	Text	Field-based site name
Watershed	Watershed to which the collection site belongs	Text	Used to define population structure
title	Short descriptive name for the sequencing sample	Text	May match sequence submission naming
library_strategy	Sequencing approach used to generate the data	Text	e.g., ddRAD sequencing
library_source	Type of biological material used for sequencing	Text	e.g., genomic DNA
library_selection	Method used to select DNA fragments for sequencing	Text	e.g., restriction enzyme digestion
library_layout	Whether sequencing was single-end or paired-end	Text	e.g., SINGLE or PAIRED
platform	Sequencing platform used	Text	e.g., Illumina
instrument_model	Specific sequencing instrument used	Text	e.g., HiSeq, NovaSeq
design_description	Brief description of sample preparation and experimental design	Text	May include protocol details
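Once the metadata rows are loaded, the Watershed and Collection Site columns can be used to check sample counts per group. The sketch below works on plain dictionaries keyed by the column names in the data dictionary. The rows shown are placeholders rather than real records, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by(rows: Iterable[Dict[str, str]], column: str) -> Counter:
    """Count how many samples share each value of the given column."""
    return Counter(row[column] for row in rows)


# Placeholder rows for illustration only; values are not from the dataset.
# Real rows could be loaded with, for example,
#   pandas.read_excel("dryad_sample_metadata.xlsx").to_dict("records")
# assuming the header is in the first row of the first sheet.
rows = [
    {"sample_name": "S1", "Collection Site": "Site A", "Watershed": "W1"},
    {"sample_name": "S2", "Collection Site": "Site A", "Watershed": "W1"},
    {"sample_name": "S3", "Collection Site": "Site B", "Watershed": "W2"},
]

print(count_by(rows, "Watershed"))
print(count_by(rows, "Collection Site"))
```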

More details

For detailed information about data collection, pipeline processing, and analysis, please see the main article.

A total of 101 samples were collected with a D-frame net between October 2019 and July 2021 across the known species distribution range, from 11 stream segments in all five watersheds (see Fig. 1). Two or three abdominal appendages were non-invasively taken from 6 – 11 individuals per sampling site, preserved in cold 95% ethanol on site, and stored in a -20°C freezer before laboratory processing. DNA was extracted using a modified protocol with the DNeasy Blood & Tissue extraction kit (Qiagen, Valencia, CA). The DNA concentration of extractions was measured using a Qubit 3.0 fluorometer (Thermo Fisher Scientific).

We sent DNA extractions (10 ng/µL) to the Cornell Genomics Facility for double-digest RAD-seq library preparation following Peterson et al. (2012), using a combination of the high-fidelity SbfI and MspI restriction enzymes and 5 – 7 bp barcodes. Single-end sequencing was performed on an Illumina NextSeq 500 platform with a read length of 150 bp.

Raw sequences were adapter-trimmed, de-multiplexed, filtered, and processed using programs wrapped in the Stacks pipeline (Catchen et al. 2013). To minimize bias in the final data set, guidelines (Paris et al. 2017; Rochette and Catchen 2017) suggest running the pipeline several times with different parameter combinations. Throughout this process, we sought to maximize accuracy (e.g., by reducing presumed paralogous loci and low-confidence genetic data) while maintaining a high number of loci, sampling sites, and individuals per sampling site. We set the minimum number of raw reads required to form a stack to 3 (parameter m in ustacks), and tested different values for the maximum number of mismatches allowed between alleles of a heterozygous locus (parameter M in ustacks) and for the number of mismatches allowed between sample loci when building the catalog (parameter n in cstacks). Parameter values are generally selected to yield a high number of polymorphic loci found in 80% or more of the individuals within a sampling site (the r80 criterion, applied in populations). We retained only loci present in at least 80% of the samples and only alleles with an overall minor allele frequency of 0.02 or higher (--min-maf parameter in populations).
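The two retention rules at the end of this paragraph (presence in at least 80% of samples, minor allele frequency of at least 0.02) can be illustrated with a small sketch. This is not the Stacks implementation. It assumes biallelic genotypes coded as the number of copies of one allele (0, 1, or 2) with None for missing calls, and it applies the 80% rule across all samples rather than within each sampling site. The function name and the data are hypothetical.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1, or 2 copies of one allele; None = missing


def filter_loci(
    genotypes: Dict[str, List[Genotype]],
    min_call_rate: float = 0.8,
    min_maf: float = 0.02,
) -> List[str]:
    """Return loci that pass both the call-rate and the MAF threshold."""
    kept = []
    for locus, calls in genotypes.items():
        observed = [g for g in calls if g is not None]
        if not observed:
            continue  # no genotyped samples at this locus
        if len(observed) / len(calls) < min_call_rate:
            continue  # genotyped in too few samples
        allele_freq = sum(observed) / (2 * len(observed))  # diploid samples
        maf = min(allele_freq, 1 - allele_freq)
        if maf >= min_maf:
            kept.append(locus)
    return kept


# Made-up genotypes for five samples; illustration only.
demo = {
    "locus_A": [0, 1, 2, 1, None],     # 4 of 5 genotyped, polymorphic
    "locus_B": [0, 0, None, None, 1],  # only 3 of 5 genotyped
    "locus_C": [0, 0, 0, 0, 0],        # monomorphic, MAF = 0
}
print(filter_loci(demo))
```

In this demo only locus_A passes: locus_B fails the 80% presence rule and locus_C fails the minor allele frequency rule.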