Species-informative GT-seq markers for Columbia River salmonid fishes: Genotypic data and computing resources
Data files
Dec 19, 2025 version files 18.09 MB
Abstract
Genetic monitoring of Pacific salmon in the Columbia River basin provides crucial information to fisheries managers that is otherwise challenging to obtain using traditional methods. Monitoring programs such as genetic stock identification (GSI) and parentage-based tagging (PBT) involve genotyping tens of thousands of individuals annually. Although rare, these large sample collections inevitably include misidentified species, which exhibit low genotyping success on species-specific Genotyping-in-Thousands by sequencing (GT-seq) panels. For laboratories involved in large-scale genotyping efforts, diagnosing non-target species and reassigning them to the appropriate monitoring program can be costly and time-consuming. To address this problem, we identified 19 primer pairs that exhibit consistent cross-species amplification among salmonids and contain 51 species-informative variants. These genetic markers reliably discriminate among 11 salmonid species and two subspecies of Cutthroat Trout and have been included in species-specific GT-seq panels for Chinook Salmon, Coho Salmon, Sockeye Salmon, and Rainbow Trout, commonly used for Pacific salmon genetic monitoring. The majority of species-informative amplicons (16) were newly identified from the four existing GT-seq panels, thus demonstrating a low-cost approach to species-identification when using targeted sequencing methods. A species-calling script was developed that is tailored for routine GT-seq genotyping pipelines and automates the identification of non-target species. Following extensive testing with empirical and simulated data, we demonstrated that the genetic markers and accompanying script accurately identified species and are robust to missing genotypic data and low-frequency, shared polymorphisms among species. Finally, we used these tools to identify Coho Salmon incidentally caught in the Columbia River Chinook Salmon sport fishery and used PBT to determine their hatchery of origin. 
These molecular and computing resources provide a valuable tool for Pacific salmon conservation in the Columbia River basin and demonstrate a cost-effective approach to species identification for genetic monitoring programs.
https://doi.org/10.5061/dryad.nk98sf817
Multi-locus species-informative genotypes used in Robinson et al. 2024.
Description of the data and file structure:
- Multi-locus genotypes for Columbia River salmonid samples processed by the Hagerman Genetics Lab, CRITFC.
- Contains individual information regarding field-identified species, genetic species identification, sample year, sample location, and species-informative genotypes.
- Multi-locus genotypes for Columbia River salmonid samples processed by the Eagle Fish Genetics Lab, IDFG.
- Contains individual information regarding field-identified species, genetic species identification, sample year, sample location, and species-informative genotypes.
- Multi-locus genotypes used to identify non-target species in a Chinook Salmon sport fishery sample collection in Washington and Oregon, USA, from 2021 to 2022.
- Contains individual information regarding genetic species identification, sample year, and species-informative genotypes.
Code/Software
Genotyping and species calling scripts are provided:
GTseq-pipeline
A slightly modified version of the Perl-based GTseq genotyping pipeline presented in Campbell et al. (2015).
Originally authored by Nate Campbell and later modified by Kim Vertacnik and Zak Robinson.
- GTseq_Genotyper_v5.pl now allows multiple SNPs per amplicon. At present, each SNP is called independently, and phased haplotypes are not provided.
- A typo that caused asymmetrical heterozygote calling based on the ratio of allelic counts has been corrected.
- An additional column, termed a "probeSeq flag," has been added to probeSeq files. These flags allow accurate counting of on-target reads in GTseq_GenoCompile_v5.1.pl and allow certain loci to be omitted from genotyping success (%GT) filters. If the probeSeq flag is omitted from the file, all loci assume a value of 1.
| ProbeSeq Flag | Explanation |
|---|---|
| 0 | Sex marker with stand-alone script |
| 1 | Flagship SNP of amplicon |
| 2 | Secondary SNP |
| 3 | Flagship SNP of amplicon. Not subject to %GT filter |
| 4 | Secondary SNP. Not subject to %GT filter |
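The flag meanings in the table above can be captured in a small lookup. The following is a minimal sketch, not part of the distributed pipeline: the names `FlagInfo`, `PROBESEQ_FLAGS`, and `parse_flag`, and the idea of passing the flag column position as `flag_index`, are assumptions made for illustration. The table does not say how flag 0 interacts with the %GT filter, so that entry is left as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FlagInfo:
    description: str
    # True/False as stated in the table; None where the table is silent.
    gt_filter_exempt: Optional[bool]


# Descriptions copied from the probeSeq flag table.
PROBESEQ_FLAGS = {
    0: FlagInfo("Sex marker with stand-alone script", None),
    1: FlagInfo("Flagship SNP of amplicon", False),
    2: FlagInfo("Secondary SNP", False),
    3: FlagInfo("Flagship SNP of amplicon. Not subject to %GT filter", True),
    4: FlagInfo("Secondary SNP. Not subject to %GT filter", True),
}

# Per the README, loci without a flag assume a value of 1.
DEFAULT_FLAG = 1


def parse_flag(fields: Sequence[str], flag_index: int) -> int:
    """Return the flag for one probeSeq row (hypothetical column position)."""
    if len(fields) <= flag_index or not fields[flag_index].strip():
        return DEFAULT_FLAG
    flag = int(fields[flag_index])
    if flag not in PROBESEQ_FLAGS:
        raise ValueError(f"unknown probeSeq flag: {flag}")
    return flag
```

For example, a row with no flag column resolves to 1 (flagship SNP, subject to the %GT filter), while a row flagged 3 is still a flagship SNP but is marked as exempt from that filter.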
GTseq_Demultiplex.py operates on a non-demultiplexed FASTQ file from an Illumina sequencing run.
The provided example file Undetermined_S0_R1_001.fastq.gz was generated with the following command and a generic sample sheet:
bcl-convert --output-directory --force --bcl-input-directory --sample-sheet SampleSheetV2_GTseq.csv --strict-mode true --no-lane-splitting true
Using the provided example files:
./GTseq_Demultiplex_V1.1.py --barcodeFile NS2K_85-bc.csv --fastq Undetermined_S0_R1_001.fastq.gz --i5rc Y --seqType GTseq --gzipOutput N
This should produce three FASTQ files: i202_03_TOSS2145_Ots_Harvest1.fastq, i202_09_TOSS2145_Ots_Harvest2.fastq, and i211_21_TOSS2154_Ots_Harvest3.fastq.
Compress them with gzip and move them to a new directory:
gzip i*
mkdir READS
mv i* ./READS
cd READS
With one FASTQ:
./GTseq_Genotyper_v5.pl OtsGTseqV9.2_363-probeSeqs.csv i202_03_TOSS2145_Ots_Harvest1.fastq.gz > i202_03_TOSS2145_Ots_Harvest1.genos
Parallelized example:
ls | grep fastq | while read -r LINE; do
genos_out=$(echo "$LINE" | sed -e 's/fastq\.gz$/genos/');
echo "../../GTseq_Genotyper_v5.pl ../OtsGTseqV9.2_363-probeSeqs.csv $LINE > $genos_out" >> GenotypeCommands.cmds;
done
parallel -j 3 < GenotypeCommands.cmds > Genotyper.log 2>&1
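The command list built by the shell loop above can also be generated in Python, which makes the filename mapping explicit and easy to check before anything is run. This is a sketch only: the function names `genos_name` and `build_commands` are hypothetical, and the relative paths in the usage section simply mirror those in the shell example.

```python
from pathlib import Path
from typing import Iterable, List

FASTQ_SUFFIX = ".fastq.gz"


def genos_name(fastq_name: str) -> str:
    """Map 'sample.fastq.gz' to 'sample.genos'."""
    if not fastq_name.endswith(FASTQ_SUFFIX):
        raise ValueError(f"not a gzipped FASTQ file: {fastq_name}")
    return fastq_name[: -len(FASTQ_SUFFIX)] + ".genos"


def build_commands(fastq_names: Iterable[str], genotyper: str, probe_seqs: str) -> List[str]:
    """Return one genotyping command line per FASTQ file, in sorted order."""
    return [
        f"{genotyper} {probe_seqs} {name} > {genos_name(name)}"
        for name in sorted(fastq_names)
    ]


if __name__ == "__main__":
    # Paths copied from the shell example; adjust to the local layout.
    names = [p.name for p in Path(".").glob("*" + FASTQ_SUFFIX)]
    for cmd in build_commands(
        names,
        "../../GTseq_Genotyper_v5.pl",
        "../OtsGTseqV9.2_363-probeSeqs.csv",
    ):
        print(cmd)
```

Redirecting the printed lines to GenotypeCommands.cmds gives a file that can be passed to parallel in the same way as above.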
Execute the sex marker script in the directory containing the FASTQ and .genos files:
OtsSEX_v3.1.pl --jobs=3 --ctrl_ratio=0.005
Move the '.genos' files to their own directory:
mkdir ../GENOS
mv *genos ../GENOS
cd ../GENOS
./GTseq_GenoCompile_v5.1.pl . --name=index --output=genotype --filter=0 > NS2K_85-GenosNF.csv
./GTseq_GenoCompile_v5.1.pl . --name=sample --output=genotype --filter=90 > NS2K_85-ProgGenos90.csv
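The two compile commands differ only in naming and in the filter value. As a rough illustration of what a genotyping success (%GT) threshold does, the sketch below keeps samples whose percentage of called genotypes meets a cutoff. It rests on several assumptions that the README does not confirm: that the filter value is a per-sample percentage, that the comparison is inclusive, and that missing genotypes are coded as "00" or "0". The function names are hypothetical, and this is not the logic of GTseq_GenoCompile_v5.1.pl.

```python
from typing import Dict, Sequence

# Hypothetical missing-genotype codes; check the compiled CSV for the real ones.
MISSING_CODES = ("00", "0")


def percent_genotyped(genotypes: Sequence[str]) -> float:
    """Percentage of loci with a called (non-missing) genotype."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g not in MISSING_CODES)
    return 100.0 * called / len(genotypes)


def filter_samples(samples: Dict[str, Sequence[str]], threshold: float) -> Dict[str, Sequence[str]]:
    """Keep samples whose %GT is at or above the threshold (assumed inclusive)."""
    return {
        name: genos
        for name, genos in samples.items()
        if percent_genotyped(genos) >= threshold
    }


# Invented toy data: s1 has 3 of 4 loci called (75%), s2 has all 10 called.
samples = {
    "s1": ["AA", "AG", "00", "GG"],
    "s2": ["AA"] * 10,
}
unfiltered = filter_samples(samples, 0)
filtered = filter_samples(samples, 90)
```

Under these assumptions a threshold of 0 retains every sample, while a threshold of 90 drops s1.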
Optional Analyses
Look for off-target species among these putative Chinook Salmon:
./CallSpecies.py --SpeciesSeq SpeciesSeq.csv --inGENO NS2K_85-ProgGenos90.csv --outGENO NS2K_85-ProgGenos90_CALLED.csv
Refer to MANUAL_CALLSPECIES.pdf (Zenodo link in the Related Works section) for a detailed description.
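To give a feel for how species-informative genotypes can separate species while tolerating missing data, here is a toy scoring sketch. It is not the algorithm implemented in CallSpecies.py, which is documented in the manual. The locus names, genotypes, the "00" missing code, the `min_called` cutoff, and the function `call_species` are all invented for illustration.

```python
from typing import Dict, Optional, Tuple

MISSING = "00"  # hypothetical missing-genotype code


def call_species(
    sample: Dict[str, str],
    reference: Dict[str, Dict[str, str]],
    min_called: int = 2,
) -> Tuple[Optional[str], float]:
    """Return the best-matching species and its match fraction.

    Only loci with a called genotype in the sample are scored, so missing
    data lowers the number of comparisons rather than counting as a mismatch.
    Ties go to the species listed first in the reference.
    """
    best_species: Optional[str] = None
    best_score = 0.0
    for species, expected in reference.items():
        shared = [loc for loc in expected if sample.get(loc, MISSING) != MISSING]
        if len(shared) < min_called:
            continue  # too few called loci to score this species
        matches = sum(1 for loc in shared if sample[loc] == expected[loc])
        score = matches / len(shared)
        if score > best_score:
            best_species, best_score = species, score
    return best_species, best_score


# Invented reference genotypes at three made-up loci.
reference = {
    "Chinook Salmon": {"L1": "AA", "L2": "GG", "L3": "CC"},
    "Coho Salmon": {"L1": "TT", "L2": "GG", "L3": "AA"},
}
sample = {"L1": "TT", "L2": "GG", "L3": "00"}
result = call_species(sample, reference)
```

In this toy case the sample matches Coho Salmon at both called loci and Chinook Salmon at only one, so Coho Salmon is returned even though one locus is missing.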
Details on data collection and processing can be found in the corresponding manuscript (Robinson et al. 2024; DOI: 10.1111/eva.13680). Briefly, known-species samples of Columbia River salmonid fishes were obtained by various state, federal, and tribal agencies. All samples were processed using the GT-seq library preparation protocol and genotyping pipeline described in Campbell et al. (2015). Individual multi-locus genotypes from species-informative loci and collection metadata are provided. We also provide the genotyping pipelines and computing resources used to make genotypic and species calls.
- Robinson, Zachary (2025). Species-informative GT-seq markers for Columbia River salmonid fishes: Genotypic data and computing resources. Zenodo. https://doi.org/10.5281/zenodo.10988226
- Robinson, Zachary (2025). Species-informative GT-seq markers for Columbia River salmonid fishes: Genotypic data and computing resources. Zenodo. https://doi.org/10.5281/zenodo.10988225
- Robinson, Zachary L.; Stephenson, Jeff; Vertacnik, Kim et al. (2024). Efficient species identification for Pacific salmon genetic monitoring programs. Evolutionary Applications. https://doi.org/10.1111/eva.13680
