Genotyping-in-Thousands by sequencing (GT-seq) genotyped data used for panel validation in GSI
Data files
Sep 30, 2024 version; files total 1.29 MB
- Dryad_GTseqRubias_Set1_ARCH-CBAY.txt (500.89 KB)
- Dryad_GTseqRubias_Set2_BKT-MSL.txt (172.62 KB)
- Dryad_GTseqRubias_Set3_BKT-JB.txt (232.70 KB)
- Dryad_GTseqRubias_Set4_LKWF-GSL.txt (158.68 KB)
- Dryad_GTseqRubias_Set5_LKWF-JB.txt (216.77 KB)
- GTseqValidation_Rubias_GeneralCode.R (4.78 KB)
- README.md (2.83 KB)
Abstract
Single Nucleotide Polymorphism (SNP) panels are powerful tools for assessing genetic population structure and dispersal of fishes and can enhance management practices for commercial, recreational, and subsistence mixed-stock fisheries. Arctic Char (Salvelinus alpinus), Brook Trout (Salvelinus fontinalis), and Lake Whitefish (Coregonus clupeaformis) are amongst the most harvested and consumed fish species in northern Indigenous communities in Canada, contributing significantly to food security, culture, tradition, and economy. However, genetic resources supporting Indigenous fisheries have not been widely accessible to northern communities (e.g., Inuit, Cree, and Dene). Here, we developed Genotyping-in-Thousands by sequencing (GT-seq) panels for population assignment and mixed-stock analyses of three salmonids, to support fisheries stewardship or co-management in northern Canada. Using low-coverage Whole Genome Sequencing data from 943 individuals across source populations in Cambridge Bay (Nunavut), Great Slave Lake (Northwest Territories), James Bay (Québec), and Mistassini Lake (Québec), we developed a bioinformatic SNP filtering workflow to select informative SNP markers from genotype likelihoods. These markers were then used to design GT-seq panels, thus enabling high-throughput genotyping for these species. The three GT-seq panels yielded an average of 413 autosomal loci and were validated with an average assignment accuracy of 83.03%. Thus, these GT-seq panels are powerful tools for assessing population structure and quantifying the relative contributions of populations/stocks in mixed stock fisheries across multiple regions. Interweaving these genomic-derived tools with Traditional Ecological Knowledge will ensure the sustainable harvest of three culturally important salmonids in Indigenous communities, contributing to food security programs and the economy in northern Canada.
README: GT-seq genotyped data used for panel validation in GSI.
https://doi.org/10.5061/dryad.jwstqjqjk
Description of the data and file structure
Genotype data in rubias format for each of the five species/region datasets used in our paper to estimate assignment accuracies for genetic stock identification (GSI) studies. See the manuscript for further details.
- Arctic Char in Cambridge Bay (Dryad_GTseqRubias_Set1_ARCH-CBAY.txt):
  - Five populations; 95 samples in the training set (TS) and 196 samples in the holdout set (HS), all genotyped at 398 SNPs.
- Brook Trout in James Bay (Dryad_GTseqRubias_Set3_BKT-JB.txt):
  - Four populations; 56 samples in the training set (TS), 64 samples in the holdout set (HS), and 41 samples in the independent set (IS), all genotyped at 328 SNPs.
- Brook Trout in Mistassini Lake (Dryad_GTseqRubias_Set2_BKT-MSL.txt):
  - Three populations; 38 samples in the training set (TS), 10 samples in the holdout set (HS), and 50 samples in the independent set (IS), all genotyped at 393 SNPs.
- Lake Whitefish in James Bay (Dryad_GTseqRubias_Set5_LKWF-JB.txt):
  - Three populations; 37 samples in the training set (TS), 81 samples in the holdout set (HS), and 29 samples in the independent set (IS), all genotyped at 345 SNPs.
- Lake Whitefish in Great Slave Lake (Dryad_GTseqRubias_Set4_LKWF-GSL.txt):
  - Two populations; 42 samples in the training set (TS) and 54 samples in the holdout set (HS), all genotyped at 354 SNPs.
Files and variables
Description: Each txt file is a dataset used for running the infer_mixture and self_assign functions in the R package rubias.
Variables
Each file is structured as follows:
- sample_type: mixture or reference.
- repunit: the reporting unit. Reference samples carry a population ID; "mixture" samples have "NA".
- collection: the population ID (popID), given for all samples.
- indiv: an individual ID.
- set: whether the sample belongs to the TS (training set), HS (holdout set), or IS (independent set).
- Loci columns: two columns per locus. Alleles are expressed as characters (nucleotides A, C, T, G); missing data at a locus are expressed as NA in both gene-copy columns of that locus.
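As a rough illustration of the layout described above, the base R sketch below reads a small in-memory stand-in for one of the files and checks its structure. The tab delimiter, the header row, the toy locus names (snp1, snp1.1, ...), the helper name read_gtseq, and the pairing of TS with reference rows and HS with mixture rows are all assumptions made for illustration, not facts taken from the Dryad files. Reading every column as character keeps R from interpreting an allele column that contains only "T" as logical TRUE.

```r
# Toy stand-in for one dataset file. The tab delimiter, header row, and
# locus column names are assumptions made for this sketch.
toy_text <- paste(
  "sample_type\trepunit\tcollection\tindiv\tset\tsnp1\tsnp1.1\tsnp2\tsnp2.1",
  "reference\tpopA\tpopA\tind01\tTS\tA\tG\tT\tT",
  "reference\tpopB\tpopB\tind02\tTS\tG\tG\tT\tT",
  "mixture\tNA\tpopA\tind03\tHS\tA\tA\tNA\tNA",
  sep = "\n"
)

# Hypothetical helper: read a dataset with all columns as character so that
# nucleotide "T" is never converted to logical, and "NA" becomes a true NA.
read_gtseq <- function(...) {
  read.table(..., header = TRUE, sep = "\t", colClasses = "character",
             na.strings = "NA", check.names = FALSE)
}

dat <- read_gtseq(text = toy_text)

# The five metadata columns described in this README come first.
meta_cols <- c("sample_type", "repunit", "collection", "indiv", "set")
stopifnot(identical(names(dat)[seq_along(meta_cols)], meta_cols))

# Two columns per locus, so the locus count is half the remaining columns.
n_loci <- (ncol(dat) - length(meta_cols)) / 2
print(n_loci)

# Cross-tabulate set membership against sample type.
print(table(dat$set, dat$sample_type))
```

To try this on a downloaded file, pass its path instead of the toy text, for example read_gtseq("Dryad_GTseqRubias_Set1_ARCH-CBAY.txt"), and adjust the delimiter if the file turns out not to be tab-separated.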
Code/software
All analyses for validating the power of the panels for GSI were performed in R, using the package rubias. Generally, we follow recommendations provided by Eric Anderson, the developer, found in this link.
General code for running all five genotype datasets included in the publication is provided as a separate file on Dryad, "GTseqValidation_Rubias_GeneralCode.R".
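For orientation only, here is a minimal sketch of how one dataset might be passed to the two rubias functions named in this README. It does not reproduce GTseqValidation_Rubias_GeneralCode.R. The helper split_for_rubias is hypothetical, the file is assumed to be tab-delimited and in the working directory, and gen_start_col = 6 assumes the five metadata columns listed under Variables precede the loci columns. The rubias calls run only if the package is installed and the file is present.

```r
# Hypothetical helper: split a dataset into the reference and mixture
# data frames, keeping the same column structure in both.
split_for_rubias <- function(dat) {
  list(
    reference = dat[dat$sample_type %in% "reference", , drop = FALSE],
    mixture   = dat[dat$sample_type %in% "mixture", , drop = FALSE]
  )
}

# Assumed location of one of the Dryad files.
path <- "Dryad_GTseqRubias_Set4_LKWF-GSL.txt"

if (requireNamespace("rubias", quietly = TRUE) && file.exists(path)) {
  # Read everything as character so allele "T" is not parsed as logical.
  dat <- read.table(path, header = TRUE, sep = "\t",
                    colClasses = "character", na.strings = "NA")
  parts <- split_for_rubias(dat)

  # Assumption: sample_type, repunit, collection, indiv, set occupy
  # columns 1 to 5, so genotype data start at column 6.
  gen_start_col <- 6

  # Self-assignment of the reference samples.
  sa <- rubias::self_assign(reference = parts$reference,
                            gen_start_col = gen_start_col)

  # Mixture analysis of the mixture samples against the reference.
  mix <- rubias::infer_mixture(reference = parts$reference,
                               mixture = parts$mixture,
                               gen_start_col = gen_start_col)

  print(head(sa))
  print(head(mix$mixing_proportions))
}
```

rubias accepts extra metadata columns as long as they all precede the genotype columns, which is why the set column can stay in place here.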
Methods
A total of 793 fish samples were genotyped using the GT-seq panels developed in this study. After filtering (removing SNPs with >50% missing data and samples with >30% missing data), the final datasets were converted to rubias format for each of the five species/region datasets used in our paper to estimate assignment accuracies for GSI studies.
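The two missing-data thresholds can be sketched in base R as follows. The Dryad files are already filtered, so this is illustrative only and is not the authors' pipeline. The function name filter_missing, the order of the two steps (loci first, then samples), and the use of the first gene-copy column of each locus to measure missingness are assumptions.

```r
# Hypothetical filter: drop loci with > 50% missing genotypes, then drop
# samples with > 30% missing genotypes among the remaining loci.
filter_missing <- function(dat, n_meta = 5,
                           max_locus_missing = 0.5,
                           max_sample_missing = 0.3) {
  n_geno <- ncol(dat) - n_meta
  stopifnot(n_geno > 0, n_geno %% 2 == 0)

  # Index of the first gene-copy column of every locus.
  first_copy <- n_meta + seq(1, n_geno, by = 2)

  # Step 1: proportion of samples with a missing genotype at each locus.
  locus_missing <- colMeans(is.na(dat[, first_copy, drop = FALSE]))
  keep_first <- first_copy[locus_missing <= max_locus_missing]
  if (length(keep_first) == 0) stop("no loci pass the locus filter")
  keep_cols <- c(seq_len(n_meta), sort(c(keep_first, keep_first + 1)))
  dat <- dat[, keep_cols, drop = FALSE]

  # Step 2: proportion of retained loci missing in each sample.
  first_copy <- n_meta + seq(1, ncol(dat) - n_meta, by = 2)
  sample_missing <- rowMeans(is.na(dat[, first_copy, drop = FALSE]))
  dat[sample_missing <= max_sample_missing, , drop = FALSE]
}

# Toy data: four samples, three loci (two columns per locus).
toy <- data.frame(
  sample_type = "reference", repunit = "popA", collection = "popA",
  indiv = paste0("ind", 1:4), set = "TS",
  L1 = c("A", "A", "G", NA), L1.1 = c("A", "G", "G", NA),
  L2 = c("C", NA, "C", NA),  L2.1 = c("C", NA, "T", NA),
  L3 = c(NA, NA, NA, "G"),   L3.1 = c(NA, NA, NA, "G"),
  stringsAsFactors = FALSE
)

filtered <- filter_missing(toy)
print(filtered)
```

In the toy data, locus L3 is missing in three of four samples and is dropped; samples ind2 and ind4 then exceed the 30% threshold across the two remaining loci and are dropped as well.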