
Data from: harnessing the power of regional baselines for broad-scale genetic stock identification: a multistage, integrated, and cost-effective approach

Cite this dataset

Hsu, Bobby; Habicht, Christopher (2023). Data from: harnessing the power of regional baselines for broad-scale genetic stock identification: a multistage, integrated, and cost-effective approach [Dataset]. Dryad. https://doi.org/10.5061/dryad.qbzkh18qk

Abstract

In mixed-stock fishery analyses, genetic stock identification (GSI) estimates the contribution of each population to a mixture and is typically conducted at a regional scale using genetic baselines specific to the stocks expected in that region. Often these regional baselines cannot be combined to produce broader geographical baselines due to non-overlapping populations and genetic markers. In cases where the mixture contains stocks spanning a wide area, a broad-scale baseline is created, but often at the cost of resolution. Here, we introduce a new GSI method to harness the resolution capabilities of baselines developed for regional applications in the analysis of mixtures containing individuals from a broad geographic range. This method employs a multistage framework that allows disparate baselines to be used in a single integrated process that produces estimates along with the propagated errors from each stage. All individuals in the mixture sample must be genotyped for all genetic markers in the baselines used by this model, but the broad-scale and regional baselines need not overlap in genetic markers or populations.

We demonstrate our integrated multistage GSI model using a synthesized data set made up of Chinook salmon, Oncorhynchus tshawytscha, from the North Bering Sea of Alaska. The data set is designed to be run using the R package Ms.GSI, and it does not represent the composition of the real fishery. The results show improved accuracy for estimates using an integrated multistage framework, compared to the conventional framework of using separate hierarchical steps. The integrated multistage framework allows GSI of a wide geographic area without first developing a large-scale, high-resolution genetic baseline or dividing a mixture sample into smaller regions beforehand. This approach is more cost-effective than updating range-wide baselines with all regionally important markers.

README: Harnessing the power of regional baselines for broad-scale genetic stock identification: A multistage, integrated, and cost-effective approach

https://doi.org/10.5061/dryad.qbzkh18qk

Description of the data and file structure

  • mix_ayk_205.rda is the mixture containing 205 individuals, in the format of the Gene Conservation Lab. Columns 1 and 2 identify each individual. The "SillySource" column is the unique individual ID, and the "SILLY_CODE" column is the location of the collection. SillySource and SILLY_CODE correspond to "indiv" and "collection", respectively, in the format of the R package rubias. Columns 3 onward identify the loci and their alleles for each individual. Missing genotypes are represented by "NA".
  • base_ayk10.rda is the broad-scale baseline, and it is in the same format as the mixture file.
  • base_yukon380_5grps.rda is the regional baseline in the same format as the mixture and broad-scale baseline files.
  • ayk_pops60_5grps.rda is a table containing information for the broad-scale populations. The "collection" column contains the name of the population. The "repunit" column contains the name of the assigned reporting unit. The "grpvec" column contains the corresponding number for the reporting unit.
  • yukon_pops43_4grps.rda is a table containing information for the regional populations. It has the same format as the table for the broad-scale populations.
  • msgsi_supp_data.r is a file containing R code to run the supplemental data using R package Ms.GSI.
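
The file layout described above can be explored with base R before running msgsi_supp_data.r. The following is a minimal sketch, not the authors' code: it builds small toy tables with invented values (the object names toy_mix and toy_pops, the locus name locus_x, and the two-columns-per-locus layout are all assumptions for illustration), saves one to a temporary .rda file, and loads it back into a separate environment. Loading into an environment is useful here because the names of the objects stored inside the real .rda files are not stated in this README.

```r
# Toy stand-in for the mixture table. All values are invented; the
# two-columns-per-locus layout is an assumption, not taken from the dataset.
toy_mix <- data.frame(
  SillySource = c("SITE_A_1", "SITE_A_2"),
  SILLY_CODE  = c("SITE_A", "SITE_A"),
  locus_x     = c("A", NA),  # first allele at a hypothetical locus
  locus_x.1   = c("G", NA),  # second allele; NA marks a missing genotype
  stringsAsFactors = FALSE
)

# Save the toy table to a temporary .rda file so the loading step can be shown.
rda_path <- file.path(tempdir(), "toy_mix.rda")
save(toy_mix, file = rda_path)

# Load into a fresh environment: load() returns the names of the restored
# objects, so the stored object name does not need to be known in advance.
env <- new.env()
loaded_names <- load(rda_path, envir = env)
mix <- get(loaded_names[1], envir = env)

# Rename the two ID columns to their rubias equivalents.
names(mix)[names(mix) == "SillySource"] <- "indiv"
names(mix)[names(mix) == "SILLY_CODE"]  <- "collection"

# Count individuals with at least one missing genotype (columns 3 onward).
n_missing <- sum(!complete.cases(mix[, -(1:2)]))

# Toy stand-in for a population information table (invented values).
toy_pops <- data.frame(
  collection = c("POP_1", "POP_2", "POP_3"),
  repunit    = c("Unit_A", "Unit_A", "Unit_B"),
  grpvec     = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# Sanity check: each reporting unit should map to exactly one group number.
grp_per_unit <- tapply(toy_pops$grpvec, toy_pops$repunit,
                       function(g) length(unique(g)))
stopifnot(all(grp_per_unit == 1))
```

To apply the same steps to the real files, replace rda_path with the path to, for example, mix_ayk_205.rda and skip the toy-table and save() lines.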

Methods

We used existing genetic baselines to simulate a synthetic mixture for this analysis. See the article for details.

Funding

Alaska Department of Fish and Game