Supplementary data for: Outmigrating Central Valley Chinook Salmon
Data files
May 08, 2024 version files 141.43 MB
- cv_juv_assignments.R
- juvenile.1kaln_cutoff.snp_panel.singlereadcalls.csv
- juvenile.assignment_posteriors_calls.csv
- leave_one_out.flfVws.R
- leave_one_out.fVlf.R
- leave_one_out.maj_pop_1.results.csv
- leave_one_out.maj_pop_2.results.csv
- leave_one_out.sbVsmd.R
- leave_one_out.subpop_fall_V_latefall.results.csv
- leave_one_out.subpop_sprButte_V_sprMillDeer.results.csv
- leave_one_out.winterVall.R
- README.md
- sample_metadata.csv
- trainingset_allruns_filtered.pop_key.csv
- trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
- validation_analyses.R
- validation_set.assignment_results.csv
- validation_set.pop_key.csv
- validation_set.snp_panel.singlereadcalls.csv
Abstract
Intraspecific diversity plays a critical role in the resilience of Chinook salmon populations. California’s Central Valley historically hosted one of the most diverse population complexes of Chinook salmon in the world. However, anthropogenic factors have dramatically decreased this diversity, with severe consequences for population resilience. Here we use next generation sequencing and an archive of thousands of tissue samples collected across two decades during the juvenile outmigration to evaluate phenotypic diversity between and within populations of Central Valley Chinook salmon. To account for highly heterogeneous sample qualities in the archive dataset, we develop and test an approach for population and subpopulation assignments of Central Valley Chinook salmon that allows inclusion of relatively low-quality samples while controlling error rates. We find significantly distinct outmigration timing and body size distributions for each population and subpopulation. Within the archive dataset, spring run individuals that assigned to the Mill and Deer Creeks subpopulation exhibited an earlier and broader outmigration distribution as well as larger body sizes than individuals that assigned to the Butte Creek subpopulation. Within the fall run population, individuals that assigned to the late-fall run subpopulation also exhibited an earlier and broader outmigration distribution and larger body sizes than other fall run fish in our dataset. These results highlight the importance of distinct subpopulations for maintaining remaining diversity in Central Valley Chinook salmon, and demonstrate the power of genomics-based population assignments to aid the study and management of intraspecific diversity.
README: Outmigrating Central Valley Chinook Salmon
Access this data on Dryad (DOI: 10.5061/dryad.280gb5mxx)
This Dryad entry contains data files and scripts used for analyses in Thompson et al. 2024 (Evolutionary Applications). Briefly, the study examines outmigration characteristics of juvenile Chinook salmon collected at Chipps Island in the Sacramento/San Joaquin Delta by assigning each sample to a population of origin. The analysis is broken down into three parts: 1) leave-one-out analyses to develop a population assignment method and evaluate its expected efficacy; 2) validation of the assignment method using an independent dataset of known-origin samples; 3) population assignment of unknown-origin juvenile samples collected at Chipps Island (and obtained from a tissue archive). This Dryad entry contains files and scripts necessary for that analysis, as well as the resulting output files.
Data Files
The following are descriptions of all data files uploaded here.
(Note: the single read call files described below were generated from sequencing data using the program ANGSD version 0.935 (https://www.popgen.dk/angsd/index.php/ANGSD). See Methods in Thompson et al. (2024) for a full description of sequencing data sources/generation.)
trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
File with allele calls from single read sampling for all samples in the training set (sequencing data available from Meek et al., 2020) at each site in the SNP panel. Each line contains data for one SNP in the SNP panel. Allele calls are coded as 0 for the major allele, 1 for the minor allele, and -1 for missing data. See below for an explanation of column labels:
chr Chromosome where the SNP for that line is located
pos Position of the SNP on the chromosome
major The major allele in A,C,G,T format
minor The minor allele in A,C,G,T format
ind[#] All following columns contain allele call data for individual samples. Column name corresponds to sample ID. See trainingset_allruns_filtered.pop_key.csv for specific information on each sample.
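As a minimal sketch of how a file in this layout might be loaded in R, the example below reads a small made-up table with the same column structure, recodes -1 as NA, and computes the proportion of missing calls per sample. The chromosome names, positions, and calls are invented for illustration, and the comma-delimited layout with a header row is an assumption based on the .csv extension; none of this is taken from the deposited scripts.

```r
# Toy stand-in for a *.snp_panel.singlereadcalls.csv file (all values made up).
toy_csv <- "chr,pos,major,minor,ind0,ind1,ind2
chrA,101,A,G,0,1,-1
chrA,250,C,T,-1,0,-1
chrB,77,G,A,1,0,0
chrB,912,T,C,0,-1,-1"

calls <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# The first four columns describe the SNP; the rest hold per-sample calls.
site_cols   <- c("chr", "pos", "major", "minor")
sample_cols <- setdiff(names(calls), site_cols)

# Put samples in rows and SNPs in columns, then recode -1 (missing) as NA.
geno <- t(as.matrix(calls[, sample_cols]))
geno[geno == -1] <- NA
colnames(geno) <- paste(calls$chr, calls$pos, sep = ":")

# Proportion of missing allele calls per sample (e.g., 0.25 = 25% missing).
missingness <- rowMeans(is.na(geno))
print(missingness)
```

For a real file, the `text = toy_csv` argument would be replaced by the file path, for example `read.csv("trainingset_allruns_filtered.snp_panel.singlereadcalls.csv")`.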
trainingset_allruns_filtered.pop_key.csv
File linking training set sample IDs to true populations and subpopulations of origin. See below for column descriptions:
ibs_file_id Sample IDs corresponding to file names in trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
sample Original sample names from Meek et al., 2020
major_pop1 Population of origin relevant for the first population assignment step described in this manuscript
major_pop2 Population of origin relevant for the second population assignment step described in this manuscript
sub_pop Subpopulation of origin
minor_sub_pop Minor subpopulation of origin for individuals whose sub_pop = WildSpring_MillDeer (members of other subpopulations have NA listed in this column)
leave_one_out.maj_pop_1.results.csv
Concatenated results file containing info and population assignment posterior probabilities for the leave-one-out analysis of assignment step 1 (Winter vs. fall/late-fall/WildSpring). See below for column descriptions:
Individual sample ID corresponding to the IDs in trainingset_allruns_filtered.pop_key.csv
Missingness The proportion of missing allele calls (e.g., 0.2 = 20% missing data)
Post.Fall_LateFall_WildSpring The posterior probability from DAPC of assignment to the Fall/Late-fall/WildSpring population
Post.Winter The posterior probability from DAPC of assignment to the Winter population
True_pop The true population of origin for the given sample
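One way to summarize a leave-one-out results table of this shape is to convert the posterior columns into a call and tabulate how often the call matches True_pop at different levels of missing data. The sketch below does this on a made-up table. The posterior values, the True_pop labels (assumed here to match the Post.* column suffixes), and the missingness bin edges are illustrative assumptions and are not the thresholds used in the study.

```r
# Toy stand-in for leave_one_out.maj_pop_1.results.csv (all values made up).
loo <- data.frame(
  Individual  = paste0("ind", 0:5),
  Missingness = c(0.10, 0.35, 0.62, 0.80, 0.91, 0.97),
  Post.Fall_LateFall_WildSpring = c(0.99, 0.02, 0.90, 0.40, 0.70, 0.45),
  Post.Winter                   = c(0.01, 0.98, 0.10, 0.60, 0.30, 0.55),
  True_pop = c("Fall_LateFall_WildSpring", "Winter", "Fall_LateFall_WildSpring",
               "Winter", "Winter", "Fall_LateFall_WildSpring"),
  stringsAsFactors = FALSE
)

# Call = population with the greatest posterior probability.
post_cols <- grep("^Post\\.", names(loo), value = TRUE)
best      <- max.col(as.matrix(loo[, post_cols]), ties.method = "first")
loo$Call  <- sub("^Post\\.", "", post_cols[best])
loo$Correct <- loo$Call == loo$True_pop

# Accuracy within (arbitrary, illustrative) missingness bins.
loo$Miss_bin <- cut(loo$Missingness, breaks = c(0, 0.5, 0.9, 1),
                    include.lowest = TRUE)
accuracy_by_bin <- tapply(loo$Correct, loo$Miss_bin, mean)
print(accuracy_by_bin)
```

The same pattern would apply to the other leave-one-out results files by changing only the Post.* column names.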
leave_one_out.maj_pop_2.results.csv
Concatenated results file containing info and population assignment posterior probabilities for the leave-one-out analysis of assignment step 2 (Fall/Late-fall vs. Wild Spring). See below for column descriptions:
Individual sample ID corresponding to the IDs in trainingset_allruns_filtered.pop_key.csv
Missingness The proportion of missing allele calls (e.g., 0.2 = 20% missing data)
Post.Fall_LateFall The posterior probability from DAPC of assignment to the Fall/Late-fall population
Post.WildSpring The posterior probability from DAPC of assignment to the Wild Spring population
True_pop The true population of origin for the given sample
leave_one_out.subpop_sprButte_V_sprMillDeer.results.csv
Concatenated results file containing info and population assignment posterior probabilities for the leave-one-out analysis of assignment step 3a (Spring Butte vs. Spring Mill/Deer). See below for column descriptions:
Individual Sample ID corresponding to the IDs in trainingset_allruns_filtered.pop_key.csv
Missingness The proportion of missing allele calls (e.g., 0.2 = 20% missing data)
Post.WildSpring_Butte The posterior probability from DAPC of assignment to the Wild Spring Butte subpopulation
Post.WildSpring_MillDeer The posterior probability from DAPC of assignment to the Wild Spring Mill/Deer subpopulation
True_pop The true subpopulation of origin for the given sample
leave_one_out.subpop_fall_V_latefall.results.csv
Concatenated results file containing info and population assignment posterior probabilities for the leave-one-out analysis of assignment step 3b (fall vs. late-fall). See below for column descriptions:
Individual sample ID corresponding to the IDs in trainingset_allruns_filtered.pop_key.csv
Missingness The proportion of missing allele calls (e.g., 0.2 = 20% missing data)
Post.Fall The posterior probability from DAPC of assignment to the Fall subpopulation
Post.LateFall The posterior probability from DAPC of assignment to the Late Fall subpopulation
True_pop The true subpopulation of origin for the given sample
validation_set.snp_panel.singlereadcalls.csv
File with allele calls from single read sampling for all samples in the validation dataset (sequencing data available from Baerwald et al., 2023) at each site in the SNP panel. Each line contains data for one SNP in the SNP panel. Allele calls are coded as 0 for the major allele, 1 for the minor allele, and -1 for missing data. See below for descriptions of column labels:
chr Chromosome where the SNP for that line is located
pos Position of the SNP on the chromosome
major The major allele in A,C,G,T format
minor The minor allele in A,C,G,T format
ind[#] All following columns contain allele call data for individual samples. Column name corresponds to sample ID. See validation_set.pop_key.csv for specific information on each sample.
validation_set.pop_key.csv
File linking validation set sample IDs to true populations and subpopulations of origin. See below for column descriptions:
singleread_file_id Sample IDs corresponding to file names in validation_set.snp_panel.singlereadcalls.csv
sample Original sample names from Baerwald et al. (2023)
major_pop1 Population of origin relevant for the first population assignment step described in this manuscript
major_pop2 Population of origin relevant for the second population assignment step described in this manuscript
sub_pop Subpopulation of origin
minor_sub_pop Minor subpopulation of origin for individuals whose sub_pop = WildSpring_MillDeer (members of other subpopulations have NA listed in this column)
validation_set.assignment_results.csv
File containing validation dataset population assignment results. See below for column descriptions:
Sample Sample IDs corresponding to file names in validation_set.pop_key.csv.
Fall_LateFall_WildSpring The posterior probability from DAPC of assignment to the Fall/Late-fall/WildSpring population in assignment step 1.
Winter The posterior probability from DAPC of assignment to the Winter population in assignment step 1.
Fall_LateFall The posterior probability from DAPC of assignment to the Fall/Late-fall population in assignment step 2.
WildSpring The posterior probability from DAPC of assignment to the WildSpring population in assignment step 2.
WildSpring_Butte The posterior probability from DAPC of assignment to the WildSpring Butte subpopulation in assignment step 3a.
WildSpring_MillDeer The posterior probability from DAPC of assignment to the WildSpring Mill/Deer subpopulation in assignment step 3a.
Fall The posterior probability from DAPC of assignment to the Fall subpopulation in assignment step 3b.
LateFall The posterior probability from DAPC of assignment to the Late-Fall subpopulation in assignment step 3b.
Missingness The proportion of missing data for the sample from the validation_set.snp_panel.singlereadcalls.csv file (e.g. 0.61 = 61% missing allele calls)
Maj.pop1.Call Population call based on the greatest posterior assignment probability for assignment step 1.
Maj.pop2.Call Population call based on the greatest posterior assignment probability for assignment step 2 (note: samples that assigned to "Winter" in step 1 were not assigned at this step and have NA listed in this column).
Spring.subpop.Call Subpopulation call based on greatest posterior assignment probability for assignment step 3a (note: samples with a Maj.pop1.Call of "Winter" or a Maj.pop2.Call of "Fall/LateFall" were not assigned to a WildSpring subpopulation and have NA listed for this column).
Fall.subpop.Call Subpopulation call based on greatest posterior assignment probability for assignment step 3b (note: samples with a Maj.pop1.Call of "Winter" or a Maj.pop2.Call of "WildSpring" were not assigned to a fall subpopulation and have NA listed for this column).
true_maj.pop1 The true population of origin for this sample at assignment step 1.
true_maj.pop2 The true population of origin for this sample at assignment step 2 (note: samples whose true origin is "Winter" have NA listed in this column).
true_subpop The true subpopulation of origin of this sample (note: samples whose true origin is "Winter" are not part of a subpopulation and have NA listed in this column).
juvenile.1kaln_cutoff.snp_panel.singlereadcalls.csv
File with allele calls from single read sampling for samples in the archive juvenile dataset at each site in the SNP panel. Samples that failed the 1,000 properly-paired sequencing read cutoff are excluded. Each line contains data for one SNP in the SNP panel. Allele calls are coded as 0 for the major allele, 1 for the minor allele, and -1 for missing data. See below for descriptions of column labels:
chr Chromosome where the SNP for that line is located
pos Position of the SNP on the chromosome
major The major allele in A,C,G,T format
minor The minor allele in A,C,G,T format
ind[#] All following columns contain allele call data for individual samples. Column name corresponds to sample ID. See sample_metadata.csv for specific information on each sample.
sample_metadata.csv
File containing metadata (collection info, phenotype data, etc.) for all archive juvenile samples. See below for column descriptions:
Order_all Useful column for sorting samples. Number refers to the order of all samples in the DNA plates
Order_1k_cutoff Useful column for sorting samples. Number refers to order of samples in DNA plates that passed the 1,000 sequencing read filter
Sample_ID CDFW identification code for each sample
Plate_name Name of the DNA plate the sample's DNA is a part of
Well The well location of the sample within the DNA plate
Sample_name Sample name combining DNA plate and well info
Name_in_singleread_file The name of the sample in juvenile.1kaln_cutoff.snp_panel.singlereadcalls.csv
SampleDate Sample collection date at Chipps Island (note: a few samples were found to have ambiguous date data after sequencing, indicated by NA; their information is included in this spreadsheet, but they were excluded from downstream analyses)
SampleYear Sample collection year (note: a few samples were found to have ambiguous date data after sequencing. These contain an NA in this column, and were excluded from downstream analyses)
CollLoc Specific collection location
properpair_aligment_count The final number of properly paired alignments to the reference genome for each sample
make_1k_cutoff? Indicates whether the sample passed the 1,000 read filtering cutoff for inclusion in downstream analyses
CDFW_race Population assignment based on length-at-date criteria given by CDFW to some samples at the time of collection
Fork_length Fork lengths of each juvenile in millimeters
Ad_clip Indicates whether an adipose fin was present when the tissue sample was collected
NC_056456.1:13427410 Genotype call for a SNP in the GREB1L region associated with run timing
greb1l_posterior Posterior probability of the genotype call for NC_056456.1:13427410
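Two of the column names above (make_1k_cutoff? and NC_056456.1:13427410) are not syntactically valid R names, so read.csv() will alter them unless told otherwise. The sketch below illustrates this on a made-up table that borrows a few of the column names listed here; the sample IDs, the date format, the "yes"/"no" coding, and the genotype values are assumptions for illustration only.

```r
# Toy stand-in for a few columns of sample_metadata.csv (all values made up).
toy_meta <- "Sample_ID,SampleDate,Fork_length,make_1k_cutoff?,NC_056456.1:13427410
A001,2001-03-14,95,yes,0
A002,NA,110,no,1
A003,2004-12-02,132,yes,1"

# Default behaviour: special characters in column names are replaced by dots.
meta_default <- read.csv(text = toy_meta, stringsAsFactors = FALSE)
print(names(meta_default))

# check.names = FALSE keeps the names exactly as written in the header.
meta <- read.csv(text = toy_meta, stringsAsFactors = FALSE, check.names = FALSE)
print(names(meta))

# Non-syntactic names are accessed with [[ ]] or backticks.
passed <- meta[["make_1k_cutoff?"]] == "yes"

# Keep samples that passed the read cutoff and have an unambiguous date.
usable <- meta[passed & !is.na(meta$SampleDate), ]
print(usable$Sample_ID)
```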
juvenile.assignment_posteriors_calls.csv
File containing archive juvenile dataset population assignment results. See below for column descriptions:
Sample Sample name
Missingness The proportion of missing data for a given sample from the juvenile.1kaln_cutoff.snp_panel.singlereadcalls.csv file (e.g. 0.61 = 61% missing allele calls)
Fall_LateFall_WildSpring The posterior probability from DAPC of assignment to the Fall/Late-fall/WildSpring population in assignment step 1.
Winter The posterior probability from DAPC of assignment to the Winter population in assignment step 1.
Fall_LateFall The posterior probability from DAPC of assignment to the Fall/Late-fall population in assignment step 2.
WildSpring The posterior probability from DAPC of assignment to the WildSpring population in assignment step 2.
WildSpring_Butte The posterior probability from DAPC of assignment to the WildSpring Butte subpopulation in assignment step 3a.
WildSpring_MillDeer The posterior probability from DAPC of assignment to the WildSpring Mill/Deer subpopulation in assignment step 3a.
Fall The posterior probability from DAPC of assignment to the Fall subpopulation in assignment step 3b.
LateFall The posterior probability from DAPC of assignment to the Late-Fall subpopulation in assignment step 3b.
Maj.pop1.Call Population call based on the greatest posterior assignment probability for assignment step 1.
Maj.pop2.Call Population call based on the greatest posterior assignment probability for assignment step 2 (note: samples assigning to "Winter" in step 1 were not assigned at step 2 and have an NA in this column).
Spring.subpop.Call Subpopulation call based on greatest posterior assignment probability for assignment step 3a (note: samples with a Maj.pop1.Call of "Winter" or a Maj.pop2.Call of "Fall/LateFall" were not assigned to a WildSpring subpopulation and have NA listed for this column).
Fall.subpop.Call Subpopulation call based on greatest posterior assignment probability for assignment step 3b (note: samples with a Maj.pop1.Call of "Winter" or a Maj.pop2.Call of "WildSpring" were not assigned to a fall subpopulation and have NA listed for this column).
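The relationship between the posterior columns and the four call columns can be sketched as a cascade in which each later step applies only to samples that took the relevant branch at the earlier step. The example below is a hypothetical reconstruction from the column descriptions above, not code from cv_juv_assignments.R. The helper pick(), the posterior values, and all call labels other than "Winter", "Fall/LateFall", and "WildSpring" are assumptions, as is the choice to show posteriors for every step on every sample.

```r
# Toy posteriors for four samples (all values made up).
juv <- data.frame(
  Sample = c("s1", "s2", "s3", "s4"),
  Fall_LateFall_WildSpring = c(0.02, 0.97, 0.99, 0.95),
  Winter                   = c(0.98, 0.03, 0.01, 0.05),
  Fall_LateFall            = c(0.50, 0.10, 0.92, 0.88),
  WildSpring               = c(0.50, 0.90, 0.08, 0.12),
  WildSpring_Butte         = c(0.40, 0.75, 0.55, 0.20),
  WildSpring_MillDeer      = c(0.60, 0.25, 0.45, 0.80),
  Fall                     = c(0.70, 0.35, 0.81, 0.30),
  LateFall                 = c(0.30, 0.65, 0.19, 0.70),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label with the greater posterior, NA if either is NA.
pick <- function(p_a, p_b, label_a, label_b) {
  ifelse(is.na(p_a) | is.na(p_b), NA_character_,
         ifelse(p_a >= p_b, label_a, label_b))
}

# Step 1: Winter vs. everything else.
juv$Maj.pop1.Call <- pick(juv$Winter, juv$Fall_LateFall_WildSpring,
                          "Winter", "Fall_LateFall_WildSpring")

# Step 2: only for samples not called Winter at step 1.
juv$Maj.pop2.Call <- pick(juv$Fall_LateFall, juv$WildSpring,
                          "Fall/LateFall", "WildSpring")
juv$Maj.pop2.Call[juv$Maj.pop1.Call %in% "Winter"] <- NA

# Step 3a: only for samples called WildSpring at step 2.
juv$Spring.subpop.Call <- pick(juv$WildSpring_Butte, juv$WildSpring_MillDeer,
                               "WildSpring_Butte", "WildSpring_MillDeer")
juv$Spring.subpop.Call[!(juv$Maj.pop2.Call %in% "WildSpring")] <- NA

# Step 3b: only for samples called Fall/LateFall at step 2.
juv$Fall.subpop.Call <- pick(juv$Fall, juv$LateFall, "Fall", "LateFall")
juv$Fall.subpop.Call[!(juv$Maj.pop2.Call %in% "Fall/LateFall")] <- NA

print(juv[, c("Sample", "Maj.pop1.Call", "Maj.pop2.Call",
              "Spring.subpop.Call", "Fall.subpop.Call")])
```

Using `%in%` rather than `==` when masking keeps the comparison well defined for samples whose earlier call is already NA.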
Sharing/Access Information
The training set data was derived from previously published sequencing data publicly available from Meek et al. (2020) (doi.org/10.1139/cjfas-2019-0171).
The validation data set was derived from previously published sequencing data publicly available from Baerwald et al. (2023) (doi.org/10.1111/1755-0998.13777).
Code/Software
The following describes scripts used for analyses, required R packages, required input files for each script, and the resulting output files. The scripts are included in this Dryad entry. All scripts are written in R, and the following R version was used when performing analyses:
R version 4.2.0 (2022-04-22) -- "Vigorous Calisthenics".
leave_one_out.winterVall.R
Runs leave-one-out analyses for winter vs all other pops (assignment step 1)
Required R packages (with version used at time of analysis):
adegenet (2.1.6)
Required input files:
trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
trainingset_allruns_filtered.pop_key.csv
Output files:
leave_one_out.maj_pop_1.results.csv (concatenated)
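The general leave-one-out idea (hold out one training individual, fit the classifier on the rest, and record the posterior probabilities for the held-out individual) can be sketched as follows. The deposited script uses DAPC from adegenet; the sketch below instead uses prcomp() followed by MASS::lda() as a plainly named stand-in (PCA followed by linear discriminant analysis), so that it runs with packages shipped with R. The simulated genotypes, the population labels, the number of retained principal components, and the helper name loo_one are all hypothetical, and the handling of missing data is omitted entirely.

```r
library(MASS)  # lda(); a recommended package distributed with R

set.seed(1)
n_per_pop <- 15
n_snp     <- 40
pop <- rep(c("Winter", "Other"), each = n_per_pop)
n   <- length(pop)

# Simulated 0/1 single-read calls with different allele frequencies per group.
freq <- rbind(Winter = runif(n_snp, 0.6, 0.9),
              Other  = runif(n_snp, 0.1, 0.4))
geno <- matrix(rbinom(n * n_snp, size = 1, prob = freq[pop, ]), nrow = n)
rownames(geno) <- paste0("ind", seq_len(n) - 1)
colnames(geno) <- paste0("snp", seq_len(n_snp))

# Hypothetical helper: posterior for individual i from a model fit without i.
loo_one <- function(i, geno, pop, n_pca = 5) {
  train  <- geno[-i, , drop = FALSE]
  pca    <- prcomp(train, center = TRUE, scale. = FALSE)
  scores <- pca$x[, seq_len(n_pca), drop = FALSE]
  fit    <- lda(scores, grouping = factor(pop[-i]))
  # Project the held-out individual onto the training PCs, then classify.
  held <- predict(pca, newdata = geno[i, , drop = FALSE])
  held <- held[, seq_len(n_pca), drop = FALSE]
  predict(fit, held)$posterior[1, ]
}

post <- t(sapply(seq_len(n), loo_one, geno = geno, pop = pop))
rownames(post) <- rownames(geno)

call     <- colnames(post)[max.col(post, ties.method = "first")]
accuracy <- mean(call == pop)
print(head(round(post, 3)))
print(accuracy)
```

Refitting the PCA inside the loop, rather than once on all individuals, keeps the held-out individual from influencing the axes it is later projected onto.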
leave_one_out.flfVws.R
Runs leave-one-out analyses for fall/late-fall vs. wild spring (assignment step 2)
Required R packages (with version used at time of analysis):
adegenet (2.1.6)
Required input files:
trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
trainingset_allruns_filtered.pop_key.csv
Output files:
leave_one_out.maj_pop_2.results.csv (concatenated)
leave_one_out.sbVsmd.R
Runs leave-one-out analyses for spring Butte vs. spring Mill/Deer (assignment step 3a)
Required R packages (with version used at time of analysis):
adegenet (2.1.6)
Required input files:
trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
trainingset_allruns_filtered.pop_key.csv
Output files:
leave_one_out.subpop_sprButte_V_sprMillDeer.results.csv (concatenated)
leave_one_out.fVlf.R
Runs leave-one-out analyses for fall vs. late-fall (assignment step 3b)
Required R packages (with version used at time of analysis):
adegenet (2.1.6)
Required input files:
trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
trainingset_allruns_filtered.pop_key.csv
Output files:
leave_one_out.subpop_fall_V_latefall.results.csv (concatenated)
validation_analyses.R
Generates population assignments for validation sample set
Required R packages (with version used at time of analysis):
adegenet (2.1.6)
ggplot2 (3.3.6)
dplyr (1.1.2)
Required input files:
trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
trainingset_allruns_filtered.pop_key.csv
validation_set.snp_panel.singlereadcalls.csv
validation_set.pop_key.csv
Output files:
validation_set.assignment_results.csv
cv_juv_assignments.R
Generates population assignments for archive juvenile sample set
Required R packages (with version used at time of analysis):
adegenet (2.1.6)
dplyr (1.1.2)
Required input files:
trainingset_allruns_filtered.snp_panel.singlereadcalls.csv
trainingset_allruns_filtered.pop_key.csv
juvenile.1kaln_cutoff.snp_panel.singlereadcalls.csv
Output files:
juvenile.assignment_posteriors_calls.csv