Sharo, Andrew1; Supple, Megan1; Cabrera, Randy1; Seligmann, William1; Sacco, Samuel1; Columbus, Cassondra1; Pearse, Devon1; Shapiro, Beth1; Garza, John Carlos1

Published Jun 15, 2024 on Dryad. https://doi.org/10.5061/dryad.g79cnp5z9

Steelhead/rainbow trout (Oncorhynchus mykiss) is an imperiled salmonid with two main life history strategies: migrate to the ocean or remain in freshwater. Domesticated hatchery forms of this species have been stocked into almost all California waterbodies, possibly resulting in introgression into natural populations and altered population structure.

We compared whole-genome sequence data from contemporary populations against a set of museum population samples of steelhead from the same locations that were collected prior to most hatchery stocking.

We observed minimal introgression and few steelhead-hatchery trout hybrids despite a century of extensive stocking. Our historical data show signals of introgression with a sister species and indications of an early hatchery facility. Finally, we found that migration-associated haplotypes have become less frequent over time, a likely adaptation to decreased opportunities for migration. Since contemporary migration-associated haplotype frequencies have been used to guide species management, we consider this to be a rare example of shifting baseline syndrome that has been validated with historical data.

We suggest cautious optimism that a century of hatchery stocking has had minimal impact on California steelhead population genetic structure, but we note that continued shifts in life history may lead to further declines in the ocean-going form of the species.

This directory contains the data and code to reproduce analyses in the manuscript "Recent adaptation in a threatened salmonid revealed by museum genomics." Most of these data are input to or output from methods commonly used in conservation genetics. These data are largely derived from low-coverage whole genome sequencing of 75 historical steelhead trout, 75 contemporary steelhead trout, and 50 contemporary hatchery rainbow trout. Please see the manuscript for additional details.

Modern and historical natural populations:

Eel River
Coyote Creek
San Lorenzo River
Llagas Creek
Nacimiento River

Hatchery strains:

Coleman
Eagle Lake
Kamloops
Mt. Shasta
Mt. Whitney

Usage notes

All files can be opened with a standard text editor. Some files are associated with specific programs, which are noted in the file description. All compressed files can be uncompressed with gzip.

File types

.txt / .txt.gz: These are regular (.txt) or gzip-compressed (.txt.gz) text files that can be opened using a text editor. They often serve as input for programs.
.pestPG: These are sliding window nucleotide diversity estimates created by angsd. They are all formatted as described in https://popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests. Briefly, each is a tab-separated file with 14 columns. The first column contains information about the region. The second and third columns are the reference name and the center of the window. The next 5 columns are 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, and L. The next 5 columns are 5 different neutrality test statistics: Tajima's D, Fu and Li's F, Fu and Li's D, Fay's H, and Zeng's E. The final column is the effective number of sites with data in the window. They can be opened using a text editor.
.csv: These are comma-separated values files that often serve as input to other programs. They can be opened using a text editor.
.Q: These are output files from admixture (https://dalexander.github.io/admixture/). Each column is one cluster. These files can be opened using a text editor.
.fna / .fna.gz: These are regular (.fna) or gzip-compressed (.fna.gz) nucleotide fasta files (https://en.wikipedia.org/wiki/FASTA_format). These are a standard format for storing genomic data. They can be opened using a text editor.
.gff: These are general feature format files, a standard format for storing genomic features. Each row records a distinct feature. They can be opened using a text editor.
.evalout: These are output files from SmartPCA (https://github.com/chrchang/eigensoft/blob/master/POPGEN/README) that record the eigenvalues for each principal component. They can be opened using a text editor.
.evecout: These are output files from SmartPCA (https://github.com/chrchang/eigensoft/blob/master/POPGEN/README) that record the eigenvectors for each sample for the first 10 principal components. They can be opened using a text editor.
.logfile: These are output files created by EIGENSOFT software (https://github.com/DReichLab/EIG). They can be opened using a text editor.
.parfile: These are parameter files that are used as input to EIGENSOFT software (https://github.com/DReichLab/EIG). These files provide arguments to the software, and these arguments are defined in https://github.com/DReichLab/AdmixTools/blob/master/README.3PopTest. They can be opened using a text editor.
.ind: These are custom format files that are used as input to EIGENSOFT software and link each sample to its population. They can be opened using a text editor.
.qp3Pop: These are custom format files that are used as input to EIGENSOFT software. They can be opened using a text editor.
.map: These are custom format files used as input to PLINK software (https://www.cog-genomics.org/plink/1.9/formats#map). The format of this file is as follows: a text file with no header line, and one line per variant with the following 3-4 fields: 1) Chromosome code 2) Variant identifier 3) Position in morgans or centimorgans (optional) 4) Base-pair coordinate. They can be opened using a text editor.
.nosex: These are custom format files used as input to PLINK software. This is simply a list of samples with ambiguous sex codes. They can be opened using a text editor.
.ped: These are custom format files used as input to PLINK software. The format of this file is as follows: Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are 1) Family ID ('FID') 2) Within-family ID ('IID'; cannot be '0') 3) Within-family ID of father ('0' if father isn't in dataset) 4) Within-family ID of mother ('0' if mother isn't in dataset) 5) Sex code ('1' = male, '2' = female, '0' = unknown) 6) Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control). The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on. They can be opened using a text editor.
.raxml: These are output files from RAxML (https://cme.h-its.org/exelixis/web/software/raxml/) and are formatted as a Newick file described in https://en.wikipedia.org/wiki/Newick_format. They can be opened using a text editor or any tree-reading software.
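As an illustration of the 14-column .pestPG layout described above, the following is a minimal sketch of a reader that uses only the Python standard library. The column labels (theta_pairwise, n_sites, and so on) are our own names for the fields and are not taken from angsd. It assumes that header lines begin with "#". As we understand the angsd documentation, the theta columns are sums over each window, so the sketch divides by the number of sites to obtain a per-site value; please confirm this against the angsd page linked above before relying on it.

```python
import io

# Illustrative labels for the 14 tab-separated fields (not angsd's own names).
PESTPG_COLUMNS = [
    "region", "chrom", "win_center",
    "theta_watterson", "theta_pairwise", "theta_fuli", "theta_fayh", "theta_l",
    "tajima_d", "fuli_f", "fuli_d", "fay_h", "zeng_e",
    "n_sites",
]


def read_pestpg(handle):
    """Parse an open text handle of a .pestPG file into a list of dicts."""
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header (assumed to start with '#')
        fields = line.split("\t")
        if len(fields) != len(PESTPG_COLUMNS):
            raise ValueError(f"expected 14 fields, found {len(fields)}")
        row = dict(zip(PESTPG_COLUMNS, fields))
        row["win_center"] = int(row["win_center"])
        row["n_sites"] = int(row["n_sites"])
        for name in PESTPG_COLUMNS[3:13]:  # theta estimators and test statistics
            row[name] = float(row[name])
        rows.append(row)
    return rows


def per_site_pi(row):
    """Pairwise theta divided by the number of sites with data in the window."""
    if row["n_sites"] == 0:
        return None
    return row["theta_pairwise"] / row["n_sites"]


# Made-up example line, for demonstration only.
example = (
    "#header line (skipped)\n"
    "(0,10)(1,100)(0,100)\tchrA\t50\t2.0\t1.0\t0.5\t0.4\t0.3"
    "\t-1.2\t0.1\t0.2\t0.3\t0.4\t100\n"
)
windows = read_pestpg(io.StringIO(example))
print(windows[0]["chrom"], per_site_pi(windows[0]))
```

For a real file, pass an open text handle, for example open(path) with the path to one of the .pestPG files on your system.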

Description of the data and file structure

Directory of data present:

Input data

all200Trout.mafs.genmap.repeatMasked.thinned.mafs.tsv.gz - a compressed .tsv file that contains the output of angsd, containing variable sites (after filtering). Columns are chromosome, position, major allele, minor allele, reference allele.
GCF_013265735.2_USDA_OmykA_1.1_genomic.fna.gz - gzip compressed fasta file that contains the OmykA_1.1 nuclear genome
MitoOnly.GCF_013265735.2_USDA_OmykA_1.1_genomic.fna - mitochondrial reference genome, in fasta format
plink.pruned.map - plink map file used in f3, d-stat, and pca analysis. Columns are chromosome, variant id, position in centimorgans, position in basepairs.
plink.pruned.nosex - plink sample list file used in f3, d-stat, and pca analysis. Columns are sample position, sample ID
plink.pruned.ped - plink ped file used in f3, d-stat, and pca analysis. Formatted as described in https://www.cog-genomics.org/plink/1.9/formats#ped
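The PLINK .map and .ped layouts described under "File types" can be read line by line. The following is a minimal sketch under that description, using only the Python standard library; the function names and dictionary keys are our own and are not part of PLINK. It assumes whitespace-separated fields and treats the genetic position as absent when a .map line has only 3 fields.

```python
def parse_map_line(line):
    """Parse one .map line: chromosome, variant ID, [genetic position], base pair."""
    fields = line.split()
    if len(fields) == 4:
        chrom, variant_id, genetic_pos, bp = fields
        genetic_pos = float(genetic_pos)
    elif len(fields) == 3:
        chrom, variant_id, bp = fields
        genetic_pos = None  # optional column omitted
    else:
        raise ValueError(f"expected 3 or 4 fields, found {len(fields)}")
    return {"chrom": chrom, "variant_id": variant_id,
            "genetic_pos": genetic_pos, "bp": int(bp)}


def parse_ped_line(line):
    """Parse one .ped line: six sample fields followed by two alleles per variant."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("a .ped line needs 2V+6 fields")
    fid, iid, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    # Pair up consecutive allele calls; ('0', '0') means no call.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"fid": fid, "iid": iid, "father": father, "mother": mother,
            "sex": sex, "phenotype": phenotype, "genotypes": genotypes}


# Made-up example lines, for demonstration only.
variant = parse_map_line("1 snp1 0 12345")
sample = parse_ped_line("FAM1 S1 0 0 0 -9 A A 0 0 G T")
missing = sum(1 for a, b in sample["genotypes"] if a == "0" or b == "0")
print(variant["bp"], len(sample["genotypes"]), missing)
```

Because each .ped line holds every variant for one sample, reading the file one line at a time keeps memory use modest even for many variants.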

Output data

Sliding window nucleotide diversity estimates

These files are sliding window nucleotide diversity estimates, and include all samples (including hybrids). They are all formatted as described in https://popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests

File Naming Conventions

Rather than write out the 30 files included in this category, I will describe how to interpret the file names. These files are named "[population][historical or modern (optional)].noTrans.[noHybrids (optional)].thetasWindows.gz.pestPG" where:

"population" is one of the above-described natural populations or hatchery strains, often abbreviated
"historical or modern" is either His or Mod or absent for hatchery strains as all hatchery strains are modern.
"noTrans" indicates that nucleotide transitions are not included
"noHybrids" indicates that hybrid samples are not included
For example, nacimientoHis.noTrans.noHybrids.thetasWindow.gz.pestPG describes the sliding window nucleotide diversity estimates for the historical Nacimiento River population, without nucleotide transitions, and without hybrids.
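The naming convention above can be turned into a small parser. The following is a minimal sketch using a regular expression; it assumes that population abbreviations contain only letters, and it accepts both "thetasWindows" and "thetasWindow" because both spellings appear in this README. The function name and the returned keys are our own.

```python
import re

# [population][His|Mod (optional)].noTrans.[noHybrids. (optional)]thetasWindow(s).gz.pestPG
PESTPG_NAME = re.compile(
    r"^(?P<population>[A-Za-z]+?)"      # population or strain (letters only: an assumption)
    r"(?P<era>His|Mod)?"                # absent for hatchery strains
    r"\.noTrans\."
    r"(?P<no_hybrids>noHybrids\.)?"
    r"thetasWindows?\.gz\.pestPG$"
)

ERA_LABELS = {"His": "historical", "Mod": "modern"}


def parse_pestpg_name(file_name):
    """Return the parts of a diversity file name, or None if it does not match."""
    match = PESTPG_NAME.match(file_name)
    if match is None:
        return None
    return {
        "population": match.group("population"),
        # None means no His/Mod tag, as for hatchery strains (all modern).
        "era": ERA_LABELS.get(match.group("era")),
        "hybrids_excluded": match.group("no_hybrids") is not None,
    }


print(parse_pestpg_name("nacimientoHis.noTrans.noHybrids.thetasWindow.gz.pestPG"))
```

File names that do not follow the convention return None, which makes it easy to skip unrelated files when listing a directory.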

Isolation by distance plot data

These files all include the x (marine distance) and y (genetic similarity) data for isolation by distance (ibd) plots

historicalxyIBDNew.csv - x and y positions for historical ibd plot
modernNoWeirdxyIBDNew.csv - x and y positions for modern ibd plot
modernWithWeirdxyIBDNew.csv - x and y positions for modern ibd plot
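To get a first look at the relationship between marine distance and genetic similarity in these files, the following minimal sketch reads x and y values and computes a Pearson correlation with the Python standard library. The column layout is not specified here, so the sketch assumes that the first two columns of each row are x and y and silently skips any row that is not numeric (such as a header). This is only a descriptive statistic; it is not the Mantel test provided by the mantel package listed under Code/Software, and it is not necessarily the analysis used in the manuscript.

```python
import csv
import io
import math


def read_xy(handle):
    """Read (x, y) pairs from the first two columns, skipping non-numeric rows."""
    points = []
    for row in csv.reader(handle):
        try:
            points.append((float(row[0]), float(row[1])))
        except (ValueError, IndexError):
            continue  # header, blank, or malformed row
    return points


def pearson_r(points):
    """Pearson correlation coefficient between x and y."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0 or syy == 0:
        raise ValueError("x or y has no variation")
    return sxy / math.sqrt(sxx * syy)


# Made-up values, for demonstration only.
example = "x,y\n0,1.0\n1,0.5\n2,0.0\n"
points = read_xy(io.StringIO(example))
print(len(points), pearson_r(points))
```

Under isolation by distance, genetic similarity is expected to fall as distance grows, so a negative coefficient would be the pattern to look for.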

Principal Component Analysis

These files were all generated using SmartPCA (https://github.com/chrchang/eigensoft/blob/master/POPGEN/README)

newVarsNewNacimWithGuntherThinCutthroat.4cut.evalout - eigenvalue output from smartpca, for all samples
newVarsNewNacimWithGuntherThinCutthroat.4cut.evecout - eigenvector output from smartpca, for all samples
newVarsNewNacimWithGuntherThinCutthroatInversionOnly.evalout - eigenvalue output from smartpca, for all samples, only Omy05 inversion
newVarsNewNacimWithGuntherThinCutthroatInversionOnly.evecout - eigenvector output from smartpca, for all samples, only Omy05 inversion
newVarsNewNacimWithGuntherThinCutthroat.nohybrids.evalout - eigenvalue output from smartpca, for non-hybrid samples
newVarsNewNacimWithGuntherThinCutthroat.nohybrids.evecout - eigenvector output from smartpca, for non-hybrid samples
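The eigenvalue and eigenvector files can be read with a few lines of Python. The following is a minimal sketch that assumes the usual smartpca layout as we understand it: the .evalout file has one eigenvalue per line, and the .evecout file has a first line beginning with "#" followed by one whitespace-separated line per sample containing the sample ID, the principal component coordinates, and a population label. Please check these assumptions against the files themselves. The function names are our own.

```python
import io


def read_eval(handle):
    """Read one eigenvalue per line."""
    return [float(line) for line in handle if line.strip()]


def variance_explained(eigenvalues):
    """Fraction of the summed eigenvalues attributable to each component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def read_evec(handle):
    """Read sample ID, PC coordinates, and population label for each sample."""
    samples = []
    for line in handle:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and the leading '#' line
        fields = line.split()
        samples.append({
            "sample": fields[0],
            "pcs": [float(value) for value in fields[1:-1]],
            "population": fields[-1],
        })
    return samples


# Made-up values, for demonstration only.
eval_text = "3.0\n1.0\n"
evec_text = "#eigvals: 3.0 1.0\nS1 0.10 -0.20 popA\nS2 -0.30 0.40 popB\n"
fractions = variance_explained(read_eval(io.StringIO(eval_text)))
samples = read_evec(io.StringIO(evec_text))
print(fractions, samples[0]["population"])
```

The fractions returned here are relative to the eigenvalues present in the file, so they depend on how many components smartpca was asked to report.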

F3 and D statistics

These files were all generated using qp3Pop (https://github.com/DReichLab/AdmixTools/blob/master/README.3PopTest) or qpDstat (https://github.com/DReichLab/AdmixTools/blob/master/README.Dstatistics)

newVarsNewNacimWithGuntherThinCutthroat.f3.pops.logfile - f3 stats for modern and historical, no hybrids
newVarsNewNacimWithGuntherThinCutthroat.f3.subset.allTogether.logfile - f3 stats for modern populations with hybrids
newVarsNewNacimWithGuntherThinCutthroat.f3.subset.allTogether.parfile - parameter file for f3 analysis
newVarsNewNacimWithGuntherThinCutthroat.weirdNacim.d.logfile - qpDstat d-statistic results
newVarsWithGuntherThinCutthroat.f3.historical.logfile - f3 stats for individual pairs of historical samples
newVarsWithGuntherThinCutthroat.f3.historical.parfile - parameter file for f3 analysis
newVarsWithGuntherThinCutthroat.f3.modernAll.logfile - f3 stats for individual pairs of modern samples
newVarsWithGuntherThinCutthroat.f3.modernAll.parfile - parameter file for f3 analysis
newVarsWithGuntherThinCutthroat.f3.modernAll.pops.parfile - parameter file for f3 analysis
newVarsWithGuntherThinCutthroat.f3.modernNotHybrid.logfile - f3 stats for individual pairs of modern samples without hybrids
newVarsWithGuntherThinCutthroat.f3.modernNotHybrid.parfile - parameter file for f3 analysis

Admixture analysis

forAdmixtureNewNacim4cut.3.Q - output from admixture (https://dalexander.github.io/admixture/) for all samples, 3 clusters, each column is one cluster.
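Since each column of the .Q file is one cluster, a simple way to summarize it is to find the cluster with the largest ancestry fraction for each row. The following minimal sketch assumes whitespace-separated values with one row per sample, in the same order as the samples given to admixture; the function names are our own.

```python
import io


def read_q(handle):
    """Read ancestry fractions: one row per sample, one column per cluster."""
    return [[float(value) for value in line.split()]
            for line in handle if line.strip()]


def dominant_cluster(fractions):
    """Zero-based index of the cluster with the largest ancestry fraction."""
    return max(range(len(fractions)), key=lambda index: fractions[index])


# Made-up values, for demonstration only.
q_text = "0.7 0.2 0.1\n0.1 0.1 0.8\n"
q_rows = read_q(io.StringIO(q_text))
print([dominant_cluster(row) for row in q_rows])
```

The .Q file itself carries no sample IDs, so rows need to be matched to samples using the input order.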

Mitochondrial analysis

RAxML_bipartitions.all3xfilteredFinal.withNewNacim.min2thirds.raxml - RAxML output file from mitochondrial analysis. Formatted as a Newick file described in https://en.wikipedia.org/wiki/Newick_format
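For a quick check of which samples are in the tree without installing tree software, the following minimal sketch extracts tip labels from a Newick string with a regular expression. It assumes unquoted labels that contain no parentheses, commas, colons, or semicolons. Labels that follow a closing parenthesis (such as support values on internal nodes) are not returned. For plotting or any real tree manipulation, a dedicated library such as toytree, listed under Code/Software, is the better choice.

```python
import re

# A tip label starts right after '(' or ',' and runs until the next
# structural character. Internal node labels follow ')' and are not matched.
TIP_LABEL = re.compile(r"[(,]\s*([^(),:;]+)")


def newick_tip_labels(newick):
    """Return tip labels in the order they appear in the Newick string."""
    return [label.strip() for label in TIP_LABEL.findall(newick)]


# Made-up tree, for demonstration only.
example_tree = "((A:0.1,B:0.2)95:0.05,C:0.3);"
print(newick_tip_labels(example_tree))
```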

Miscellaneous

ancient.submat.txt - nucleotide substitution matrices for ancient DNA, where columns and rows are both A,C,G,T.
colorsToModify.csv - colors used with samples in certain plots, columns are sample ID, population, hex color
GCF_013265735.2_USDA_OmykA_1.1_angsd_region_file.txt - angsd region file that includes variable sites (after filtering). Columns are chromosome, start position, end position
GCF_013265735.2_USDA_OmykA_1.1_genomic.fna.out.gff - output of repeat masker for OmykA_1.1 genome in standard GFF format
GCF_013265735.2_USDA_OmykA_1.1_genomic.genmap.txt.gz - gzip compressed output of GenMap (https://github.com/cpockrandt/genmap) for OmykA_1.1 genome. Each chromosome is indicated by ">" character, and the following line is the mappability for each position in that chromosome.
newVarsNewNacimWithGuntherThinCutthroat.pileupCaller.ref.pops.ind - key for matching samples and populations
newVarsNewNacimWithGuntherThinCutthroat.pileupCaller.ref.pops.pca.ind - additional key for matching samples and populations
newVarsNewNacimWithGuntherThinCutthroat.pops.qp3Pop - input trios of samples for qp3Pop
newVarsNewNacimWithGuntherThinCutthroat.pops.together.qp3Pop - input trios of samples for qp3Pop
newVarsWithGuntherThinCutthroat.qp3pop.historicalPairwise.txt - input trios of samples for qp3Pop
newVarsWithGuntherThinCutthroat.qp3pop.modernAllPairwise.txt - input trios of samples for qp3Pop
newVarsWithGuntherThinCutthroat.qp3pop.modernNotHybridPairwise.txt - input trios of samples for qp3Pop
riverDistancesWithSFBayKm.csv - river distances used in isolation by distance plots. Columns are river 1, river 2, marine distance between rivers 1 and 2.
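The GenMap output described above (a ">" line naming each chromosome, followed by one line of per-position mappability values) can be read as pairs of lines. The following is a minimal sketch that assumes the values are separated by whitespace. The threshold of 1.0 used to flag positions is only an illustrative default and is not taken from the manuscript; please see the Methods for the criteria actually used. The function names are our own.

```python
import io


def iter_genmap(handle):
    """Yield (chromosome, mappability values) from '>'-delimited text."""
    chrom = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            chrom = line[1:]
        elif chrom is None:
            raise ValueError("mappability line found before any '>' header")
        else:
            yield chrom, [float(value) for value in line.split()]
            chrom = None  # the next data line must follow a new header


def low_mappability_positions(values, threshold=1.0):
    """1-based positions with mappability below the (illustrative) threshold."""
    return [index + 1 for index, value in enumerate(values) if value < threshold]


# Made-up values, for demonstration only.
genmap_text = ">chr1\n1 0.5 1\n>chr2\n0.25 1\n"
for chrom, values in iter_genmap(io.StringIO(genmap_text)):
    print(chrom, low_mappability_positions(values))
```

For the compressed file, a text handle can be opened with gzip.open(path, "rt"). Each chromosome is held in memory as a single list, so this approach may need a good deal of memory for a full genome.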

Code/Software

This analysis was performed with both shell scripts and Jupyter notebooks. We recommend running these files in a Linux environment and in Jupyter, which can easily be installed using conda. The Jupyter notebooks should function in a Python 3.8 environment. If you have any questions about running these files, please contact sharo@berkeley.edu or carlosjg@ucsc.edu.

The following python packages are required to run the Jupyter notebooks:

gzip (part of the Python standard library)
matplotlib
mantel
numpy
pandas
seaborn
toytree
toyplot

We note that for many of these scripts, our paths to certain programs and files are hardcoded. You should review the scripts and ensure that these paths are replaced with the location of the program or file on your system.

To reproduce the analyses from the manuscript, we recommend running the scripts and notebooks in the order below. However, if you are interested in a specific analysis, it may be possible to run only the associated script, as we have provided most intermediate files. The file names for intermediate files are identical to those provided in the Dryad repository, but you will need to replace the paths with the location of these Dryad files on your system.

Use process_and_align_historical_reads.sh to align reads from historical samples
Use process_and_align_contemporary_reads.sh to align reads from contemporary samples
Use angsd as described in Methods to identify single nucleotide variants with a minor allele frequency of 5% or greater
Run GenMap as described in Methods
Run maskGenMap.ipynb to mask variants based on GenMap output
Run RepeatMasker as described in Methods
Run addRepeatMasking.ipynb to mask variants for repeats and thin variants
Run removeReferenceBias.sh to reduce reference bias using Günther and Nettelblad (2019) scripts
Run performFstatsAndPCA.sh to call pseudo-haplotypes with pileupcaller, perform f-statistics, d-statistics, and pca
Run generatePCA.ipynb to generate Fig. 1D and Fig. 2C
Run generateAdxmiturePlots.ipynb to generate Fig. 1E and Fig. S2
Run ibdPopulationComparisons.ipynb then ibdIndividualComparisons.ipynb to generate Figs. 2A,B
Run analyzeIntrogression.ipynb to generate Fig. 3B and Fig. S4
Run performPCAInversionOnly.sh to perform pca analysis for just the inversion on chr Omy05
Run inversionAnalysis.ipynb to generate Fig. 4A and Fig. S5
Run calculatePopulationStats.sh then generateDiversityPlots.ipynb to generate Fig. 4B and Fig. S6
Use run_mia.sh to run mia as described in Methods
Run filterMitoCoverage.ipynb to filter consensus mitochondrial sequences
Run MUSCLE and RAxML as described in Methods to create a tree of mitochondrial sequences
Run mitoTree.ipynb to generate Fig. S3
Run runMosdepth.sh then plotDepthOfCoverage.ipynb to generate Fig. S1

Data and source code for: Recent adaptation in a threatened salmonid revealed by museum genomics
