Data and source code for: Recent adaptation in a threatened salmonid revealed by museum genomics

Cite this dataset

Sharo, Andrew et al. (2024). Data and source code for: Recent adaptation in a threatened salmonid revealed by museum genomics [Dataset]. Dryad. https://doi.org/10.5061/dryad.g79cnp5z9

Abstract

Steelhead/rainbow trout (Oncorhynchus mykiss) is an imperiled salmonid with two main life history strategies: migrate to the ocean or remain in freshwater. Domesticated hatchery forms of this species have been stocked into almost all California waterbodies, possibly resulting in introgression into natural populations and altered population structure. 

We compared whole-genome sequence data from contemporary populations against a set of museum population samples of steelhead from the same locations that were collected prior to most hatchery stocking. 

We observed minimal introgression and few steelhead-hatchery trout hybrids despite a century of extensive stocking. Our historical data show signals of introgression with a sister species and indications of an early hatchery facility. Finally, we found that migration-associated haplotypes have become less frequent over time, a likely adaptation to decreased opportunities for migration. Since contemporary migration-associated haplotype frequencies have been used to guide species management, we consider this to be a rare example of shifting baseline syndrome that has been validated with historical data. 

We suggest cautious optimism that a century of hatchery stocking has had minimal impact on California steelhead population genetic structure, but we note that continued shifts in life history may lead to further declines in the ocean-going form of the species. 

README: Data and source code for: Recent adaptation in threatened steelhead trout revealed by museum genomics

https://doi.org/10.5061/dryad.g79cnp5z9

This directory contains the data and code to reproduce analyses in the manuscript "Recent adaptation in a threatened salmonid revealed by museum genomics." Most of these data are input to or output from methods commonly used in conservation genetics. These data are largely derived from low-coverage whole genome sequencing of 75 historical steelhead trout, 75 contemporary steelhead trout, and 50 contemporary hatchery rainbow trout. Please see the manuscript for additional details.

Modern and historical natural populations:

  • Eel River
  • Coyote Creek
  • San Lorenzo River
  • Llagas Creek
  • Nacimiento River

Hatchery strains:

  • Coleman
  • Eagle Lake
  • Kamloops
  • Mt. Shasta
  • Mt. Whitney

Usage notes

All files can be opened with a standard text editor. Some files are associated with specific programs, which are noted in the file description. All compressed files can be uncompressed with gzip.
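As a minimal sketch of the note above, the helper below reads either a plain or a gzip-compressed text file using only the Python standard library. The names `open_text` and `head` are hypothetical and are not part of the deposited code; compression is detected from the file suffix alone.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading.

    Hypothetical helper, not part of the deposited code.
    Compression is detected from the ".gz" suffix only.
    """
    path = Path(path)
    if path.suffix == ".gz":
        # "rt" makes gzip decode bytes to str, like the built-in open().
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def head(path, n=5):
    """Return the first n lines of a file without reading the rest."""
    lines = []
    with open_text(path) as handle:
        for i, line in enumerate(handle):
            if i >= n:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Calling `head` on one of the larger .gz files is a quick way to check its column layout before loading it in full.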

File types

  • .txt / .txt.gz: These are regular (.txt) or gzip-compressed (.txt.gz) text files that can be opened using a text editor. They often serve as input for programs.
  • .pestPG: These are sliding window nucleotide diversity estimates created by angsd. They are all formatted as described in https://popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests. Briefly, each is a tab-separated file with 14 columns. The first column contains information about the region. The second and third columns are the reference name and the center of the window. The next 5 columns are 5 different estimators of theta: Watterson, pairwise, FuLi, fayH, and L. The next 5 columns are 5 different neutrality test statistics: Tajima's D, Fu and Li's F, Fu and Li's D, Fay's H, and Zeng's E. The final column is the effective number of sites with data in the window. They can be opened using a text editor.
  • .csv: These are comma-separated values files that often serve as input to other programs. They can be opened using a text editor.
  • .Q: These are output files from admixture (https://dalexander.github.io/admixture/). Each column is one cluster. These files can be opened using a text editor.
  • .fna / .fna.gz: These are regular (.fna) or gzip-compressed (.fna.gz) nucleotide fasta files (https://en.wikipedia.org/wiki/FASTA_format). These are a standard format for storing genomic data. They can be opened using a text editor.
  • .gff: These are general feature format files, a standard format for storing genomic features. Each row records a distinct feature. They can be opened using a text editor.
  • .evalout: These are output files from SmartPCA (https://github.com/chrchang/eigensoft/blob/master/POPGEN/README) that record the eigenvalues for each principal component. They can be opened using a text editor.
  • .evecout: These are output files from SmartPCA (https://github.com/chrchang/eigensoft/blob/master/POPGEN/README) that record the eigenvectors for each sample for the first 10 principal components. They can be opened using a text editor.
  • .logfile: These are output files created by EIGENSOFT software (https://github.com/DReichLab/EIG). They can be opened using a text editor.
  • .parfile: These are parameter files that are used as input to EIGENSOFT software (https://github.com/DReichLab/EIG). These files provide arguments to the software; the arguments are defined in https://github.com/DReichLab/AdmixTools/blob/master/README.3PopTest. They can be opened using a text editor.
  • .ind: These are custom format files that are used as input to EIGENSOFT software and link each sample to its population. They can be opened using a text editor.
  • .qp3Pop: These are custom format files that are used as input to EIGENSOFT software. They can be opened using a text editor.
  • .map: These are custom format files used as input to PLINK software (https://www.cog-genomics.org/plink/1.9/formats#map). The format of this file is as follows: A text file with no header line, and one line per variant with the following 3-4 fields: 1) Chromosome code. 2) Variant identifier 3) Position in morgans or centimorgans 4) Base-pair coordinate. They can be opened using a text editor.
  • .nosex: These are custom format files used as input to PLINK software. This is simply a list of samples with ambiguous sex codes. They can be opened using a text editor.
  • .ped: These are custom format files used as input to PLINK software. The format of this file is as follows: Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are 1) Family ID ('FID') 2) Within-family ID ('IID'; cannot be '0') 3) Within-family ID of father ('0' if father isn't in dataset) 4) Within-family ID of mother ('0' if mother isn't in dataset) 5) Sex code ('1' = male, '2' = female, '0' = unknown) 6) Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control). The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on. They can be opened using a text editor.
  • .raxml: These are output files from RAxML (https://cme.h-its.org/exelixis/web/software/raxml/) and are formatted as a Newick file described in https://en.wikipedia.org/wiki/Newick_format. They can be opened using a text editor or any tree-reading software.
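The 14-column .pestPG layout described above can be read with the standard library alone. The following is a sketch, not the authors' code: the column labels in `PESTPG_COLUMNS` are our own names for the columns as described, and the assumption that a header line starts with "#" should be checked against the actual files. Dividing the pairwise estimator by the number of sites is a common way to express diversity per site.

```python
import csv

# Our own labels for the 14 columns described above (assumed names,
# not the header text that angsd writes).
PESTPG_COLUMNS = [
    "region", "chrom", "win_center",
    "theta_watterson", "theta_pairwise", "theta_fuli", "theta_fayh", "theta_l",
    "tajima_d", "fuli_f", "fuli_d", "fay_h", "zeng_e",
    "n_sites",
]


def read_pestpg(handle):
    """Yield one dict per window from an open .pestPG text handle."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        # Skip blank lines and a header line, assumed to start with "#".
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(PESTPG_COLUMNS):
            raise ValueError(f"expected 14 columns, got {len(row)}")
        record = dict(zip(PESTPG_COLUMNS, row))
        record["win_center"] = int(record["win_center"])
        record["n_sites"] = int(record["n_sites"])
        # Columns 4 to 13 hold the theta estimators and test statistics.
        for name in PESTPG_COLUMNS[3:13]:
            record[name] = float(record[name])
        yield record


def per_site_pi(record):
    """Pairwise theta divided by the number of sites with data."""
    if record["n_sites"] == 0:
        return None
    return record["theta_pairwise"] / record["n_sites"]
```

Pass `read_pestpg` a handle opened with `gzip.open(path, "rt")` if the file is compressed on disk.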

Description of the data and file structure

Directory of data present:

Input data

  • all200Trout.mafs.genmap.repeatMasked.thinned.mafs.tsv.gz - a compressed .tsv file that contains the output of angsd, containing variable sites (after filtering). Columns are chromosome, position, major allele, minor allele, reference allele.
  • GCF_013265735.2_USDA_OmykA_1.1_genomic.fna.gz - gzip compressed fasta file that contains the OmykA_1.1 nuclear genome
  • MitoOnly.GCF_013265735.2_USDA_OmykA_1.1_genomic.fna - mitochondrial reference genome, in fasta format
  • plink.pruned.map - plink map file used in f3, d-stat, and pca analysis. Columns are chromosome, variant id, position in centimorgans, position in basepairs.
  • plink.pruned.nosex - plink sample list file used in f3, d-stat, and pca analysis. Columns are sample position, sample ID
  • plink.pruned.ped - plink ped file used in f3, d-stat, and pca analysis. Formatted as described in https://www.cog-genomics.org/plink/1.9/formats#ped
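To illustrate the PLINK text formats used by plink.pruned.map and plink.pruned.ped, here is a small sketch that parses one line of each and checks the 2V+6 field count. The function names are hypothetical; the field order follows the format description in the File types section.

```python
def parse_map_line(line):
    """Parse one .map line into (chrom, variant_id, cm, bp).

    A line has 3 or 4 whitespace-separated fields; with 3 fields the
    genetic-position column is absent and is returned as None.
    """
    fields = line.split()
    if len(fields) == 4:
        chrom, variant_id, cm, bp = fields
        return chrom, variant_id, float(cm), int(bp)
    if len(fields) == 3:
        chrom, variant_id, bp = fields
        return chrom, variant_id, None, int(bp)
    raise ValueError(f"expected 3 or 4 fields, got {len(fields)}")


def parse_ped_line(line, n_variants):
    """Parse one .ped line into (sample_info, genotypes).

    n_variants is the number of lines in the matching .map file.
    """
    fields = line.split()
    expected = 2 * n_variants + 6
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    keys = ["fid", "iid", "father", "mother", "sex", "phenotype"]
    sample_info = dict(zip(keys, fields[:6]))
    alleles = fields[6:]
    # Consecutive pairs of allele calls belong to one variant ('0' = no call).
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return sample_info, genotypes
```

Counting the lines of the .map file first gives the `n_variants` value that every .ped line can then be validated against.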

Output data

Sliding window nucleotide diversity estimates

These files are sliding window nucleotide diversity estimates, and include all samples (including hybrids). They are all formatted as described in https://popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests

File Naming Conventions

Rather than write out the 30 files included in this category, I will describe how to interpret the file names. These files are named "[population][historical or modern (optional)].noTrans.[noHybrids (optional)].thetasWindows.gz.pestPG" where:

  • "population" is one of the above-described natural populations or hatchery strains, often abbreviated
  • "historical or modern" is either His or Mod or absent for hatchery strains as all hatchery strains are modern.
  • "noTrans" indicates that nucleotide transitions are not included
  • "noHybrids" indicates that hybrid samples are not included
  • For example, nacimientoHis.noTrans.noHybrids.thetasWindow.gz.pestPG describes the sliding window nucleotide diversity estimates for the historical Nacimiento River population, without nucleotide transitions, and without hybrids. 
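The naming convention above can be decoded programmatically. The sketch below is one possible reading of it, with a hypothetical function name. It accepts both "thetasWindow" and "thetasWindows", because both spellings appear in this README; which one the deposited files use should be checked against the directory listing.

```python
import re

# Pattern for "[population][His|Mod].noTrans[.noHybrids].thetasWindow(s).gz.pestPG".
PESTPG_NAME = re.compile(
    r"^(?P<population>.+?)"
    r"(?P<period>His|Mod)?"
    r"\.noTrans"
    r"(?P<no_hybrids>\.noHybrids)?"
    r"\.thetasWindows?\.gz\.pestPG$"
)

PERIOD_LABELS = {"His": "historical", "Mod": "modern"}


def parse_pestpg_name(filename):
    """Split a diversity file name into its labeled parts.

    Returns None if the name does not follow the convention. "period" is
    None when no His/Mod suffix is present (hatchery strains).
    """
    match = PESTPG_NAME.match(filename)
    if match is None:
        return None
    return {
        "population": match.group("population"),
        "period": PERIOD_LABELS.get(match.group("period")),
        "hybrids_removed": match.group("no_hybrids") is not None,
    }
```

For the example name given above, this returns the population "nacimiento", the period "historical", and `hybrids_removed` set to True.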

Isolation by distance plot data

These files all include the x (marine distance) and y (genetic similarity) data for isolation by distance (ibd) plots

  • historicalxyIBDNew.csv - x and y positions for historical ibd plot
  • modernNoWeirdxyIBDNew.csv - x and y positions for modern ibd plot (the "NoWeird" subset)
  • modernWithWeirdxyIBDNew.csv - x and y positions for modern ibd plot

Principal Component Analysis

These files were all generated using SmartPCA (https://github.com/chrchang/eigensoft/blob/master/POPGEN/README)

  • newVarsNewNacimWithGuntherThinCutthroat.4cut.evalout - eigenvalue output from smartpca, for all samples
  • newVarsNewNacimWithGuntherThinCutthroat.4cut.evecout - eigenvector output from smartpca, for all samples
  • newVarsNewNacimWithGuntherThinCutthroatInversionOnly.evalout - eigenvalue output from smartpca, for all samples, only Omy05 inversion
  • newVarsNewNacimWithGuntherThinCutthroatInversionOnly.evecout - eigenvector output from smartpca, for all samples, only Omy05 inversion
  • newVarsNewNacimWithGuntherThinCutthroat.nohybrids.evalout - eigenvalue output from smartpca, for non-hybrid samples
  • newVarsNewNacimWithGuntherThinCutthroat.nohybrids.evecout - eigenvector output from smartpca, for non-hybrid samples

F3 and D statistics

These files were all generated using qp3Pop (https://github.com/DReichLab/AdmixTools/blob/master/README.3PopTest) or qpDstat (https://github.com/DReichLab/AdmixTools/blob/master/README.Dstatistics)

  • newVarsNewNacimWithGuntherThinCutthroat.f3.pops.logfile - f3 stats for modern and historical, no hybrids
  • newVarsNewNacimWithGuntherThinCutthroat.f3.subset.allTogether.logfile - f3 stats for modern populations with hybrids
  • newVarsNewNacimWithGuntherThinCutthroat.f3.subset.allTogether.parfile - parameter file for f3 analysis
  • newVarsNewNacimWithGuntherThinCutthroat.weirdNacim.d.logfile - qpDstat d-statistic results
  • newVarsWithGuntherThinCutthroat.f3.historical.logfile - f3 stats for individual pairs of historical samples
  • newVarsWithGuntherThinCutthroat.f3.historical.parfile - parameter file for f3 analysis
  • newVarsWithGuntherThinCutthroat.f3.modernAll.logfile - f3 stats for individual pairs of modern samples
  • newVarsWithGuntherThinCutthroat.f3.modernAll.parfile - parameter file for f3 analysis
  • newVarsWithGuntherThinCutthroat.f3.modernAll.pops.parfile - parameter file for f3 analysis
  • newVarsWithGuntherThinCutthroat.f3.modernNotHybrid.logfile - f3 stats for individual pairs of modern samples without hybrids
  • newVarsWithGuntherThinCutthroat.f3.modernNotHybrid.parfile - parameter file for f3 analysis

Admixture analysis

Mitochondrial analysis

Miscellaneous

  • ancient.submat.txt - nucleotide substitution matrices for ancient DNA, where columns and rows are both A,C,G,T.
  • colorsToModify.csv - colors used with samples in certain plots, columns are sample ID, population, hex color
  • GCF_013265735.2_USDA_OmykA_1.1_angsd_region_file.txt - angsd region file that includes variable sites (after filtering). Columns are chromosome, start position, end position
  • GCF_013265735.2_USDA_OmykA_1.1_genomic.fna.out.gff - output of repeat masker for OmykA_1.1 genome in standard GFF format
  • GCF_013265735.2_USDA_OmykA_1.1_genomic.genmap.txt.gz - gzip compressed output of GenMap (https://github.com/cpockrandt/genmap) for OmykA_1.1 genome. Each chromosome is indicated by ">" character, and the following line is the mappability for each position in that chromosome.
  • newVarsNewNacimWithGuntherThinCutthroat.pileupCaller.ref.pops.ind - key for matching samples and populations
  • newVarsNewNacimWithGuntherThinCutthroat.pileupCaller.ref.pops.pca.ind - additional key for matching samples and populations
  • newVarsNewNacimWithGuntherThinCutthroat.pops.qp3Pop - input trios of samples for qp3Pop
  • newVarsNewNacimWithGuntherThinCutthroat.pops.together.qp3Pop - input trios of samples for qp3Pop
  • newVarsWithGuntherThinCutthroat.qp3pop.historicalPairwise.txt - input trios of samples for qp3Pop
  • newVarsWithGuntherThinCutthroat.qp3pop.modernAllPairwise.txt - input trios of samples for qp3Pop
  • newVarsWithGuntherThinCutthroat.qp3pop.modernNotHybridPairwise.txt - input trios of samples for qp3Pop
  • riverDistancesWithSFBayKm.csv - river distances used in isolation by distance plots. Columns are river 1, river 2, marine distance between rivers 1 and 2.
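As an illustration of how the three-column layout of riverDistancesWithSFBayKm.csv could be used, the sketch below builds a symmetric lookup of marine distances between river pairs. The function name is hypothetical, and whether the deposited file has a header row is an assumption controlled by the `has_header` argument.

```python
import csv


def load_marine_distances(handle, has_header=True):
    """Build a symmetric lookup {(river_a, river_b): distance}.

    Assumes three columns per row: river 1, river 2, marine distance.
    The presence of a header row is an assumption; pass has_header=False
    if the file starts directly with data.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)
    distances = {}
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        river_a, river_b = row[0].strip(), row[1].strip()
        distance = float(row[2])
        # Store both orderings so lookups do not depend on argument order.
        distances[(river_a, river_b)] = distance
        distances[(river_b, river_a)] = distance
    return distances
```

Storing both orderings keeps later pairwise comparisons simple, since each pair of rivers may be listed only once in the file.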

Code/Software

This analysis was performed with both shell scripts and Jupyter notebooks. We recommend running these files in a Linux environment and in Jupyter, which can easily be installed using conda. The Jupyter notebooks should function in a Python 3.8 environment. If you have any questions about running these files, please contact sharo@berkeley.edu or carlosjg@ucsc.edu.

The following Python packages are required to run the Jupyter notebooks (gzip is part of the Python standard library and needs no separate installation):

gzip
matplotlib
mantel
numpy
pandas
seaborn
toytree
toyplot
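
Before opening the notebooks, it may help to confirm that these packages are importable in the active environment. The following is a small sketch using only the standard library; it assumes each package is imported under the same name as listed above.

```python
import importlib.util

# Names as listed above; gzip ships with Python itself.
REQUIRED = [
    "gzip", "matplotlib", "mantel", "numpy",
    "pandas", "seaborn", "toytree", "toyplot",
]


def missing_modules(names=REQUIRED):
    """Return the names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```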

We note that for many of these scripts, our paths to certain programs and files are hardcoded. You should review the scripts and ensure that these paths are replaced with the location of the program or file on your system.
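
One way to locate hardcoded paths before running a script is to scan its text for absolute-looking paths. The sketch below is a rough heuristic with a hypothetical function name: it reports strings that start with "/" and contain at least one further "/", and it skips URLs and the shebang line. It will miss relative paths and single-component paths such as "/tmp".

```python
import re

# An absolute-looking path: "/" not preceded by a word character, ".",
# "/" or ":" (which excludes URLs), followed by one or more "dir/" parts.
ABS_PATH = re.compile(r"(?<![\w./:])/(?:[\w.+-]+/)+[\w.+-]*")


def find_hardcoded_paths(text):
    """Return (line_number, path) pairs for absolute-looking paths."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#!"):
            continue  # skip the shebang line
        for match in ABS_PATH.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

Running this over the contents of each shell script gives a checklist of lines to review and edit for your own system.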

To reproduce the analyses from the manuscript, we recommend running the scripts and notebooks in the order below. However, if you are interested in a specific analysis, it may be possible to run only the associated script, as we have provided most intermediate files. The file names for intermediate files are identical to those provided in the Dryad repository, but you will need to replace the paths with the location of these Dryad files on your system.

  1. Use process_and_align_historical_reads.sh to align reads from historical samples
  2. Use process_and_align_contemporary_reads.sh to align reads from contemporary samples
  3. Use angsd as described in Methods to identify single nucleotide variants with a minor allele frequency of 5% or greater
  4. Run GenMap as described in Methods
  5. Run maskGenMap.ipynb to mask variants based on GenMap output
  6. Run RepeatMasker as described in Methods
  7. Run addRepeatMasking.ipynb to mask variants for repeats and thin variants
  8. Run removeReferenceBias.sh to reduce reference bias using Günther and Nettelblad (2019) scripts
  9. Run performFstatsAndPCA.sh to call pseudo-haplotypes with pileupcaller, perform f-statistics, d-statistics, and pca
  10. Run generatePCA.ipynb to generate Fig. 1D and Fig. 2C
  11. Run generateAdxmiturePlots.ipynb to generate Fig. 1E and Fig. S2
  12. Run ibdPopulationComparisons.ipynb then ibdIndividualComparisons.ipynb to generate Figs. 2A,B
  13. Run analyzeIntrogression.ipynb to generate Fig. 3B and Fig. S4
  14. Run performPCAInversionOnly.sh to perform pca analysis for just the inversion on chr Omy05
  15. Run inversionAnalysis.ipynb to generate Fig. 4A and Fig. S5
  16. Run calculatePopulationStats.sh then generateDiversityPlots.ipynb to generate Fig. 4B and Fig. S6
  17. Use run_mia.sh to run mia as described in Methods
  18. Run filterMitoCoverage.ipynb to filter consensus mitochondrial sequences
  19. Run MUSCLE and RAxML as described in Methods to create a tree of mitochondrial sequences
  20. Run mitoTree.ipynb to generate Fig. S3
  21. Run runMosdepth.sh then plotDepthOfCoverage.ipynb to generate Fig. S1

Funding

National Science Foundation, Award: 2109912