Conserved islands of divergence associated with adaptive variation in sockeye salmon are maintained by multiple mechanisms
Data files
Sep 06, 2023 version files 24.17 GB
- EuclideChristensen_merged_miss.0.8.vcf.gz
- LG1-29_output.recalibrated.filtered.1.no-indels.no-fail.biallele.miss.0.9.maf.0.05.ab.0.2.vcf.gz
- LG1-29_output.recalibrated.filtered.1.no-indels.no-fail.biallele.miss.0.9.maf.0.05.ab.0.2.vcf.idx
- LG1-29_output.recalibrated.filtered.vcf.gz
- LG1-29_output.recalibrated.filtered.vcf.idx
- MERGED_SOCKEYE_METADATA.tsv
- README.md
- SOCKEYE_METADATA.tsv
Abstract
Local adaptation is facilitated by loci clustered in relatively few regions of the genome, termed genomic islands of divergence. The mechanisms that create and maintain these islands and how they contribute to adaptive divergence are an active research topic. Here, we use sockeye salmon as a model to investigate both the mechanisms responsible for creating islands of divergence and the patterns of differentiation at these islands. Previous research suggested that multiple islands contributed to adaptive radiation of sockeye salmon. However, the low-density genomic methods used by these studies made it difficult to fully elucidate the mechanisms responsible for islands and connect genotypes to adaptive variation. We used whole genome resequencing to genotype millions of loci to investigate patterns of genetic variation at islands and the mechanisms that potentially created them. We discovered 64 islands, including 16 clustered in four genomic regions shared between two isolated populations. Characterization of these four regions suggested that three were likely created by structural variation, while one was created by processes not involving structural variation. All four regions were small (< 600 kb), suggesting low recombination regions do not have to span megabases to be important for adaptive divergence. Differentiation at islands was not consistently associated with established population attributes. In sum, the landscape of adaptive divergence and the mechanisms that create it are complex; this complexity likely helps to facilitate fine-scale local adaptation unique to each population.
README: Conserved islands of divergence associated with adaptive variation in sockeye salmon are maintained by multiple mechanisms
The files in this repository include genotype data in variant call format (VCF) for all sockeye salmon genotyped for the associated publication of the same name, published in Molecular Ecology in 2023.
Description of the data and file structure
VCF file descriptions
LG1-29_output.recalibrated.filtered.vcf.gz includes the raw genotype calls for all possible variants identified by GATK4. These data were subsequently recalibrated using known SNP variants and filtered (see manuscript for filtering details), resulting in the file LG1-29_output.recalibrated.filtered.1.no-indels.no-fail.biallele.miss.0.9.maf.0.05.ab.0.2.vcf.gz, which was used for downstream analysis. Variant index files (.idx) are also included for both files.
To assess rangewide consistency in haplotypes, data were merged with existing sockeye salmon whole genome sequencing data from Christensen et al., 2020 (https://doi.org/10.1371/journal.pone.0240935). The merged dataset can be found in EuclideChristensen_merged_miss.0.8.vcf.gz. A condensed version of the metadata for all samples in the merged file can be found in MERGED_SOCKEYE_METADATA.tsv.
SOCKEYE_METADATA.tsv
Sampling metadata for the 189 individuals sequenced can be found in the file SOCKEYE_METADATA.tsv. Sample IDs in the VCF files are in the format "FID_IID".
FID = location ID
IID = Individual ID number
VIAL = Tissue sample ID number
DRAINAGE = River drainage that the individual was captured in
SPAWNING_TYPE = Description of either lake type or sea/river type spawning run.
SPAWNING_HABITAT = spawning site habitat. Corresponds to a general description of the water and surrounding substrate
LAT = Sample site latitude in decimal degrees
LON = Sample site longitude in decimal degrees
collection = collection ID number referring to the name of the sample collection venture.
sex = identified sex (not collected for all samples)
len = length of individual (mm, not collected for all samples)
depth = body depth of individual (mm, not collected for all samples)
depth_div_len = depth of fish divided by length of fish (not collected for all samples)
DATE = Sampling date
NOTES = Additional notes and comments about samples
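The "FID_IID" sample ID convention and the depth_div_len field described above can be illustrated with a short Python sketch. It assumes the column headers in SOCKEYE_METADATA.tsv match the field names listed here (FID, IID, len, depth) and that uncollected measurements are left blank; the two demo rows and the `ratio_check` key are hypothetical and only serve the illustration.

```python
import csv
import io


def sample_id(fid, iid):
    """Build the VCF sample ID ("FID_IID") from the FID and IID columns."""
    return f"{fid}_{iid}"


def load_metadata(handle):
    """Index metadata rows by VCF sample ID and recompute depth / length."""
    rows = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # len and depth were not collected for all samples, so guard the division.
        try:
            row["ratio_check"] = float(row["depth"]) / float(row["len"])
        except (KeyError, TypeError, ValueError, ZeroDivisionError):
            row["ratio_check"] = None
        rows[sample_id(row["FID"], row["IID"])] = row
    return rows


# Hypothetical two-row table for illustration only (not real data).
demo = "FID\tIID\tlen\tdepth\nSITE1\t001\t500\t150\nSITE1\t002\t\t\n"
meta = load_metadata(io.StringIO(demo))
print(sorted(meta))
print(meta["SITE1_001"]["ratio_check"], meta["SITE1_002"]["ratio_check"])
```

To use the real file, pass `open("SOCKEYE_METADATA.tsv", newline="")` instead of the in-memory demo table.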
MERGED_SOCKEYE_METADATA.tsv
sampleName = Sample ID as listed in VCF file.
drainage = Name of the major river drainage where the individual was sampled.
type = organism name
pop = local river/drainage population code
pop2 = broad scale/major drainage population code
All files follow standard GATK file structure: https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format
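Because the files follow the standard VCF layout, sample IDs can be read from the `#CHROM` header line, where the first nine columns are fixed and every later column is a sample. The following is a minimal sketch in plain Python; the demo lines, contig name, and sample names are hypothetical.

```python
def vcf_sample_names(lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # Columns 1-9 are CHROM..FORMAT; the remaining columns are samples.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached the data records without seeing a header line
    raise ValueError("no #CHROM header line found")


# Hypothetical miniature VCF for illustration only.
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSITE1_001\tSITE1_002\n",
    "LG1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n",
]
print(vcf_sample_names(demo))
```

For the compressed files, `gzip.open(path, "rt")` from the standard library yields text lines that can be passed to the same function.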
Sharing/Access information
Rangewide sockeye salmon data used in the publication were previously published:
Christensen et al. 2020 (https://doi.org/10.1371/journal.pone.0240935)
Data (NCBI: PRJNA530256): https://www.ncbi.nlm.nih.gov/bioproject/PRJNA530256/
Data (NCBI: PRJNA1006708): https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1006708/
Methods
Sampling design
We resequenced genomes of sockeye salmon from seven populations in Southwest Alaska, USA (these samples are a subset of those analyzed in Larson et al., 2019). Fin clips from 27 individuals per population (189 individuals total) were obtained from three lake-type spawning populations in each of the Kvichak River and Wood River drainages as well as one putatively ancestral sea/river population in the Nushagak River drainage. Lake-type samples were further subdivided into the following groups based on spawning habitat: mainland beaches, island beaches, creeks, and rivers. Mainland and island beaches are similar except that island beaches are found in the middle of lakes, where they are highly affected by wind and wave action (Stewart et al., 2003). Creeks are narrow (< 5 m wide) and shallow (< 0.5 m deep on average) while rivers are wide (> 30 m wide), deep (> 0.5 m deep), and fast flowing (Quinn et al., 2001). All samples were collected from spawning adults by Alaska Department of Fish and Game between 1999 and 2013 and provided as extracted DNA (extracted with Qiagen DNeasy Blood and Tissue Kits, Hilden, Germany).
Whole genome library preparation and sequencing
Libraries were prepared according to Baym et al. (2015) and Therkildsen and Palumbi (2017) with the following modifications. Input DNA was normalized to 10 ng for each individual. Steps for 96-well AMPure XP (Beckman Coulter; Brea, CA) purification; product quantification, normalization, and pooling; and size selection were replaced with a SequalPrep (ThermoFisher Scientific, Waltham, MA, USA) normalization and pooling protocol, similar to that used in GT-seq (Campbell et al., 2015). We used three SequalPrep plates for each of the two 96-well tagmented and adaptor-ligated DNA library plates and pooled the full eluate per individual DNA library to increase total yield. Normalized pooled libraries were subjected to a 0.6X size selection, purification, and volume concentration with AMPure XP following Therkildsen and Palumbi (2017). In-house QC consisted of visualization on a precast 2% agarose E-Gel (ThermoFisher Scientific) and quantification with a Qubit HS dsDNA Assay Kit (ThermoFisher Scientific). We constructed two libraries, each containing 96 individuals, and each of these libraries was sequenced on three NovaSeq S4 lanes (six lanes total) at Novogene (Sacramento, CA, USA).
Genotype calling and quality control
Variants and genotypes were called using the Genome Analysis Toolkit (GATK) version 4.1.7 (DePristo et al., 2011; McKenna et al., 2010) and a protocol that closely followed Christensen et al. (2020). Paired-end reads were aligned to the sockeye salmon genome (GCF_006149115.2; Christensen et al., 2020) with BWA MEM v.0.7.17 (Li, 2013), and the alignments were sorted and indexed with Samtools v.1.10 (Li et al., 2009). Next, read groups for each alignment file (bam file) were assigned using Picard v2.22.6 (AddOrReplaceReadGroups; http://broadinstitute.github.io/picard). Individual bam files produced on separate sequencing lanes were merged, and PCR duplicates were marked using the MarkDuplicates function from Picard with stringency set to “LENIENT”. Individual genomic VCF files (gvcf) were generated from alignments using HaplotypeCaller from GATK. A single database containing all individual gvcf files was created using GenomicsDBImport from GATK. Once the variants from all individuals had been added to the database, joint genotyping was conducted using the GenotypeGVCFs function. The resulting variant file (vcf) was then hard filtered using the VariantFiltration function (filter expression = QD < 2.0 || FS > 60.0 || SOR < 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0). All variants that passed the hard filter were used in conjunction with three datasets used previously as truth datasets by Christensen et al. (2020) for GATK’s VariantRecalibrator function. The tranches file generated by VariantRecalibrator was subsequently used as the input for the ApplyVQSR function to produce a corrected vcf file, which was then submitted to additional variant filtration in VCFtools v.0.1.16 (parameters: --maf 0.05, --max-alleles 2, --min-alleles 2, --max-missing 0.9, --remove-filtered-all, --remove-indels; Danecek et al., 2011). Finally, loci with an allele balance of less than 0.2 were marked. The resulting vcf file constituted our baseline file for all other analyses and downstream processing.
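The logic of the filtering steps above can be sketched in plain Python. This is an illustration of what the thresholds mean, not the GATK or VCFtools implementation: the first function evaluates the hard-filter expression with the thresholds copied verbatim from the text, and the second mirrors the biallelic, missingness, and minor allele frequency criteria. The function names, the input data structures, and the handling of missing annotations are assumptions made for the example.

```python
def fails_hard_filter(info):
    """True if any clause of the hard-filter expression quoted above is met.

    info: dict of site annotations, e.g. {"QD": 12.1, "FS": 3.2, "SOR": 4.0}.
    """
    checks = [
        ("QD", lambda v: v < 2.0),
        ("FS", lambda v: v > 60.0),
        ("SOR", lambda v: v < 3.0),  # threshold and direction as written in the text
        ("MQRankSum", lambda v: v < -12.5),
        ("ReadPosRankSum", lambda v: v < -8.0),
    ]
    # Assumption: an annotation absent from the record does not trigger its clause.
    return any(name in info and test(info[name]) for name, test in checks)


def passes_site_filters(genotypes, n_alleles, maf=0.05, max_missing=0.9):
    """Biallelic, call-rate, and minor allele frequency checks for one site.

    genotypes: list of (allele, allele) index pairs, or None when missing.
    n_alleles: number of alleles at the site (REF plus ALT alleles).
    """
    if n_alleles != 2:
        return False  # --min-alleles 2 --max-alleles 2
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < max_missing:
        return False  # --max-missing 0.9: at least 90% of genotypes called
    alt_freq = sum(a for g in called for a in g) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq) >= maf  # --maf 0.05


# Hypothetical site: 10 samples, one missing, two heterozygotes.
site = [(0, 1), (0, 1)] + [(0, 0)] * 7 + [None]
print(fails_hard_filter({"QD": 12.1, "FS": 3.2, "SOR": 4.0}))
print(passes_site_filters(site, n_alleles=2))
```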
Creating a merged dataset
Because the islands of divergence we identified were consistent among spatially isolated drainages in Alaska, we hypothesized that these regions may be conserved in other sockeye populations. To test this, we merged the dataset generated in the present study with whole-genome data from 78 sockeye salmon (kokanee excluded) from Christensen et al. (2020). This dataset was sequenced to a similar depth of coverage and was processed using an almost identical GATK4 pipeline. The dataset included 16 spawning populations that we grouped into five drainage regions: Bristol Bay (N = 12 individuals), Fraser/Columbia river basins (N = 47), Gulf of Alaska (N = 8), Northern British Columbia (N = 9), and Russia (N = 2). The variants identified in Christensen et al. (2020) were merged with ours using bcftools v.1.11 (Danecek et al., 2021) by retaining variants that intersected between the two datasets, had a genotyping rate > 80%, and were positioned within one of the refined haploblock regions.
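The three merge criteria (shared variants, genotyping rate > 80%, position within a haploblock region) can be expressed as a short Python sketch. It is not the bcftools workflow itself. It assumes that variants are matched on chromosome, position, and alleles, that the genotyping rate is computed across the combined samples, and that region bounds are inclusive; all names and demo values are hypothetical.

```python
def merge_sites(ours, theirs, regions, min_call_rate=0.8):
    """Keep variants shared by both datasets that meet the merge criteria.

    ours, theirs: dict mapping (chrom, pos, ref, alt) -> list of genotypes,
                  with None marking a missing genotype.
    regions: list of (chrom, start, end) haploblock intervals, inclusive.
    """
    kept = {}
    for key in sorted(ours.keys() & theirs.keys()):  # variants in both datasets
        chrom, pos = key[0], key[1]
        in_region = any(c == chrom and start <= pos <= end
                        for c, start, end in regions)
        if not in_region:
            continue
        genotypes = ours[key] + theirs[key]
        if not genotypes:
            continue
        call_rate = sum(g is not None for g in genotypes) / len(genotypes)
        if call_rate > min_call_rate:
            kept[key] = genotypes
    return kept


# Hypothetical demo data for illustration only.
ours = {
    ("LG1", 100, "A", "G"): ["0/1", "0/0", None],
    ("LG1", 900, "C", "T"): ["0/0", "0/0", "0/1"],
    ("LG2", 150, "G", "A"): ["0/0", "1/1", "0/1"],
}
theirs = {
    ("LG1", 100, "A", "G"): ["0/0", "1/1", "0/1"],
    ("LG1", 900, "C", "T"): ["0/1", "0/0", "0/0"],
}
merged = merge_sites(ours, theirs, regions=[("LG1", 50, 500)])
print(list(merged))
```

In this demo only the variant at LG1:100 is retained: the site at LG1:900 lies outside the region, and the LG2 site is absent from the second dataset.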