History of speciation inferred from genomic analysis of a species complex of north temperate fishes
Data files
May 21, 2025 version files (55 GB)
- MASTER350revisedSNPsH06.vcf (55 GB)
- README.md (15.48 KB)
- Supporting_Information.csv (62.97 KB)
Abstract
Comparative biogeographic analyses have provided key insights into the history of divergence among closely related species. We collected genomic data across much of the range of two salmonid fishes, Arctic char (AC, Salvelinus alpinus) and Dolly Varden (DV, S. malma), that comprise a complex of lineages whose relationships and evolutionary interactions have remained uncertain. A time-calibrated phylogeny indicated reciprocal monophyly of AC and DV, that the species diverged an estimated ~1.4 million years ago, and that eastern Pacific southern Dolly Varden (S. m. lordi) is the basal lineage within DV. Historical and contemporary gene flow was detected between species and between regional groups within species. We found strong evidence for a model of divergence without gene flow between AC and DV, followed by secondary contact about 14,500 years ago with subsequent gene flow. Our geographic and genomic investigation within the AC-DV complex clarifies the origin and inter-relationships of the species and further highlights the North Pacific and Arctic as dynamic areas of evolution of regional faunas.
Our submission consists of two files: (1) a CSV file of supporting information and (2) a VCF file of single-nucleotide polymorphisms.
(1) CSV Supporting information
A file of 21 columns consisting of:
Sample name: the name of each sample corresponding to samples in the VCF file, with the 5 code variables corresponding to:
- Taxon: AC = Arctic char, SC = Stone char, NDV = Northern Dolly Varden, EPSDV = Eastern Pacific Southern Dolly Varden, WPSDV = Western Pacific Southern Dolly Varden, xNDVEPSDV = Admixed NDV and EPSDV, xNDVWPSDV = Admixed NDV and WPSDV, WSC = Whitespotted char, BT = Bull trout, LKTR = Lake trout, BRTR = Brook trout.
- Geographic area: AK = Alaska, CARC = Canadian Arctic, YT = Yukon Territory, ATL = western Atlantic Basin, JP = Japan, RU = Russia, KUR = Kuril Islands, BC = British Columbia, WA = Washington State, NA = Not applicable
- Life history applicable to DV and AC: AF = Allopatric, freshwater resident, AS = Allopatric, sea-run, SF = Sympatric, freshwater resident, NN/AN = not applicable.
- Three letter/number code corresponding to sequencing lane: A01, A02, A04, LBF, BWA, CHI
- Individual sample number
For example: AC_AK_AF_A04_031
This refers to an Arctic char from Alaska that is allopatric (with respect to Dolly Varden) and a permanent resident of freshwater. It was in sequencing lane A04 and is sample number 31 from that population.
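The naming scheme above can be decoded programmatically. The following is a minimal Python sketch, assuming every sample name has exactly five underscore-separated fields as in the example; the function and class names (`parse_sample_name`, `SampleName`) are hypothetical and not part of the dataset.

```python
from typing import NamedTuple

# Taxon codes as listed in this README.
TAXA = {
    "AC": "Arctic char",
    "SC": "Stone char",
    "NDV": "Northern Dolly Varden",
    "EPSDV": "Eastern Pacific Southern Dolly Varden",
    "WPSDV": "Western Pacific Southern Dolly Varden",
    "xNDVEPSDV": "Admixed NDV and EPSDV",
    "xNDVWPSDV": "Admixed NDV and WPSDV",
    "WSC": "Whitespotted char",
    "BT": "Bull trout",
    "LKTR": "Lake trout",
    "BRTR": "Brook trout",
}


class SampleName(NamedTuple):
    taxon: str         # e.g. "AC"
    area: str          # e.g. "AK"
    life_history: str  # e.g. "AF"
    lane: str          # e.g. "A04"
    number: int        # individual sample number


def parse_sample_name(name: str) -> SampleName:
    """Split a sample name such as 'AC_AK_AF_A04_031' into its five codes.

    Assumption: fields are separated by underscores and never contain one.
    """
    parts = name.strip().split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated fields, got {len(parts)}: {name!r}")
    taxon, area, life_history, lane, number = parts
    return SampleName(taxon, area, life_history, lane, int(number))
```

For instance, `parse_sample_name("AC_AK_AF_A04_031").taxon` can be looked up in `TAXA` to recover the full taxon label.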
Five columns (2-6) indicate whether (Y = yes, N = no) the sample was used in each of five analyses (SplitsTree, BEAST/SNAPP, PCA/fastSTRUCTURE/Admixture, Dsuite, Fastsimcoal2)
Columns 7-21 consist of:
The barcode of each sample used in each sequencing run
The barcode assigned to each plate if more than one per sequencing run
Name of the sequencing lane: A01, A02, A04, LBF, BWA, CHI
The taxonomic name of each sample
The original lab sample geographic location code and number
The name of the location code given immediately above
Latitude of sample site (Decimal degrees)
Longitude of sample site (Decimal degrees)
Collection year
Population number code shown in Supplementary table 1
all_reads: Total number of reads after demultiplexing with process_radtags
bam2_mapped: Number of reads mapped to the reference genome with bwa as reported by samtools flagstat
reads mapped with q>20: Number of reads mapped to the genome with mapping quality (MAPQ) of 20 or higher as reported by samtools view -q20
depth q>20: Number of positions in the genome with at least one read mapped with mapping quality of at least 20, obtained by counting the number of lines in the output of command samtools depth -Q 20
FMISS350: Proportion of missing data
Missing data code: NA
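As an illustration of how the Y/N columns might be used, here is a minimal Python sketch that lists the samples flagged for one analysis. It assumes the file has a single header row, that the sample name is in column 1, and that columns 2-6 follow the order in which the analyses are listed above; the exact header text is not assumed, and the function name `samples_used_in` is hypothetical.

```python
import csv

# Assumed 0-based positions of the five Y/N columns (columns 2-6 of the CSV),
# in the order the analyses are listed in this README.
ANALYSIS_COLUMNS = {
    "SplitsTree": 1,
    "BEAST/SNAPP": 2,
    "PCA/fastSTRUCTURE/Admixture": 3,
    "Dsuite": 4,
    "Fastsimcoal2": 5,
}


def samples_used_in(handle, analysis):
    """Return sample names whose flag for `analysis` is 'Y'.

    `handle` is any open text file object for the supporting-information CSV.
    """
    col = ANALYSIS_COLUMNS[analysis]
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    return [
        row[0]
        for row in reader
        if len(row) > col and row[col].strip().upper() == "Y"
    ]


# Example call (requires the CSV to be in the working directory):
# with open("Supporting_Information.csv", newline="") as fh:
#     dsuite_samples = samples_used_in(fh, "Dsuite")
```

If the column order in the file differs from the order assumed here, only `ANALYSIS_COLUMNS` needs to change.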
(2) VCF file (MASTER350revised.recode.vcf):
Single nucleotide polymorphisms in VCF format for 350 individuals from six species of Salvelinus (Pisces) from throughout the North Pacific, Canadian Arctic, and Canadian northwest Atlantic basins. The SNPs were called after aligning raw sequence reads to the Salvelinus malma reference genome (assembly ASM291031v2; Christensen et al., 2018) across 36 non-sex-linked chromosomes.
This file is the dataset used in our article, "History of speciation inferred from genomic analysis of a species complex of north temperate fishes".
The article investigates the history of speciation and gene flow in two species of Salvelinus that have a checkered taxonomic history. It produces a well-supported phylogeny, resolves phylogeographic structure within species, and provides strong evidence for historical introgression as the explanation for the history of taxonomic confusion.
This Variant Call Format (VCF) file contains Single Nucleotide Polymorphism (SNP) data from 350 individuals across 98 sampling locations. The average number of reads of minimum genotype quality of 20 was 6.8 million. The data were generated using genotyping-by-sequencing with the enzyme PstI. We used VCFtools v0.1.11 (Danecek et al., 2011) to estimate the amount of missing data per sample and retained only samples with no more than 70% missing data (average proportion missing of 0.51). We used VCFtools to filter these 350 samples to retain only biallelic SNPs (no indels or sites with more than two segregating bases; N = 7,434,394 retained). Finally, using a custom script (Owens et al., 2016), we eliminated SNPs that showed an observed heterozygosity of 0.6 or higher, as these are likely the result of mapping to paralogous regions of the genome. This resulted in a “master” VCF file with 6,601,213 SNPs that was subject to additional filters depending on the particular analysis (see associated manuscript).
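The site-level filters described above (biallelic SNPs only, observed heterozygosity below 0.6) can be illustrated with a short Python sketch. This is not the custom script of Owens et al. (2016) or the VCFtools implementation; it is a simplified stand-in that assumes diploid genotypes with GT as the first FORMAT field, and the function names are hypothetical.

```python
def observed_heterozygosity(genotypes):
    """Fraction of called diploid genotypes that are heterozygous.

    `genotypes` is an iterable of GT strings such as '0/1', '1|1' or './.'.
    Missing genotypes are ignored; returns 0.0 if nothing is called.
    """
    called = 0
    het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # missing or non-diploid call
        called += 1
        if alleles[0] != alleles[1]:
            het += 1
    return het / called if called else 0.0


def keep_site(vcf_line, max_het=0.6):
    """Return True if a VCF data line is a biallelic SNP with Ho < max_het."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3].upper(), fields[4].upper()
    bases = {"A", "C", "G", "T"}
    # Biallelic SNP: one reference base and exactly one alternate base.
    if ref not in bases or alt not in bases:
        return False
    # GT is assumed to be the first colon-separated value in each sample column.
    gts = [sample.split(":")[0] for sample in fields[9:]]
    return observed_heterozygosity(gts) < max_het
```

Applying `keep_site` line by line to the non-header lines of a VCF would reproduce the logic of these two filters on a small file; for the full 55 GB dataset, dedicated tools such as VCFtools or BCFtools are likely to be far faster.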
File Details
File Name: MASTER350revised.recode.vcf
File Format: VCF (Variant Call Format) Version 4.2
Date: February 4, 2025
Source Software: Stacks v2.41
Data Description
The VCF file includes the following columns standard to the format:
#CHROM: Chromosome number
POS: Position of the SNP on the chromosome
ID: Identifier of the SNP
REF: Reference base
ALT: Alternate base(s)
QUAL: Quality score of the SNP
FILTER: Filter status
INFO: Additional information (e.g., allele frequency, number of samples)
FORMAT: Data format
Sample columns: One per individual, containing genotype information
Specific Fields in INFO
NS: Number of samples with data
AF: Allele frequency
DP: Total depth of reads
Specific Fields in FORMAT
GT: Genotype
AD: Allele depth
DP: Read depth
GQ: Genotype quality
GL: Genotype likelihood
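To show how the INFO and FORMAT fields listed above relate to the text of a record, here is a minimal Python sketch that turns one tab-delimited data line into nested dictionaries. It is an illustration only: values are kept as strings, the function name `parse_record` is hypothetical, and the sample names must be supplied from the `#CHROM` header line.

```python
def parse_record(line, sample_names):
    """Parse one VCF data line into a dictionary.

    INFO becomes {key: value}; each sample column becomes a dictionary
    keyed by the FORMAT fields (e.g. GT, DP, AD, GQ, GL).
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, snp_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs (or bare flags).
    info_dict = {}
    for item in info.split(";"):
        if item and item != ".":
            key, _, value = item.partition("=")
            info_dict[key] = value

    # FORMAT names the colon-separated values found in every sample column.
    format_keys = fields[8].split(":")
    samples = {
        name: dict(zip(format_keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": snp_id,
        "ref": ref,
        "alt": alt,
        "qual": qual,
        "filter": filt,
        "info": info_dict,
        "samples": samples,
    }
```

With a record parsed this way, `record["samples"][name]["GT"]` gives the genotype string for one individual and `record["info"].get("NS")` the number of samples with data.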
How to Open and Analyze the Data
The VCF file can be opened and analyzed using various bioinformatics tools and software, including:
IGV (Integrative Genomics Viewer): Useful for visualizing genomic data.
BCFtools: Command-line tool for working with VCF/BCF files, including filtering, viewing, and converting.
GATK4 (Genome Analysis Toolkit): Provides tools to analyze high-throughput sequencing data with a focus on variant discovery.
vcftools: Command-line program designed to work specifically with VCF files to perform various types of analyses.
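Because the VCF file is about 55 GB, loading it fully into memory is unlikely to be practical. As a minimal Python sketch, assuming the file is uncompressed text, the header and records can instead be streamed line by line; the function names below are hypothetical.

```python
def read_sample_names(path):
    """Return the sample names from the #CHROM header line of a VCF file.

    Only the header is read, so this is quick even for a very large file.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # The first nine columns are fixed; the rest are sample names.
                return line.rstrip("\n").split("\t")[9:]
            break  # reached a data line without seeing the header
    raise ValueError("no #CHROM header line found")


def iter_data_lines(path):
    """Yield VCF data lines one at a time, skipping all header lines."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("#"):
                yield line
```

For routine subsetting or filtering, the dedicated tools listed above remain the more efficient choice; a streaming reader like this is mainly useful for quick inspection or custom summaries.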
Sharing/Access information
All sequencing data from this study is being deposited at NCBI Sequence Read Archive under submission SUB15066945.
Methods
Data Collection
We examined 350 DNA samples from a database of 909 fish acquired over the past 25 years in the context of various independent studies on char evolutionary ecology. Our samples comprised Northern (NDV, S. m. malma) and Southern Dolly Varden (SDV, S. m. lordi) collected from northwestern North America. The SDV samples were collectively referred to as Eastern Pacific Southern Dolly Varden, EPSDV. We also examined AC (S. alpinus erythrinus) from Alaska, Yukon, and east to the eastern Canadian Arctic and Atlantic Canada (see Supporting Information file). Samples included fish from allopatric and sympatric populations as well as both sea-run and freshwater-resident life history types. Our samples also included four specimens of the “stone char”, a phenotypically distinct form of NDV from Kamchatka (see Melnik et al., 2020), five samples of other NDV from Russia (Kamchatka), and 19 samples thought to be S. m. krascheninnikovi, S. m. malma, or hybrids between the two subspecies from the western Pacific (5 from Hokkaido, Japan and 14 from the Kuril Island chain). In Asia, S. m. krascheninnikovi is also referred to as Southern Dolly Varden; here, we refer to it as Western Pacific Southern Dolly Varden (WPSDV). We also included two samples each of coastal and interior lineages of bull trout (S. confluentus, Taylor et al., 1999), and two specimens each of brook trout (S. fontinalis), lake trout (S. namaycush), and white-spotted char (S. leucomaenis) as outgroup taxa. Initial identity of all taxa was based on independent information on geographic distribution, and morphological and genetic (nuclear DNA) data (e.g., Moore et al., 2015; Taylor & May-McNally, 2015; Weinstein et al., 2024a,b).
All fish tissues were collected under UBC Animal Care Committee approved protocols. Total genomic DNA was extracted from tissues using the DNeasy Blood & Tissue kit and eluted in 200 µL of AE buffer.
We used a reduced representation genome sequencing approach, genotyping-by-sequencing (GBS; Elshire et al., 2011), to generate sequence data from a representative fraction of the genome. We used a modified GBS protocol (Alcaide et al., 2014; Toews et al., 2016; Geraldes & Taylor, 2023) to generate pooled libraries of digested and individually barcoded DNA.
The libraries were sequenced on Illumina HiSeq 4000, NovaSeq 6000 SP, and NovaSeq 6000 S4 platforms, always with 150 bp paired-end reads, at the McGill University and Génome Québec Innovation Centre.
Analysis of the sequence data followed a bioinformatics pipeline available at https://doi.org/10.5061/dryad.t951d (Irwin et al., 2016) for GBS read processing and mapping, with modifications detailed below. We used process_radtags from the STACKS v2.5 pipeline (Catchen et al., 2013; Rochette et al., 2019) to demultiplex the raw sequencing reads according to the barcode sequence of each sample (or the combination of barcode and plate sequence) and remove the barcode sequences. The resulting reads for each sample were then trimmed with Trimmomatic-0.39 (Bolger et al., 2014) with options TRAILING:3, SLIDINGWINDOW:4:10, MINLEN:30. We used BWA-MEM (Li & Durbin, 2009) with default settings to align the trimmed reads from each of the Salvelinus samples for which we generated GBS data to what was originally considered the AC (Salvelinus alpinus) reference genome sequence (assembly ASM291031v2; Christensen et al., 2018), but which now has been identified as a specimen of NDV with about 5% admixture from AC (Shedko, 2019; Christensen et al., 2021; authors, unpublished data).
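The trimming and alignment steps described above might be assembled as command lines along the following lines. This Python sketch only builds the argument lists and does not run anything. It assumes paired-end mode for Trimmomatic and two FASTQ inputs for `bwa mem`; all file names and the output naming scheme are hypothetical, and the authors' exact invocations (threads, output handling) are not given in this README.

```python
def trimmomatic_pe_command(jar, r1, r2, out_prefix):
    """Build a Trimmomatic paired-end command with the options stated above."""
    return [
        "java", "-jar", jar, "PE",
        r1, r2,
        # Trimmomatic PE writes paired and unpaired outputs for each read file.
        f"{out_prefix}_R1.paired.fastq.gz",
        f"{out_prefix}_R1.unpaired.fastq.gz",
        f"{out_prefix}_R2.paired.fastq.gz",
        f"{out_prefix}_R2.unpaired.fastq.gz",
        "TRAILING:3", "SLIDINGWINDOW:4:10", "MINLEN:30",
    ]


def bwa_mem_command(reference, r1, r2):
    """Build a BWA-MEM command with default settings (SAM goes to stdout)."""
    return ["bwa", "mem", reference, r1, r2]


# Hypothetical example for one sample; file names are placeholders.
trim_cmd = trimmomatic_pe_command(
    "trimmomatic-0.39.jar",
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample_trimmed",
)
align_cmd = bwa_mem_command(
    "reference.fasta",
    "sample_trimmed_R1.paired.fastq.gz",
    "sample_trimmed_R2.paired.fastq.gz",
)
```

Each list could be passed to `subprocess.run`, with the `bwa mem` output redirected to a SAM file, once the tools and the reference index are available locally.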
References
Alcaide, M., Scordato, E. S. C., Price, T. D., & Irwin, D. E. (2014). Genomic divergence in a ring species complex. Nature, 511(7507), 83–85. https://doi.org/10.1038/nature13285
Bolger, A.M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.
Catchen, J., Hohenlohe, P.A., Bassham, S., Amores, A., & Cresko, W.A. (2013). Stacks: an analysis tool set for population genomics. Mol Ecol, 22, 3124-3140. https://doi.org/10.1111/mec.12354
Christensen, K.A., Rondeau, E.B., Minkley, D.R., Leong, J.S., Nugent, C.M., Danzmann, R.G., Ferguson, M.M., Stadnik, A., Devlin, R.H., Muzzerall, R., & Edwards, M. (2018). The Arctic charr (Salvelinus alpinus) genome and transcriptome assembly. PloS One, 13(9): p.e0204076.
Christensen, K.A., Rondeau, E.B., Minkley, D.R., Leong, J.S., Nugent, C.M., Danzmann, R.G., Ferguson, M.M., Stadnik, A., Devlin, R.H., Muzzerall, R., & Edwards, M. (2021). Retraction: The Arctic charr (Salvelinus alpinus) genome and transcriptome assembly. PLoS ONE 13(9): e0204076. https://doi.org/10.1371/journal.pone.0204076 PMID: 30212580
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., & Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. https://doi.org/10.1093/bioinformatics/btr330
Elshire, R.J., Glaubitz, J.C., Sun, Q., Poland, J.A., Kawamoto, K., Buckler, E.S., & Mitchell, S.E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE, 6(5), e19379.
Geraldes, A., & Taylor, E.B. (2023). Site C fish genomics project. Annual report to British Columbia Hydro Authority: construction year 8 Vancouver, BC. 66 pp + iv Appendices. https://www.sitecproject.com/sites/default/files/fish-genetics-study-2022-annual-report.pdf
Irwin, D. E., Alcaide, M., Delmore, K. E., Irwin, J. H., & Owens, G. L. (2016). Recurrent selection explains parallel evolution of genomic regions of high relative but low absolute differentiation in a ring species. Molecular Ecology, 25(18), 4488–4507.
Li H., & Durbin R. (2009). Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25, 1754-60. [PMID: 19451168]
May‐McNally, S.L., Quinn, T.P., & Taylor, E.B. (2015). Low levels of hybridization between sympatric Arctic char (Salvelinus alpinus) and Dolly Varden char (Salvelinus malma) highlights their genetic distinctiveness and ecological segregation. Ecology and Evolution, 5(15), 3031-3045.
Melnik, N.O., Markevich, G.N., Taylor, E.B., Loktyushkin, A.V., & Esin, E.V. (2020). Evidence for divergence between sympatric stone charr and Dolly Varden along unique environmental gradients in Kamchatka. Journal of Zoological Systematics and Evolutionary Research, 58(4), 1135-1150.
Moore, J.S., Bajno, R., Reist, J.D., & Taylor, E.B. (2015). Post‐glacial recolonization of the North American Arctic by Arctic char (Salvelinus alpinus): Genetic evidence of multiple northern refugia and hybridization between glacial lineages. Journal of Biogeography, 42(11), 2089-2100.
Owens, G.L., Baute, G.J., & Rieseberg, L.H. (2016). Revisiting a classic case of introgression: hybridization and gene flow in Californian sunflowers. Molecular Ecology, 25, 2630–2643. https://doi.org/10.1111/mec.13569
Rochette, N.C., Rivera-Colón, A.G., & Catchen, J.M. (2019). Stacks 2: Analytical methods for paired-end sequencing improve RADseq-based population genomics. Mol Ecol, 28, 4737–4754. https://doi.org/10.1111/mec.15253
Shedko, S. V. (2019). Assembly ASM291031v2 (Genbank: GCA_002910315.2) identified as assembly of the Northern Dolly Varden (Salvelinus malma malma) genome, and not the Arctic char (S. alpinus) genome. 1–15. http://arxiv.org/abs/1912.02474
Taylor, E.B., Pollard, S., & Louie, D. (1999). Mitochondrial DNA variation in bull trout (Salvelinus confluentus) from northwestern North America: implications for zoogeography and conservation. Molecular Ecology, 8(7), 1155-1170.
Taylor, E. B., & May-McNally, S. L. (2015). Genetic analysis of Dolly Varden (Salvelinus malma) across its North American range: evidence for a contact zone in southcentral Alaska. Can. J. Fish. Aquat. Sci., 72(7), 1048-1057.
Toews, D.P., Campagna, L., Taylor, S.A., Balakrishnan, C.N., Baldassarre, D.T., Deane-Coe, P.E., Harvey, M.G., Hooper, D.M., Irwin, D.E., Judy, C.D., & Mason, N.A. (2016). Genomic approaches to understanding population divergence and speciation in birds. The Auk: Ornithological Advances, 133(1), 13-30.
Weinstein, S.Y., Gallagher, C.P., Hale, M.C., Loewen, T.N., Reist, J.D., & Swanson, H.K. (2024a). Gill raker and pyloric caeca counts differ between Arctic char (Salvelinus alpinus) and Dolly Varden (S. malma) populations across their ranges. Journal of Fish Biology, 105, 1327-1332. https://doi.org/10.1111/jfb.15785
Weinstein, S.Y., Gallagher, C.P., Hale, M.C., Loewen, T.N., Power, M., Reist, J.D., & Swanson, H.K. (2024b). An updated review of the post-glacial history, ecology, and diversity of Arctic char (Salvelinus alpinus) and Dolly Varden (S. malma). Environmental Biology of Fishes, 107(1), 121-154.
Genotyping-by-sequencing analysis using the PstI enzyme. Reads were aligned to the Salvelinus sp. reference genome, and the dataset consists of a variant call format (VCF) file with 350 samples that contains biallelic single nucleotide polymorphisms with an observed heterozygosity of less than 0.6 (N = 6,601,213).
