Genome-wide SNP datasets for the non-native pink salmon in Norway
Data files
Feb 05, 2024 version files 78.66 MB
- OGO-3RAD-D2-NEUTRAL-SNPS.vcf
- README.md
Abstract
Effective management of non-indigenous species requires knowledge of their dispersal factors and founder events. We aim to identify the main environmental drivers favouring dispersal events along the invasion gradient and to characterize the spatial patterns of genetic diversity in feral populations of the non-native pink salmon within its epicentre of invasion in Norway. We first conducted species distribution modelling (SDM) using four modelling techniques with varying levels of complexity, which encompassed both regression-based and tree-based machine-learning algorithms, using climatic data from the present to 2050. We then used the triple-enzyme restriction-site associated DNA sequencing (3RADseq) approach to genotype over 30,000 high-quality single-nucleotide polymorphisms (SNPs) to elucidate patterns of genetic diversity and gene flow within the putative invasion hotspot of the pink salmon. We found that temperature- and precipitation-related variables drove pink salmon distributional shifts across its non-native ranges, and that climatically favourable areas will remain stable for the next 30 years. In addition, all SDMs identified north-eastern Norway as the epicentre of the pink salmon invasion, and genomic data revealed minimal variation in genome-wide genetic diversity across the sampled populations in this region. However, using a subset of ‘diagnostic’ SNPs, we observed significant genetic differentiation, ranging from moderate to substantial, and detected four hierarchical genetic clusters concordant with geography. Our findings suggest that fluctuations in extreme climate events associated with ongoing climate change will likely maintain environmental favourability for the pink salmon outside its ‘native’/introduced ranges. Locally invaded rivers are themselves a potential source of invaders in the ongoing secondary spread of pink salmon in Northern Norway.
Our study shows that SDMs and genomic data can reveal species distribution determinants and provide indicators to aid in post-control measures and potential inferences of their success.
README: Genome-wide SNP datasets for the non-native pink salmon in Norway
The complete single nucleotide polymorphism (SNP) dataset underwent several filtering steps, including thinning SNPs to a density of one SNP per kilobase, removing closely related individuals, eliminating candidate paralogous regions of the genome (known as multi-site variants or MSVs), and excluding SNPs within non-chromosomal scaffolds. This resulted in a panel of 43,719 polymorphic SNPs, with a genotyping rate of 0.98 and a sample size of 73 individuals. We then removed all SNPs potentially influenced by selection, as identified by two outlier-detection genome scans. The resulting final dataset consists of 33,860 SNPs considered to be neutral. From this dataset, we derived a subset of 250 'diagnostic' SNPs with the highest locus-specific FST.
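As a minimal sketch of the last step, selecting the loci with the highest locus-specific FST amounts to a top-N ranking. The function name `top_fst_loci`, the locus identifiers and the FST values below are hypothetical and for illustration only; this is not the script used to build the dataset.

```python
import heapq

def top_fst_loci(fst_by_locus, n=250):
    """Return the n locus IDs with the highest FST, in descending order of FST."""
    return heapq.nlargest(n, fst_by_locus, key=fst_by_locus.get)

# Hypothetical per-locus FST estimates (illustration only)
fst = {"locus_a": 0.02, "locus_b": 0.31, "locus_c": 0.12, "locus_d": 0.27}
print(top_fst_loci(fst, n=2))
```

With real data, `fst_by_locus` would hold one FST estimate per SNP in the neutral dataset and `n` would be left at 250.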
Description of the data and file structure
The ‘neutral’ full-SNP dataset: OGO-3RAD-D2-NEUTRAL-SNPS.vcf
NOTE: Population codes are encoded in the sample names within the VCF file: the first three characters of each name are the population code, which is separated from the sample ID by an underscore.
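Given this naming convention, population membership can be recovered directly from the VCF header. The sketch below assumes a standard tab-delimited `#CHROM` header line with nine fixed columns before the sample columns; the sample names shown are hypothetical placeholders, not names from the dataset.

```python
from collections import Counter

def population_counts(chrom_header_line):
    """Count samples per population code from a VCF '#CHROM' header line."""
    fields = chrom_header_line.rstrip("\n").split("\t")
    samples = fields[9:]  # sample columns follow the nine fixed VCF columns
    # The population code is the part of the name before the first underscore
    return Counter(name.split("_", 1)[0] for name in samples)

# Hypothetical header line (illustration only)
header = "\t".join(
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
     "AAA_001", "AAA_002", "BBB_001"]
)
print(population_counts(header))
```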
Methods
3RAD library preparation and sequencing: We prepared RADseq libraries using the Adapterama III library preparation protocol of Bayona-Vásquez et al. (2019; their Supplemental File SI). For each sample, ~40-100 ng of genomic DNA was digested for 1 h at 37 °C in a solution with 1.5 µl of 10x Cutsmart® buffer (NEB®), 0.25 µl of Read 1 enzyme (MspI) at 20 U/µl, 0.25 µl of Read 2 enzyme (BamHI-HF) at 20 U/µl, 0.25 µl of Read 1 adapter dimer-cutting enzyme (ClaI) at 20 U/µl, 1 µl of i5Tru adapter at 2.5 µM, 1 µl of i7Tru adapter at 2.5 µM and 0.75 µl of dH2O. The i5 and i7 adapters were ligated to each sample using a unique combination (2 i5 X 1 i7 indexes). After digestion/ligation, samples were pooled and cleaned with Sera-Mag SpeedBeads (Fisher Scientific™) in a 1.2:1 (SpeedBeads:DNA) ratio, and we eluted the cleaned DNA in 60 µl of TLE. An enrichment PCR of each sample was carried out with 10 µl of 5x Kapa Long Range Buffer (Kapa Biosystems, Inc.), 0.25 µl of KAPA LongRange DNA Polymerase at 5 U/µl, 1.5 µl of dNTP mix (10 mM each dNTP), 3.5 µl of MgCl2 at 25 mM, 2.5 µl of iTru5 primer at 5 µM, 2.5 µl of iTru7 primer at 5 µM and 5 µl of pooled DNA. The temperature conditions for PCR enrichment were 94 °C for 2 min of initial denaturation, followed by 10 cycles of 94 °C for 20 sec, 57 °C for 15 sec and 72 °C for 30 sec, and a final cycle of 72 °C for 5 min. The enriched samples were each cleaned and quantified with a Quantus™ Fluorometer. Cleaned, indexed and quantified library pools were pooled to equimolar concentrations and sent to the Norwegian Sequencing Centre (NSC) for quality control, final size selection using a one-sided bead clean-up (0.7:1 ratio) to capture 550 bp +/- 10% fragments, and paired-end (PE) 150 bp sequencing on one lane each of the Illumina HiSeq 4000 platform.
Data filtering: We filtered genotype data and characterized singleton SNP loci and multi-site variants (MSVs) using filtering procedures and custom scripts available in STACKS Workflow v.2 (https://github.com/enormandeau/stacks_workflow). First, we filtered the ‘raw’ VCF file using the python script 05_filter_vcf_fast.py, keeping only SNPs that (i) showed a minimum depth of four (-m 4), (ii) were called in at least 80% of the samples in each site (-p 80) and (iii) for which at least two samples had the rare allele, i.e., Minor Allele Sample (MAS; -S 2). Second, we excluded samples with more than 20% missing genotypes from the dataset. Third, we calculated pairwise relatedness between samples with the Yang et al. (2010) algorithm and individual-level heterozygosity in vcftools v.0.1.17 (Danecek et al., 2010). Additionally, we calculated pairwise kinship coefficients among individuals using the KING-robust method (Manichaikul et al., 2010) with the R package SNPRelate v.1.28.0 (Zheng et al., 2012). We then estimated genotyping error rates between technical replicates using the software tiger v1.0 (Bresadola et al., 2020). Finally, from each pair of closely related individuals we removed the one with the higher level of missing data, along with samples that showed extremely low heterozygosity (< -0.2) based on graphical inspection of individual-level heterozygosity per sampling population. Fourth, we conducted a secondary dataset filtering step using 05_filter_vcf_fast.py, keeping the above-mentioned data filtering cut-off parameters (i.e., -m = 4; -p = 80; -S = 3). Fifth, we calculated a suite of four summary statistics to discriminate high-confidence SNPs (singleton SNPs) from SNPs exhibiting a duplication pattern (duplicated SNPs; MSVs): (i) median of allele ratio in heterozygotes (MedRatio), (ii) proportion of heterozygotes (PropHet), (iii) proportion of rare homozygotes (PropHomRare) and (iv) inbreeding coefficient (FIS).
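To make the four statistics concrete, the sketch below computes them for a single biallelic locus. It is an illustration under stated assumptions, not the code of 08_extract_snp_duplication_info.py: the allele ratio is assumed to be alternate-allele reads over total reads in heterozygotes, FIS is assumed to be 1 - Hobs/Hexp with Hexp = 2pq, and the function name, input layout and example genotypes are hypothetical.

```python
from statistics import median

def duplication_stats(genotypes):
    """Per-locus statistics from (gt, ref_depth, alt_depth) tuples.

    gt is "0/0", "0/1" or "1/1", or None for a missing genotype.
    """
    nan = float("nan")
    called = [g for g in genotypes if g[0] is not None]
    n = len(called)
    if n == 0:
        return {"MedRatio": nan, "PropHet": nan, "PropHomRare": nan, "FIS": nan}
    hets = [g for g in called if g[0] == "0/1"]
    n_hom_ref = sum(1 for g in called if g[0] == "0/0")
    n_hom_alt = sum(1 for g in called if g[0] == "1/1")
    p_alt = (2 * n_hom_alt + len(hets)) / (2 * n)  # alternate allele frequency
    prop_het = len(hets) / n                       # observed heterozygosity
    # Rare homozygotes are homozygotes for the minor allele
    n_hom_rare = n_hom_alt if p_alt <= 0.5 else n_hom_ref
    h_exp = 2 * p_alt * (1 - p_alt)                # expected heterozygosity
    fis = 1 - prop_het / h_exp if h_exp > 0 else nan
    # Assumed definition: alternate reads / total reads in each heterozygote
    ratios = [alt / (ref + alt) for _, ref, alt in hets if ref + alt > 0]
    return {
        "MedRatio": median(ratios) if ratios else nan,
        "PropHet": prop_het,
        "PropHomRare": n_hom_rare / n,
        "FIS": fis,
    }

# Hypothetical genotypes at one locus (illustration only)
example = [("0/0", 10, 0), ("0/0", 8, 0), ("0/1", 5, 5),
           ("0/1", 6, 2), ("1/1", 0, 9), (None, 0, 0)]
stats = duplication_stats(example)
print(stats)
```

Duplicated loci are generally expected to show an excess of heterozygotes (high PropHet, negative FIS) and allele ratios that deviate from 0.5, which is why these statistics are plotted against each other.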
We calculated each parameter from the filtered VCF file using the python script 08_extract_snp_duplication_info.py. The four parameters calculated for each locus were plotted against each other to visualize their distribution across all loci using the R script 09_classify_snps.R. Based on the methodology of McKinney et al. (2017) and by plotting different combinations of each parameter, we graphically fixed cut-offs for each parameter. Sixth, we used the python script 10_split_vcf_in_categories.py to classify SNPs and generate two separate datasets: the “SNP dataset,” based on SNP singletons only, and the “MSV dataset,” based on duplicated SNPs only, which we excluded from further analyses. Seventh, we post-filtered the SNP dataset by keeping all unlinked SNPs within each 3RAD locus using the 11_extract_unlinked_snps.py script with a minimum difference of 0.5 (-diff_threshold 0.5) and a maximum distance of 1,000 bp (-max_distance 1,000). Finally, we filtered out of the SNP dataset all SNPs located in unplaced scaffolds, i.e., contigs that were not part of the 26 chromosomes of the pink salmon genome.
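The final step, removing SNPs on unplaced scaffolds, can be sketched as a filter on the CHROM column of the VCF. This is a minimal illustration, not the procedure actually used: the function name and the sequence names (`chr1`, `scaffold_99`) are hypothetical, and in practice the set would contain the 26 chromosome names used in the reference genome.

```python
def keep_chromosomal(vcf_lines, chromosomes):
    """Yield VCF header lines plus records whose CHROM is in `chromosomes`."""
    for line in vcf_lines:
        # Header lines start with '#'; CHROM is the first tab-delimited field
        if line.startswith("#") or line.split("\t", 1)[0] in chromosomes:
            yield line

# Hypothetical VCF content (illustration only)
lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1200\t.\tA\tG\t.\tPASS\t.",
    "scaffold_99\t450\t.\tC\tT\t.\tPASS\t.",
]
kept = list(keep_chromosomal(lines, {"chr1"}))
print("\n".join(kept))
```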