Data from: Genomic insights on conservation priorities for North Sea houting and European lake whitefish (Coregonus spp.)

Tengstedt, Aja Noersgaard Buur 1 ; Liu, Shenglin 1 ; W. Jacobsen, Magnus 2 ; Gundlund, Camilla1 ; Rask Møller, Peter 3 ; Berg, Søren2 ; Bekkevold, Dorte2 ; M. Hansen, Michael1

Research facility: Aarhus University

Published Apr 08, 2024; Updated Dec 26, 2024 on Dryad. https://doi.org/10.5061/dryad.qfttdz0r0

Abstract

Population genomics analysis holds great potential for informing conservation of endangered populations. We focused on a controversial case of European whitefish (Coregonus spp.) populations. The endangered North Sea houting is the only coregonid fish that tolerates oceanic salinities and was previously considered a species (C. oxyrhinchus) distinct from European lake whitefish (C. lavaretus). However, no firm evidence for genetic-based salinity adaptation has been available. Also, studies based on microsatellite and mitogenome data suggested surprisingly recent divergence (ca. 2,500 years bp) between houting and lake whitefish. These data types furthermore have provided no evidence for possible inbreeding. Finally, a controversial taxonomic revision recently classified all whitefish in the region as C. maraena, calling conservation priorities of houting into question. We used whole genome and ddRAD sequencing to analyze six lake whitefish populations and the only extant indigenous houting population. Demographic inference indicated postglacial expansion and divergence between lake whitefish and houting occurring not long after the Last Glaciation, implying deeper population histories than previous analyses. Runs of Homozygosity analysis suggested high inbreeding (F_ROH up to 30.6%) in some freshwater populations, but also F_ROH up to 10.6% in the houting prompting conservation concerns. Finally, outlier scans provided evidence for adaptation to high salinities in the houting. Applying a framework for defining conservation units based on current and historical reproductive isolation and adaptive divergence led us to recommend that the houting be treated as a separate conservation unit regardless of species status. In total, the results underscore the potential of genomics to inform conservation practices, in this case clarifying conservation units and highlighting populations of concern.

https://doi.org/10.5061/dryad.qfttdz0r0

Description of the data and file structure

The data encompasses multiple VCF files generated from ddRAD and whole-genome sequencing, respectively.

1) WGS.raw.chr01-40.vcf
Raw variant calls from WGS data.

2) WGS.chr01-40.filtered.collapsed.max2.hw1.miss1.vcf
Variant calls from file 1) filtered to remove indels, monomorphic sites, multi-allelic SNPs and SNPs with a variant quality <20 or extreme depth of coverage (lower than 400 or higher than 1000 across all individuals) and SNPs out of Hardy Weinberg equilibrium or located in putative duplicated regions. Contains only sites with no missing data.

3) WGS.chr01-40.ann.filtered.collapsed.max2.hw1.miss1.LD-pruned.vcf
An LD-pruned subset of markers from file 2).

4) WGS.chr01-40.ann.filtered.collapsed.max2.hw1.miss1.phased.vcf
Variant calls from file 2) with phase information obtained by statistical phasing with WhatsHap and SHAPEIT4.

5) WGS.chr01-40.filtered.collapsed.max2.HWE.minDP10.maxDP40.minGQ20.minQ30.miss1.MappabilityMask.RepeatMask.chr.vcf
Variant calls from file 2) further filtered to discard SNPs in regions of low mappability or repeat regions. Genotypes with low depth (<10) or low quality (<10) have been changed to missing. Contains only sites with no missing data. Chromosome names have been changed to numbers (1-40).

6) WGS.chr01-40.filtered.allSites.collapsed.hw1.vcf
Calls containing both variant and invariant sites filtered to discard SNPs out of Hardy Weinberg equilibrium or located in putative duplicated regions. No filtering for missing data has been performed.

7) RAD.chr01-40.all_snps.vcf
Raw variant calls from ddRADseq data with all SNPs per RAD loci retained.

8) RAD.chr01-40.all_snps.hw1.collapsed.vcf
Variant calls from file 7) further filtered to discard SNPs out of Hardy Weinberg equilibrium or located in putative duplicated regions.

8) RAD.chr01-40.all_snps.hw1.collapsed.phased.vcf
Variant calls from file 8) with phase information obtained by statistical phasing with WhatsHap and SHAPEIT4.

10) RAD.chr01-40.random_snp.vcf
Raw variant calls from ddRADseq data with only one random SNPs per RAD loci retained.

11) RAD.chr01-40.random_snp.hw1.collapsed.vcf
Variant calls from file 10) further filtered to discard SNPs out of Hardy Weinberg equilibrium or located in putative duplicated regions.

Furthermore, two so-called popmap files are supplied - WGS.popmap.txt and RAD.popmap.txt. The popmap files each consists of two columns; the first column contains sample names and the second contains population names.

Change Log

Dec 2024: Update to README wording. No file changes.

A. Sampling and DNA extraction

Samples of lake whitefish were collected in 1995-2012 from five locations in Denmark; brackish populations in the Ringkoebing fjord (RIN) and Nissum fjord (NIS), two lagoons connected with the North Sea, and freshwater populations from Lake Flynder (FLYN), Lake Glenstrup (GLEN) and Lake Nors (NORS); and a brackish population from one German location, Achterwasser (ACHT), a lagoon flowing into the Baltic Sea. Houting were collected from the single extant population in Vidaa (VID), a river with outlet into the Wadden Sea (Fig. 1A). Sampling was conducted by electrofishing (VID) and net fishing (remaining populations). Tissue samples consisted of adipose fin clips stored in ethanol at -20°C. DNA was extracted using either a phenol-chloroform method (Taggart et al., 1992) (ACHT, FLYN) or the E.Z.N.A.® Tissue DNA Kit (OMEGA, Bio-tek, CA, USA) following the manufacturer's recommendations (the remaining samples). In total, 35 individuals were whole-genome sequenced and 95 were ddRAD-sequenced (Table 1). A group of 23 individuals occur in both data sets and were consequently both ddRAD and whole-genome sequenced.

B. Whole-genome sequencing, mapping, and variant calling

Library construction (using insert size ~300 bp) and whole-genome sequencing was outsourced to BGI (Beijing Genomics Institute, Hongkong, China). Paired-end Illumina sequencing was conducted using the Illumina HiSeq 2500 platform with a read length of 150 bp. The sequence reads were mapped to the Coregonus sp. “Balchen” Alpine whitefish reference genome (De-Kayne et al. (2020); GenBank accession: GCA_902810595.1) using BWA-MEM v.0.7.17 (Li, 2013; Li & Durbin, 2009a) with default parameters. SAM format files were sorted, indexed and converted into BAM files using SAMtools v.1.9 (Danecek et al., 2021). Variants were called using BCFtools v.1.2 (Danecek et al., 2021) function mpileup and call with a minimum mapping quality requirement of 20. We used the ‘--multiallelic-caller’ for calling combined with ‘--variants-only’ to output only variant sites. To produce an ‘all sites’ data set containing both monomorphic and polymorphic sites, we repeated the SNP calling process without the ‘--variants-only’ parameter in BCFtools call.

C. WGS data set generation

We filtered the resulting VCF file containing variant sites with VCFutils.pl (Li et al., 2009b) and VCFtools v.0.1.16 (Danecek et al., 2011) to remove indels, monomorphic sites, multi-allelic SNPs and SNPs with a variant quality <20 or extreme depth of coverage (lower than 400 or higher than 1000 across all individuals) determined from the coverage distribution of SNPs (Fig. S1). The bimodal coverage distribution with two distinct peaks suggested the presence of paralogous loci, a well-known issue in salmonid fishes due to their tetraploid origin. In addition to excluding the variants in the higher coverage peak, which was centered at approximately twice the depth of the lower peak and thus likely represented duplicated regions, we also used VCFtools to discard SNPs located within putative duplicated genomic regions identified by De-Kayne et al. (2020). Furthermore, as loci with an excess of heterozygotes can also represent duplicated genomic regions, we removed SNPs out of Hardy-Weinberg equilibrium (HWE) in one or more populations using a custom R script (https://github.com/shenglin-liu/VCF_HWF). Tests for HWE were conducted using the statistic (Brown, 1970), where is Wright’s fixation index within populations and is the sample size. The statistic follows a standard normal distribution with a mean of 0 and a standard deviation of 1. Negative values denote heterozygote excess and positive values heterozygote deficit, and values > |1.96| are significant at the 5 % level. The effects of the individual filtering steps are detailed in Supplementary Table S1. The resulting data set, hereafter referred to as the ‘HW-filtered WGS data set’, contained 16,898,181 SNPs. Additionally, we produced a ‘LD-pruned WGS data set’ with the addition of 5 individuals of the alpine whitefish species C. arenicolus (AREN) as an outgroup (Extended methods S1) by pruning SNPs on the basis of linkage disequilibrium (LD) in the HW-filtered WGS data set. Pruning was performed with the indep-pairwise function in PLINK v.1.9 (Purcell et al., 2007), where SNPs with r²>0.1 were removed from sliding windows of 50 SNPs with 10 SNPs of overlap. A total of 596,078 SNPs remained after pruning.

The ‘all sites’ data set was filtered to remove indels and sites with extreme depth of coverage or located in putative duplicated regions and SNPs not in HWE, as detailed for the ‘variant sites’ data set above. No filtering for minor allele frequency or missing data was performed. After filtering, the VCF contained 1,181,919,736 sites with individuals exhibiting between x and y % missing genotypes.

D. Filtering for ROH analyses

We opted to further filter our the HW-filtered WGS data set to ensure only the most reliable genotype calls were retained. Following the protocol implemented in Balboa et al. (2024), we estimated mappability of the genome assembly with GENMAP v.1.3.0 (Pockrandt et al., 2020) using 100 bp k-mers and allowing for up to two mismatches, and we identified repetitive elements in the assembly with RepeatMasker v.4.1.2 (Smit et al., 2013) using ‘rmblast’ as the search engine and ‘Actinopterygii’ (ray-finned fishes) as the query species. Repeat regions and sites with a mappability score <1 were excluded from the analyses. In addition to the extreme depth filters applied as previously described, we furthermore used VCFtools to change individual genotypes with very low (DP<10) or very high read depth (DP>40) and genotypes with low quality (GQ<30) to missing (./.). Finally, only SNPs with variant quality (QUAL) >30 and no missing data were kept, resulting in a data set containing 2,646,198 SNPs.

E. ddRAD sequencing, mapping, and loci assembly

Samples were prepared using ddRADseq (Peterson et al., 2012). The ddRADseq libraries used PstI (6-base) and MspI (4-base) restriction enzymes. Two libraries of equal size were constructed (using insert size of 200-500 bp) and sequenced on an Illumina HiSeq2000 platform with 100 bp paired-end reads at BGI (Hong Kong, China). Raw reads were cleaned and demultiplexed with process_radtags in Stacks v.2.55 (Catchen et al., 2011; Catchen et al., 2013) in addition to being truncated to 90 bp (-t 90). Low-quality reads (phred score < 10 over a sliding window of 15% of the read length) were discarded. Mapping of reads to the Alpine whitefish reference genome (De-Kayne et al., 2020) progressed as described for the whole-genome sequencing data. Loci were assembled from the aligned and sorted reads using gstacks v.2.55 with default parameters.

F. ddRADseq data set generation

The populations program in Stacks (Catchen et al., 2011; Catchen et al., 2013) was used to generate a preliminary VCF file including only loci present in all six populations (-p 6; GLEN was not analyzed by ddRAD sequencing) and at least 70 percent of individuals within each population (-r 0.7). Exports were ordered (--ordered-export) to ensure that only a single representative of each overlapping site was included. Loci out of HWE in one or more populations were filtered out using a custom R script, as previously described. Based on this data set, five individuals (two from ACHT, one from each of the populations NIS, NORS, and RIN) with more than 10 % missing data were identified. We then generated a new VCF file excluding these five individuals using populations with parameters as previously stated, yielding a total of 347,397 SNPs, and a second VCF file with data analysis restricted to one random SNP per locus, yielding 141,157 SNPs. Both files were filtered to remove SNPs located within potentially duplicated regions of the genome (De-Kayne et al., 2020) and SNPs out of HWE in one or more populations as described for WGS data. A total of 254,693 SNPs and 105,452 SNPs, respectively, remained after filtering.