Skip to main content

Genomics of humic adaptation in Eurasian perch (Perca fluviatilis): SNP genotypes of 32 perch individuals, supplementary figures and tables

Cite this dataset

Ozerov, Mikhail et al. (2022). Genomics of humic adaptation in Eurasian perch (Perca fluviatilis): SNP genotypes of 32 perch individuals, supplementary figures and tables [Dataset]. Dryad.


Extreme environments are inhospitable to the majority of species, but some organisms are able to survive in such hostile conditions due to evolutionary adaptations. For example, modern bony fishes have colonized various aquatic environments, including perpetually dark, hypoxic, hypersaline and toxic habitats. Eurasian perch (Perca fluviatilis) is among the few fish species of northern latitudes that is able to live in very acidic humic lakes. Such lakes represent almost “nocturnal” environments; they contain high levels of dissolved organic matter, which in addition to creating a challenging visual environment, also affects a large number of other habitat parameters and biotic interactions. To reveal the genomic targets of humic-associated selection, we performed whole-genome sequencing of perch originating from 16 humic and 16 clear-water lakes in northern Europe. We identified over 800,000 SNPs, of which >10,000 were identified as potential candidates under selection (associated with >3,000 genes) using multiple outlier approaches. Our findings suggest that adaptation to the humic environment may involve hundreds of regions scattered across the genome. Putative signals of adaptation were detected in genes and gene families with diverse functions, including organism development and ion transportation. The observed excess of variants under selection in regulatory regions highlights the importance of adaptive evolution via regulatory elements, rather than via protein sequence modification. Our study demonstrates the power of whole-genome analysis to illuminate multifaceted nature of humic adaptation and provides the foundation for further investigation of causal mutations underlying phenotypic traits of ecological and evolutionary importance.


Biological samples

In total, 32 perch individuals were sampled from 16 humic and 16 clear-water lakes in northern Europe (a single specimen per lake, as in Jones et al. (2012); Fig. 1a, Table S1). The selection of lakes for the analysis was based on drastic differences in water color (medianhumic = 212.5; range = 95.0 – 765.0 vs. medianclear = 20.0; range = 7.5 – 60.0, non-parametric Mann-Whitney test P < 0.001, Fig. S1) and geographic proximity of different type of lakes to increase the power of detecting loci under selection and to minimize the effect of population structuring (De Mita et al., 2013; Lotterhos & Whitlock, 2015; Hoban et al. 2016). Most fish were caught using a gill-net, beach seine or rod (Table S1). Egg ribbons were collected from two Swedish lakes (Nedre Björntjärnen and Snottertjärn) and transported to laboratory facilities (Institute of Freshwater Research, Drottningholm, Sweden). After hatching, fry were euthanized using a benzocaine overdose and stored in 96% ethanol without measuring fork length, body mass or determining sex. The fish from the remaining lakes were sacrificed with a sharp blow to the head. Thereafter, fork or total length (to the nearest mm) and wet body mass (to the nearest g) were measured, sex was determined by visual examination of the gonads, and a tissue sample (pelvic fin) was placed in 96% ethanol for DNA extraction.

DNA extraction and sequencing

Total genomic DNA (gDNA) was extracted from the tissue samples using NucleoSpin® Tissue kit (Macherey-Nagel, Dueren, Germany) according to the manufacturer’s protocol. The quality of the total gDNA was assessed using Fragment Analyzer (Advanced Analytical), and the concentration was measured using Qubit® Fluorometric Quantitation (Life Technologies). Sequencing libraries (average insert size 350 bp) were constructed from 1 mg of gDNA for each individual according to the Illumina TruSeq® DNA PCR-Free Library Preparation Guide (part #15036187). A unique Illumina TruSeq indexing adapter was ligated to each gDNA sample. The samples were normalized and pooled for the automated cluster preparation using Illumina cBot station. 16 libraries (each corresponding to one perch sample from Estonia) were pooled and sequenced in 8 lanes on Illumina HiSeq 3000 using paired-end sequencing (2 × 150 bp read length with 8 bp index), and another 16 libraries (each corresponding to one perch sample from Finland, Lithuania and Sweden) were pooled and sequenced in 4 lanes on Illumina NovaSeq 6000 using paired-end sequencing (2 × 150 bp read length with 8 bp index).

SNP calling and filtering

Read quality was assessed using FastQC v.0.11.8 (Andrews, 2017). Illumina adapters, as well as short (< 60 bp) and low quality reads (average quality score < 25) were trimmed using Trimmomatic v.0.36 (Bolger, Lohse, & Usadel, 2014; CROP:149 HEADCROP:9 SLIDINGWINDOW:5:25 MINLEN:60) for the 16 samples sequenced on the HiSeq 3000 instrument, or fastp v.0.20 (Chen, Zhou, Chen, & Gu, 2018) for the 16 samples sequenced on the NovaSeq 6000 instrument, with the same parameters (-g -w 12 -r -W 5 -M 25 --trim_front1 9 --trim_front2 9 --trim_tail1 2 --trim_tail2 2 -l 60) due to the excess of polyG tails in the latter (Arora et al., 2019; De-Kayne et al., 2021).

Filtered sequence reads of each individual were mapped to the Eurasian perch reference genome (NCBI: GCA_010015445.1) using Bowtie2 v. (Langmead, Trapnell, Pop, & Salzberg, 2009), applying default parameters except the modified score minimum threshold (--score-min L,-0.3,-0.3) and maximum fragment length for valid paired-end alignments (-X 700). Mean coverage (d) across all 32 perch individuals was 29.9, and varied from 24.0 to 40.6. Raw SNPs were called using two alternative pipelines. First, SAMtools v.1.10 (Li, 2011) was applied on the locally realigned and sorted BAM files (samtools mpileup -uIg -t DP,AD,INFO/AD,ADF,ADR,SP -q 20) with the following variant calling using bcftools v.1.8 (Li, 2011). Second, HaplotypeCaller subroutine from GATK v. (McKenna et al., 2010) with similar parameters was applied to call variants using the same BAM files (-ERC GVCF --minimum-mapping-quality 20 -mbq 13 --indel-size-to-eliminate-in-ref-model 12 -G AS_StandardAnnotation -G StandardAnnotation) with the following import of single-sample GVCFs into GenomicsDB using GenomicsDBImport, and final calling of consensus genotypes with GenotypeGVCFs. Further quality filtering of raw variants generated by the two pipelines was performed using VCFtools v.0.1.15 (Danecek et al., 2011) retaining only the variants that met the following criteria: (i) minimum and maximum mean sequencing depth (d) was set to 10 (as a minimum recommended by Illumina) and 66 (max depth = dmax + 4√dmax; Li (2014)), respectively; (ii) the consensus quality was ≥ 30; (iii) a variant had at least two copies of an allele; (iv) no missing data were allowed; (v) only bi-allelic sites were included; (vi) a variant did not occur in repetitive genomic regions (--max-meanDP 66 --min-meanDP 10 --max-missing 1 --mac 2 --min-alleles 2 --max-alleles 2 --minQ 30 --exclude-bed Pfluv_repeats.bed).

In total, 1,078,577 and 1,074,879 SNPs were retained after samtools-bcftools and GATK pipelines, respectively. To ensure the quality and reliability of the dataset, only the variants consistently called by both pipelines were retained (1,025,544 SNPs). To further exclude false heterozygous calls and regions with an excess of variable sites (such as multilocus contigs, paralogs, repetitive elements and otherwise superficially similar sequences; Hohenlohe, Amish, Catchen, Allendorf, & Luikart, 2011; Willis, Hollenbeck, Puritz, Gold, & Portnoy, 2017; O'Leary, Puritz, Willis, Hollenbeck, & Portnoy, 2018), the additional quality control filtering was applied: number of heterozygotes per SNP ≤ 16, MAF ≥ 0.05, which resulted in 810,591 SNP loci in the final dataset.

The identified SNPs were annotated using SnpEff v.5.0 (Cingolani et al., 2012). The SnpEff database was generated using the Eurasian perch reference genome sequence and annotation (NCBI: GCA_010015445.1).


Swedish Research Council, Award: 2020-03916

Estonian Research Council, Award: PRG852

Ella and Georg Ehrnrooth Foundation, Award: Mikhail Ozerov (personal grant)

Swedish University of Agricultural Sciences, Award: Anti Vasemägi (start-up grant)