Skip to main content

Data from: Post‐glacial establishment of locally adapted fish populations over a steep salinity gradient

Cite this dataset

Leder, Erica H. et al. (2020). Data from: Post‐glacial establishment of locally adapted fish populations over a steep salinity gradient [Dataset]. Dryad.


Studies of colonization of new habitats that appear from rapidly changing environments are interesting and highly relevant to our understanding of divergence and speciation. Here, we analyse phenotypic and genetic variation involved in the successful establishment of a marine fish (sand goby, Pomatoschistus minutus) over a steep salinity drop from 35 PSU in the North Sea (NE Atlantic) to two PSU in the inner parts of the post-glacial Baltic Sea. We first show that populations are adapted to local salinity in a key reproductive trait, the proportion of motile sperm. Thereafter, we show that genome variation at 22,190 single nucleotide polymorphisms (SNPs) shows strong differentiation among populations along the gradient. Sequences containing outlier SNPs and transcriptome sequences, mapped to a draft genome, reveal associations with genes with relevant functions for adaptation in this environment but without overall evidence of functional enrichment. The many contigs involved suggest polygenic differentiation. We trace the origin of this differentiation using demographic modelling and find the most likely scenario is that at least part of the genetic differentiation is older than the Baltic Sea and is a result of isolation of two lineages prior to the current contact over the North Sea–Baltic Sea transition zone.


Genomic DNA was extracted using a simplified CTAB-protocol. In short, fragments of fins and muscles were digested in CTAB-buffer, mercaptoethanol and Proteinase K. RNA was digested by RNase A. Thereafter, DNA extracted with chloroform-isoamylate and precipitated with isopropanol. The DNA was washed with 70% alcohol and stored in 10% TE-buffer at −70°C before being sent to SciLife Stockholm for sequencing.

Thirteen libraries of varied insert size (160–510 bases) and read length (101–301 bases) were generated and used for paired-end sequencing on an Illumina HiSeq 2000 or 2500. In addition, two long-insert size mate-pair libraries (2 and 5 kb) were sequenced on Illumina HiSeq 2500. Sequences were processed with cutadapt v1.12 (Martin, 2011) to remove partial adapters and bad quality sequences. Paired reads with one read shorter than 40 bases were also removed. Flash v1.2.11 (Magoč & Salzberg, 2011) was used to merge paired-end reads that had 10 or more bases overlapping. Lighter v1.1.1 (Song, Florea, & Langmead, 2014) was then used to correct sequencing errors via kmer correction. Sequences were scanned using a kmer length of 23 and an estimated genome size of 0.9 GB. Reads were trimmed or discarded if they could not be corrected. A python script ( was used to re-pair fastq files since Lighter may remove one sequence of a pair without removing the mate. Soapdenovo2 v2.04 (Luo et al., 2012) was then used to assemble a draft genome using kmer size 29, and GapCloser v1.12 was used to fill in the gaps in the scaffolds. Input sequence libraries, insert sizes, read numbers and the Soapdenovo2 config file are available in the manuscript Supplementary Information.

RNA was extracted from tissues with phenol-based phase separation (Tri-reagent, Ambion, Life Technologies) following the Trizol-protocol from SIGMA with minor modifications as follows: 1-bromo-3-chloropropane (SIGMA) was used instead of chloroform, and RNA samples were washed twice with 75% EtOH. After the first extraction with Tri-reagent DNAse, treatment with RQ1 (RNAsefree DNAse, Promega) was performed to ensure no traces of DNA were left in the RNA. To remove any residue of the enzyme and DNA, the RNA samples were extracted a second time with amounts of reagents downscaled for 400 µl of Tri-reagent. The RNA samples were dissolved into 30 µl of nuclease-free H2O. The RNA concentration was measured with Nanodrop prior to analysing quality on the Bioanalyzer (Agilent). 

RNA sequences from testicular tissue from one individual collected at Kristineberg in 2012 and brain, liver, spleen and ovary from two individuals (brain from one and liver, spleen and ovary from the other) collected at Tjärnö in June 2013 were used to assemble transcriptomes using Trinity v2.6.6 (Grabherr et al., 2011) and parameters for a stranded library (RF), no read normalization and a minimum kmer coverage of 2. GMAP v2019-06-10 (Wu, Reeder, Lawrence, Becker, & Brauer, 2016) was used to map the assembled transcripts to the reference genome in order to annotate coding regions. We used Diamond (Buchfink, Xie, & Huson, 2015) to functionally annotate each transcript, and similarity searches were performed for each transcript against the zebrafish, Danio rerio (Ensembl 96 and Uniprot Reference Proteome, release 2019_04) and mudskipper, Periophthalmus magnuspinnatus (Ensembl 96) protein databases using the parameters --sensitive --frameshift 15 --top 5 --min-score 100 --id 60 --unal 1 --outfmt 6 and keeping the blast trace-back operations string (btop) and the query sequence matching the target (qseq). The logic for this procedure is that frameshifts are often introduced into the transcript sequence during the Trinity assembly. By using the strand information provided by the btop string to adjust the frameshift, a single open reading frame can be obtained from the transcript. We parsed the query sequence according to the btop string, thus removing any frameshifts and then translated them using the tool fastatranslate (in reading frame one) from Exonerate v2.2. CD-HIT v4.6.8 (Li & Godzik, 2006) was used to cluster the transcript sequences in order to remove some of the redundant amino acid sequences (one representative sequence per cluster where retained). The resulting sequences were used in Orthofinder to determine the most likely orthologues (Emms & Kelly, 2015). In order to identify orthogroups in Orthofinder, we used the Ensembl 96 protein databases for tigertail seahorse (H. comes), three-spined stickleback, zebrafish, and mudskipper as well as the Uniprot reference proteome (release 2019_4) for zebrafish.


Usage notes

The assembled genome in fasta format is provided. It contains all assembled scaffolds and contigs over 100 bases: SandGobyGenome_v1.fa.gz.

The transcriptomes for each tissue are provided in a tar archive transcritomeFasta.tar (within the archive in the form: tissueName.fasta.gz) and for each tissue the Gff3 file transcritomeGff3.tar (within the archive in the form: tissueName.gff3.gz) generated by GMAP (mapping assembled transcripts back to genome). These Gff3 files were used to annotate the outliers in the genome by linkage of an expressed transcript to an outlier position. The gene information was predicted using Orthofinder based on the predicted amino acid sequences for each transcript as described in the manuscript: (all tissues merged - All_AAseqs.fa.gz).


Academy of Finland, Award: 136464

Swedish Research Council, Award: 2016-03343

Swedish Research Council, Award: 217-2008-1719

Swedish Research Council, Award: 621-2011-4004

Research Foundation - Flanders, Award: W0.037.10 N