
Data from: The laboratory domestication of zebrafish: from diverse populations to inbred substrains

Cite this dataset

Suurväli, Jaanus et al. (2019). Data from: The laboratory domestication of zebrafish: from diverse populations to inbred substrains [Dataset]. Dryad.


We know from human genetic studies that practically all aspects of biology are strongly influenced by the genetic background, as reflected in the advent of ‘personalized medicine’. Yet, with few exceptions, this is not taken into account when using laboratory populations as animal model systems for research in these fields. Laboratory strains of zebrafish (Danio rerio) are widely used for research in vertebrate developmental biology, behaviour and physiology, for modelling diseases, and for testing pharmaceutic compounds in vivo. However, all of these strains are derived from artificial bottleneck events and therefore are likely to represent only a fraction of the genetic diversity present within the species.

Here we use Restriction site-Associated DNA sequencing (RAD-seq) to genetically characterize wild populations of zebrafish from India, Nepal and Bangladesh, and to compare them to previously published data on four common laboratory strains. We measure nucleotide diversity, heterozygosity and allele frequency spectra, and find that wild zebrafish are much more diverse than laboratory strains. Further, in wild zebrafish there is a clear signal of GC-biased gene conversion that is missing in laboratory strains. We also find that zebrafish populations in Nepal and Bangladesh are most distinct from all other strains studied, making them an attractive subject for future studies of zebrafish population genetics and molecular ecology. Finally, isolates of the same strains kept in different laboratories show a pattern of ongoing differentiation into genetically distinct substrains. Together, our findings broaden the basis for future genetic, physiological, pharmaceutic and evolutionary studies in Danio rerio.


All data used for producing this dataset originate from RAD-sequencing of wild and laboratory zebrafish genomic DNA digested with SbfI. Sequences were mapped to GRCz11 using BWA-MEM (version 0.7.17-r1188). Samtools (version 1.9) was used to filter out unmapped reads and non-primary alignments. Variant calling was performed with Stacks (version 2.4). Variants were annotated using Ensembl Variant Effect Predictor.
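The filtering step can be illustrated with the flag bits defined in the SAM specification. The sketch below is not the command used in the study, which is not given here; it only shows the logic of dropping unmapped reads (bit 0x4) and non-primary alignments (bit 0x100 for secondary). Whether supplementary alignments (bit 0x800) were also excluded is an assumption, so it is exposed as the hypothetical `drop_supplementary` switch. The function names are illustrative. With Samtools itself, a comparable effect is obtained by excluding those bits with the `-F` option of `samtools view`.

```python
# SAM flag bits as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_alignment(flag, drop_supplementary=True):
    """Return True if an alignment with this SAM flag survives the filter.

    drop_supplementary is an assumption: the dataset description only
    says "non-primary alignments" were removed.
    """
    mask = UNMAPPED | SECONDARY
    if drop_supplementary:
        mask |= SUPPLEMENTARY
    return (flag & mask) == 0


def filter_sam_lines(lines, drop_supplementary=True):
    """Yield header lines and the alignment lines that pass the filter."""
    for line in lines:
        if line.startswith("@"):
            # Header lines carry no flag and are passed through unchanged.
            yield line
            continue
        # The FLAG field is the second tab-separated column of a SAM record.
        flag = int(line.split("\t")[1])
        if keep_alignment(flag, drop_supplementary):
            yield line
```

A mapped primary alignment (for example flag 0 or 16) is kept, while flags 4, 256 and, under the stated assumption, 2048 are dropped.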

Usage notes

The allele in the "reference" column of the .vcf was determined in the associated study itself and does not always correspond to the reference genome.


  • CHT - Wild fish from Chittagong, Bangladesh
  • KHA - Wild fish from Khair Khola, Nepal
  • UT - Wild fish from Uttarbhag, India
  • CB - Wild-derived fish from Cooch Behar, India
  • AB, EKW, Nadia, TU, WIK - laboratory strains

Files in the dataset:

  • wildlabdanio.vcf.gz - polymorphic sites identified in data from wild and laboratory zebrafish
  • wildlabdanio.sumstats.tsv - output file from the populations module of Stacks v2.4
  • wildlabdanio.vep.tsv - functional annotation of variants in the .vcf
  • wildlabdanio.vep_summary.html - summary of the functional annotation
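
As a starting point for working with the files listed above, the following minimal sketch reads wildlabdanio.vcf.gz using only the Python standard library and tallies single-nucleotide substitution types. It relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT); the function names are illustrative and not part of the dataset. Note that, as stated in the usage notes, the REF allele in this file does not always match the GRCz11 reference genome, so the counts describe REF>ALT as recorded in the file and not mutations relative to the genome assembly.

```python
import gzip
from collections import Counter

BASES = {"A", "C", "G", "T"}


def parse_vcf_lines(lines):
    """Yield (chrom, pos, ref, alt) for each data line of a VCF."""
    for line in lines:
        if line.startswith("#"):
            # Skip meta-information ("##") and the column header ("#CHROM").
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        yield chrom, int(pos), ref, alt


def read_vcf(path):
    """Yield records from a gzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        yield from parse_vcf_lines(handle)


def substitution_counts(records):
    """Count REF>ALT pairs, keeping single-base substitutions only."""
    counts = Counter()
    for _chrom, _pos, ref, alt in records:
        # ALT may list several alleles separated by commas.
        for allele in alt.split(","):
            if ref in BASES and allele in BASES:
                counts[ref + ">" + allele] += 1
    return counts


# Example use, assuming the file is in the working directory:
# print(substitution_counts(read_vcf("wildlabdanio.vcf.gz")))
```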


Deutsche Forschungsgemeinschaft, Award: SPP1819

National Science Foundation, Award: DEB-1652278

National Science Foundation, Award: IOS-1257562