Supporting Data for: Ocean-wide conservation genomics of blue whales suggest new Northern Hemisphere subspecies
Data files
Jan 08, 2025 version files 272.63 MB
- genomeSNPs.vcf.gz (256.78 MB)
- mt_Marker.vcf.gz (35.75 KB)
- mt_variances.vcf.gz (1.41 MB)
- README.md (4.21 KB)
- y_variances.vcf.gz (14.40 MB)
Abstract
The blue whale is an endangered and globally distributed species of baleen whale with multiple described subspecies, including the morphologically and molecularly distinct pygmy blue whale, among others. North Atlantic and North Pacific populations, however, are currently regarded as a single subspecies despite being separated by continental land masses and differing in their acoustic communication. To determine the degree of isolation among the Northern Hemisphere populations, fourteen North Pacific and six Western Australian blue whale nuclear and mitochondrial genomes were sequenced and analyzed together with eleven publicly available North Atlantic blue whale genomes. This made it possible to contrast the genetic differentiation and genetic exchange among Northern Hemisphere populations with those relative to the Western Australian pygmy blue whale subspecies. Population genomic analyses revealed distinctly differentiated clusters and limited exchange among all three populations, indicating a high degree of isolation. Nevertheless, the genomic and mitogenomic distances between all blue whale populations, including the Western Australian pygmy blue whale, are low when compared to other inter-subspecies distances in cetaceans. Given that the pygmy blue whale is already a recognized subspecies, and with further support from previously reported acoustic differences, we propose treating the two Northern Hemisphere populations equally as subspecies, namely Balaenoptera musculus musculus (North Atlantic blue whale) and Balaenoptera musculus sulfureus (North Pacific blue whale). Furthermore, a first molecular vitality assessment of all three populations found a generally high genomic diversity among blue whales but also a lack of rare alleles, non-neutral evolution and substantially increased effects of inbreeding. This suggests a substantial anthropogenic impact on the genotypes of blue whales and calls for careful monitoring in future conservation plans.
README: Supporting Data for: Ocean-wide conservation genomics of blue whales suggest new Northern Hemisphere subspecies
[Access this dataset on Dryad: https://doi.org/10.5061/dryad.47d7wm3jz]
We sequenced the genomes of 20 blue whale specimens from the North Pacific blue whale and Western Australian pygmy blue whale populations and analyzed the data together with 12 publicly available genomes of other blue whales, including those of North Atlantic blue whales (Jossey et al., 2024). The data were used for SNP calling on genomic, mitogenomic, mt_Marker region, and Y-chromosomal data. The SNPs were subsequently used for population genetic analyses of gene flow, genetic divergence, phylogenetic reconstruction and genetic viability.
Description of the data and file structure
In this data repository, we provide filtered, high-quality SNPs called from our genomic resequencing data after mapping to the blue whale reference genome of the Vertebrate Genome Project (see link below). The data are divided into variants of genomic, mitogenomic, mt_Marker region, and Y-chromosomal origin.
Sharing/Access information
Raw re-sequencing data can be found at:
- [https://www.ncbi.nlm.nih.gov/bioproject?LinkName=sra_bioproject&from_uid=27424897]
- [https://doi.org/10.1111/mec.17619]
Raw re-sequencing data generated by Jossey et al. 2024 can be found at:
- [https://www.ncbi.nlm.nih.gov/bioproject?LinkName=sra_bioproject&from_uid=14242536]
- [https://doi.org/10.1007/s10592-023-01584-5]
Data was mapped to the following reference genome:
- [https://www.genomeark.org/genomeark-all/Balaenoptera_musculus.html]
- [https://doi.org/10.1093/molbev/msae036]
Variant data
1.) genomeSNPs.vcf.gz: A set of high-quality genome-wide single nucleotide polymorphisms (SNPs), filtered as outlined in the Methods section. The SNPs are stored in a standard gzip-compressed VCF file; the VCF format is described in Danecek et al. (2011). Use “gunzip genomeSNPs.vcf.gz” to extract the original file.
2.) mt_variances.vcf.gz: A set of filtered variants (including monomorphic sites) from whole mitogenome data, filtered as outlined in the Methods section. The variants are stored in a standard gzip-compressed VCF file; the VCF format is described in Danecek et al. (2011). Use “gunzip mt_variances.vcf.gz” to extract the original file.
3.) y_variances.vcf.gz: A set of filtered variants (including monomorphic sites) from whole Y-chromosome data, filtered as outlined in the Methods section. The variants are stored in a standard gzip-compressed VCF file; the VCF format is described in Danecek et al. (2011). Use “gunzip y_variances.vcf.gz” to extract the original file.
4.) mt_Marker.vcf.gz: A set of filtered variants (including monomorphic sites) from the 403 bp mitochondrial control region (Rosel, Dizon, & Heyning, 1994). The variants are stored in a standard gzip-compressed VCF file; the VCF format is described in Danecek et al. (2011). Use “gunzip mt_Marker.vcf.gz” to extract the original file.
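The files can also be inspected without extracting them first. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the dataset or the pipeline, and the sketch assumes the files follow the standard tab-separated VCF layout with sample names from the tenth header column onward.

```python
import gzip


def summarize_vcf_lines(lines):
    """Return (sample_names, n_records) from an iterable of VCF text lines."""
    samples, n_records = [], 0
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed VCF fields; the remaining ones are samples.
            samples = line.rstrip("\n").split("\t")[9:]
        elif line.strip():
            n_records += 1  # one data line per site
    return samples, n_records


def summarize_vcf(path):
    """Read a gzip-compressed VCF in text mode without writing it to disk."""
    with gzip.open(path, "rt") as handle:
        return summarize_vcf_lines(handle)
```

Calling summarize_vcf("mt_Marker.vcf.gz") would return the list of sample names and the number of sites in the control-region file. Because the file is streamed line by line, the same call should also work on the much larger genomeSNPs.vcf.gz, although it will take longer.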
Usage notes
All variant sets are provided as gzip-compressed VCF files and can be viewed and modified with BCFTOOLS (Danecek et al., 2021). While the genome-wide set contains only SNPs, the haploid mitochondrial and sex-chromosomal sets also contain conserved (monomorphic) sites.
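Because the haploid sets mix conserved and variable sites, it can help to separate the two before running analyses that expect SNPs only. The sketch below is an assumption-laden illustration, not the authors' code: it assumes conserved sites are written with “.” in the ALT column, which is the usual VCF convention but has not been verified for these particular files. The function names are hypothetical.

```python
from collections import Counter


def classify_record(line):
    """Classify one VCF data line as 'invariant', 'snp', or 'other'."""
    fields = line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    if alt == ".":
        return "invariant"  # assumed encoding of a conserved site
    alleles = [ref] + alt.split(",")
    # A SNP here means every allele is a single A/C/G/T base.
    if all(len(a) == 1 and a.upper() in "ACGT" for a in alleles):
        return "snp"
    return "other"  # e.g. indels or symbolic alleles


def count_site_types(lines):
    """Count site classes over all data lines, skipping header lines."""
    return Counter(
        classify_record(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    )
```

Applied to the lines of a decompressed mitochondrial or Y-chromosomal file, count_site_types gives a quick check of how many sites are conserved versus variable.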
Code/Software
RESEQ-to-Popanalyses_v.1.0.tar.gz: A tarball containing the main pipeline script (written in UNIX bash) and all necessary sub-scripts and auxiliary files. Use “tar -xzvf RESEQ-to-Popanalyses_v.1.0.tar.gz” to extract it. An extensive README file included in the archive explains installation and usage.
Methods
We sequenced the genomes of 20 blue whale specimens from the North Pacific blue whale and Western Australian pygmy blue whale populations and analyzed the data together with 12 publicly available genomes of other blue whales, including those of North Atlantic blue whales (Jossey et al., 2024). This sampling was further complemented by three genomes of the closely related sei whale (Balaenoptera borealis), one of which was sequenced in this study.
All Illumina paired-end libraries were prepared by Novogene, Cambridge, United Kingdom using the NEBNEXT DNA LIBRARY PREP kit with a read length of 150 base pairs (bp) and an insert size of 350 bp. Illumina sequencing was performed on a NovaSeq 6000 platform targeting ~20x coverage per individual.
A comprehensive pipeline used to process the data and perform many of the downstream analyses presented here can be found on GitHub: mag-wolf/RESEQ-to-Popanalyses/.
Short-read data were trimmed for quality and adapter sequences using FASTP v0.23.2 (Chen, Zhou, Chen, & Gu, 2018) with the options “-g -3 -l 40 -y -c -p”. Trimmed reads were mapped to a repeat-masked, high-quality blue whale reference genome constructed by the authors of the Vertebrate Genome Project (Bukhman et al., 2024). Mapping was performed using BWA MEM v0.7.17-r1188 (http://bio-bwa.sourceforge.net), and alignments were sorted with SAMtools v1.9 sort (Danecek et al., 2021), both using default settings. Potential duplicates were removed and read groups were added using the PICARD v2.21.2-0 toolkit (https://broadinstitute.github.io/picard/). Genotype calling of autosomes, sex chromosomes and mitochondrial genomes was done on all individuals combined and on each individual independently using BCFTOOLS v1.12 MPILEUP and BCFTOOLS v1.12 CALL (Danecek et al., 2021) with the respective “-m” or “-c” flag and minimum mapping- and base-quality cutoffs of 20 and 13, respectively. For the sex-chromosome and mitochondrial genome data, BCFTOOLS CALL was run with the “--ploidy 1” flag to account for haploidy. All inferred sites were further filtered with the BCFTOOLS filter function by excluding sites with divergent read coverage (>3-fold or <0.3-fold of the expected individual mean coverage) and sites with more than 25% missing data. The combined genotype set was further processed by removing multiallelic and monomorphic sites with VCFTOOLS v0.1.16 (Danecek et al., 2011) to retain single nucleotide polymorphisms (SNPs), and by thinning SNPs with the BCFTOOLS THINN function using a window size of 1000 bp to account for the effects of linkage disequilibrium. To obtain variants of the mitochondrial control region (mt_Marker), we extracted the respective 403 bp region from the mitogenomic VCF file using BCFTOOLS VIEW “-r” (see Rosel et al., 1994).
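The coverage and missing-data thresholds and the 1000 bp thinning window described above can be illustrated with a small sketch. This is not the pipeline's implementation, which uses BCFTOOLS and VCFTOOLS. The function names are hypothetical, how the depth criterion was combined across individuals is not stated in the text, and the greedy thinning rule shown here is a generic approach that may differ from what the tool applies.

```python
def depth_in_range(depth, mean_depth, low=0.3, high=3.0):
    """True if depth lies within 0.3-fold to 3-fold of the individual's mean."""
    return low * mean_depth <= depth <= high * mean_depth


def missing_fraction(genotypes):
    """Fraction of GT strings that are missing ('.', './.' or '.|.')."""
    missing = sum(1 for gt in genotypes if set(gt) <= set("./|"))
    return missing / len(genotypes)


def passes_missingness(genotypes, max_missing=0.25):
    """Keep a site unless more than 25% of its genotypes are missing."""
    return missing_fraction(genotypes) <= max_missing


def thin_positions(sites, window=1000):
    """Greedily keep sites at least `window` bp after the last kept site.

    `sites` is an iterable of (chrom, pos) tuples, sorted by position
    within each chromosome.
    """
    kept, last_kept = [], {}
    for chrom, pos in sites:
        if chrom not in last_kept or pos - last_kept[chrom] >= window:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept
```

With the targeted ~20x coverage, depth_in_range(20, 20) accepts a typical site, while depths of 5 or 61 fall outside the 0.3-fold to 3-fold band and would be rejected.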
References:
Jossey S, Haddrath O, Loureiro L, et al. (10 co-authors). 2024. Population structure and history of North Atlantic Blue whales (Balaenoptera musculus musculus) inferred from whole genome sequence analysis. Conserv Genet. 25(2):357–371.
Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics (Oxford, England), 34(17), i884-i890. https://doi.org/10.1093/bioinformatics/bty560
Bukhman YV, Morin PA, Meyer S, et al. (33 co-authors). 2024. A High-Quality Blue Whale Genome, Segmental Duplications, and Historical Demography. Molecular biology and evolution. 41(3).
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., . . . Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics (Oxford, England), 27(15), 2156–2158. https://doi.org/10.1093/bioinformatics/btr330
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., . . . Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2). https://doi.org/10.1093/gigascience/giab008
Rosel, P. E., Dizon, A. E., & Heyning, J. E. (1994). Genetic analysis of sympatric morphotypes of common dolphins (genus Delphinus). Marine Biology, 119(2), 159–167. https://doi.org/10.1007/BF00349552