Supporting Data for: The genome of the pygmy right whale illuminates the evolution of rorquals
Data files
Mar 24, 2023 version files 7.29 GB
- Cmar_C18_SBIK-F_TBG_v1_amino.fasta.gz
- Cmar_C18_SBIK-F_TBG_v1_annotation.gff.gz
- Cmar_C18_SBIK-F_TBG_v1_cds.fasta.gz
- Cmar_C18_SBIK-F_TBG_v1_funct-annotation.gff3.gz
- Cmar_C18_SBIK-F_TBG_v1.fasta.gz
- Phylo_SBIK-F_TBG_SCOS_align.tar.gz
- Phylo_SBIK-F_TBG_SCOS_trees.tar.gz
- Phylo_SBIK-F_TBG_SCOS.tar.gz
- Phylo_SBIK-F_TBG_SNPs.vcf.gz
- Phylo_SBIK-F_TBG_WGA_v1_fragments.tar.gz
- Phylo_SBIK-F_TBG_WGA_v1_fragmenttrees.tar.gz
- Phylo_SBIK-F_TBG_WGA_v1.maf.gz
- README.md
- Select_SBIK-F_TBG_SCOS_axt.tar.gz
- Select_SBIK-F_TBG_SCOS_clustaln.tar.gz
- Select_SBIK-F_TBG_SCOS.tar.gz
Abstract
Baleen whales are a clade of gigantic and highly specialized marine mammals. Their genomes have been used to investigate their complex evolutionary history and to decipher the molecular mechanisms that allowed them to reach these dimensions. However, many unanswered questions remain, especially about the early radiation of rorquals and how cancer resistance interplays with their huge number of cells. The pygmy right whale is the smallest and most elusive of the baleen whales. It reaches only a fraction of the body length of its relatives and is the only living member of an otherwise extinct family. This placement makes the pygmy right whale genome an interesting target for revisiting the complex phylogenetic past of baleen whales, because it splits up an otherwise long branch that leads to the radiation of rorquals. In addition, genomic data from this species might help to investigate cancer resistance in large whales, since these mechanisms are not as important for the pygmy right whale as for the giant rorquals and right whales.
Results
Here, we present a first de novo genome of the species and test its potential in phylogenomics and cancer research. To do so, we constructed a multi-species coalescent tree from fragments of a whole-genome alignment and quantified the amount of introgression in the early evolution of rorquals. Furthermore, a genome-wide comparison of selection rates between large- and small-bodied baleen whales revealed a small set of conserved candidate genes with potential connections to cancer resistance.
Conclusions
Our results suggest that the evolution of rorquals is best described as a hard polytomy with a rapid radiation and high levels of introgression. The lack of shared positively selected genes between different large-bodied whale species supports a previously proposed convergent evolution of gigantism and hence cancer resistance in baleen whales.
Methods
Author for correspondence: Magnus Wolf (Magnus.Wolf@senckenberg.de)
The data deposited here are the result of a whole-genome sequencing project of the pygmy right whale (Caperea marginata, Gray 1846). Apart from the genome assembly, this project includes a phylogenomic revision of the rorqual clade and a positive selection analysis to find genes related to body size and hence cancer resistance in baleen whales. This deposition is composed of:
Code to create phylogenomic trees:
1.) A zip file including the main script written in UNIX bash as well as the necessary subscripts and an extensive README file containing necessary instructions. (filename: GEMOMA-to-Phylogeny.zip)
Genome data (Cmar):
1.) A raw whole genome assembly without changes made by NCBI in fasta format. (filename: Cmar_C18_SBIK-F_TBG_v1.fasta.gz)
2.) A homology-based genome annotation of the newly constructed genome, including a gff table and an amino acid fasta file. (filename gff: Cmar_C18_SBIK-F_TBG_v1_annotation.gff.gz; filename fasta: Cmar_C18_SBIK-F_TBG_v1_amino.fasta.gz)
3.) A functional annotation of found proteins via InterProScan in gff3 format. (filename: Cmar_C18_SBIK-F_TBG_v1_funct-annotation.gff3.gz)
Phylogenomic revision of rorquals (Phylo):
1.) A whole genome alignment (WGA) of publicly available baleen whale genome assemblies in “multiple alignment format” (maf). (filename: Phylo_SBIK-F_TBG_WGA_v1.maf.gz)
2.) A tar ball containing 46,941 quality filtered 20kbp long WGA fragments for consensus tree construction in fasta alignment format. (filename: Phylo_SBIK-F_TBG_WGA_v1_fragments.tar.gz)
3.) A tar ball containing 46,941 maximum likelihood trees, one for each WGA fragment, in newick format. (filename: Phylo_SBIK-F_TBG_WGA_v1_fragmenttrees.tar.gz)
4.) A tar ball containing 1774 Single Copy Orthologous Sequences (SCOS) collected to estimate branch lengths in fasta format. (filename: Phylo_SBIK-F_TBG_SCOS.tar.gz)
5.) A tar ball containing 563 quality filtered SCOS alignments in fasta format. (filename: Phylo_SBIK-F_TBG_SCOS_align.tar.gz)
6.) A tar ball containing 563 maximum likelihood trees, one for each SCOS, in newick format. (filename: Phylo_SBIK-F_TBG_SCOS_trees.tar.gz)
7.) A set of single nucleotide polymorphisms (SNPs) created from mapping all available short read data to an outgroup reference genome in vcf format. (filename: Phylo_SBIK-F_TBG_SNPs.vcf.gz)
Selection Analysis (Select):
1.) A tar ball containing 1326 SCOS collected only between whales important for our cancer testing together with the human reference genome GRCh38 in fasta format. (filename: Select_SBIK-F_TBG_SCOS.tar.gz)
2.) A tar ball containing 1326 SCOS trimmed alignments in ClustalN format. (filename: Select_SBIK-F_TBG_SCOS_clustaln.tar.gz)
3.) A tar ball containing 1326 SCOS alignments used for Ka/Ks calculation in AXT format. (filename: Select_SBIK-F_TBG_SCOS_axt.tar.gz)
Methods Annotation:
A de novo whole genome assembly was created from 10x Chromium linked short reads and was uploaded to NCBI (Genome: JANTQK000000000, BioSample: SAMN29592900). We then modeled repetitive sequences of the assembly with RepeatModeler v2 (www.repeatmasker.org) and combined the identified sequences with the Cetartiodactyla repeat database from RepBase (Jurka et al, 2005). RepeatMasker v4.1 (www.repeatmasker.org) was subsequently used to mask those combined sequences in the assembly. A homology-based genome annotation was conducted using the GeMoMa pipeline (Keilwagen et al, 2019) and proteome data from all publicly available repositories (details in the Supplement of the main manuscript). The resulting annotation was then functionally annotated with InterProScan v5 (Jones et al, 2014).
Methods Phylogenomics:
A WGA was created using the genome of the bottlenose dolphin (NCBI Genome: GCA_011762595.1 mTurTru1.mat.Y), following the overall workflow presented in Hecker and Hiller (2020); most of the respective tools are available on github.com (hillerlab/GENOMEALIGNMENTTOOLS). Please consult that paper for further details. The filtered alignment was used to create individual consensus sequences that were further trimmed using Bedtools (Quinlan, 2014). Fragments were subsequently created using the tools presented in Coimbra et al (2021). We further removed overly conserved and overly variable genes (the 5% lowest and 5% highest) based on the maximum likelihood distance calculated by IQTREE v.2.1.2 (Minh et al, 2020). For each fragment, a phylogenetic tree was constructed using IQTREE with 1000 bootstrap replicates.
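The 5% trimming step described above can be illustrated with a small rank-based sketch in Python. This is not the authors' script: the function name `trim_by_distance`, the fragment names, and the integer rounding rule are assumptions made for illustration, and the distances (calculated by IQTREE in the original workflow) are supplied here as plain numbers.

```python
def trim_by_distance(distances, percent=5):
    """Drop the lowest and highest `percent` of items ranked by distance.

    `distances` maps an item name (hypothetical, e.g. a fragment ID) to its
    maximum likelihood distance. Returns the retained names, sorted by
    ascending distance. Rounding down the number of dropped items is an
    assumption; the original cutoff rule may differ.
    """
    ranked = sorted(distances, key=lambda name: distances[name])
    n_drop = len(ranked) * percent // 100  # items removed at each end
    if n_drop == 0:
        return ranked
    return ranked[n_drop:len(ranked) - n_drop]


if __name__ == "__main__":
    # Toy example with 20 hypothetical fragments and made-up distances.
    toy = {f"frag{i}": 0.01 * i for i in range(20)}
    print(trim_by_distance(toy))
```

With 20 items and a 5% cutoff, one item is removed from each end of the ranking.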
To calculate branch lengths for our consensus tree, we compiled a set of SCOS using annotations created by the GeMoMa pipeline. A pipeline for generating SCOS datasets and running the downstream phylogenetic analysis can be found on github.com (mag-wolf/GEMOMA-to-Phylogeny). Within this pipeline, assemblies are re-annotated with GeMoMa, and the resulting annotations are used for ortholog calling with ORTHOFINDER v.2.5.2 (Emms and Kelly, 2019). SCOS were then aligned using MAFFT v.7.475 (Nakamura et al, 2018) and trimmed using CLIPKIT v.1.1.3 (Steenwyk et al, 2020) with the “-m kpic-smart-gap” flag. We also filtered SCOS alignments by maximum likelihood distance before using them in downstream analyses such as tree construction with IQTREE.
A set of high-quality SNPs was created by mapping available short read data of baleen whales to the bowhead whale genome (Keane et al, 2015) with BWA-mem v0.7.17-r1188 (http://bio-bwa.sourceforge.net). Variants were called by BCFtools v1.12 mpileup (Danecek et al, 2021) with the respective “-c” flag and minimal mapping- and base-quality cutoffs of 30. These genotypes were additionally filtered to remove sites with strongly deviating read coverage (>3-fold and <0.3-fold of the expected mean coverage) and sites with too high a proportion of missing data (5%) using BCFtools v1.12 filter (Danecek et al, 2021). The filtered vcf file, which still contained monomorphic sites, was used to generate biallelic SNPs with VCFTOOLS v.0.1.16 (Danecek et al, 2011). SNPs were pruned for linkage disequilibrium using the BCFTOOLS plugin “+prune”, applying an r2 = 0.9 cutoff.
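The logic of the coverage and missing-data filters can be sketched in plain Python. This is only an illustration of the thresholds named above, not a reimplementation of the BCFtools commands: the function names, the inclusive handling of the boundaries, and the reading of “5%” as a maximum allowed fraction of missing genotypes are assumptions.

```python
def coverage_ok(depth, mean_depth, low=0.3, high=3.0):
    """True if `depth` lies within [low, high] times the expected mean coverage."""
    return low * mean_depth <= depth <= high * mean_depth


def missing_fraction(genotypes):
    """Fraction of genotype strings (e.g. '0/1', './.') that contain a missing allele."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    missing = sum(1 for gt in genotypes if "." in gt)
    return missing / len(genotypes)


def site_passes(genotypes, depth, mean_depth, max_missing=0.05):
    """Combine both filters for one site (boundary handling is an assumption)."""
    return (coverage_ok(depth, mean_depth)
            and missing_fraction(genotypes) <= max_missing)


if __name__ == "__main__":
    # Toy site: 20 samples, all genotyped, depth equal to the expected mean.
    print(site_passes(["0/0"] * 20, depth=30, mean_depth=30.0))
```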
Methods Selection Analysis:
To collect as many informative orthologs as possible between the whale species included in the selection analysis, we re-ran the GEMOMA-to-Phylogeny pipeline as described above in the phylogenetic section. In doing so, we inferred SCOS between all candidate whales together with the human reference genome GRCh38. To ensure that alignments were constructed without frameshifts, we first translated nucleotide sequences to amino acid sequences using the EMBOSS v6.6.0.0 transeq tool (Madeira et al, 2022) before generating multiple sequence alignments with MAFFT. Amino acid alignments were then converted back to codon alignments using Pal2Nal v14 (Suyama et al, 2006) with the “-nogap” function to remove gaps as well as in-frame stop codons. To keep alignment errors from affecting downstream Ka/Ks analyses, we removed the alignments with the top 5% of genetic distances, using the maximum likelihood distance calculated by IQTree. Filtered codon alignments were then converted into axt files using AXTConverter (Wang et al, 2010).
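The idea behind converting an amino acid alignment back into a codon alignment can be shown with a minimal Python sketch. It is not Pal2Nal itself: the function names are hypothetical, each coding sequence is assumed to contain exactly one codon per aligned residue, and the removal of gap and stop columns only approximates what the “-nogap” option is described as doing above.

```python
def thread_codons(aligned_protein, cds):
    """Map the codons of `cds` onto an aligned protein; gaps become '---'."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(codons) != n_residues:
        raise ValueError("CDS length does not match the protein sequence")
    codon_iter = iter(codons)
    return [next(codon_iter) if aa != "-" else "---" for aa in aligned_protein]


def codon_alignment_nogap(aligned_proteins, cds_by_name):
    """Build a codon alignment and drop columns with a gap or stop ('*')."""
    threaded = {name: thread_codons(prot, cds_by_name[name])
                for name, prot in aligned_proteins.items()}
    n_cols = len(next(iter(aligned_proteins.values())))
    keep = [col for col in range(n_cols)
            if all(prot[col] not in "-*" for prot in aligned_proteins.values())]
    return {name: "".join(codons[col] for col in keep)
            for name, codons in threaded.items()}


if __name__ == "__main__":
    # Toy input with made-up sequence names and sequences.
    proteins = {"seqA": "MK-L", "seqB": "MKAL"}
    cds = {"seqA": "ATGAAACTG", "seqB": "ATGAAGGCTCTT"}
    print(codon_alignment_nogap(proteins, cds))
```

Because the alignment is built at the protein level and codons are only threaded onto it afterwards, gaps always span whole codons and the reading frame is preserved.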
References:
Coimbra RTF, Winter S, Kumar V, Koepfli K-P, Gooley RM, Dobrynin P, et al. Whole-genome analysis of giraffe supports four distinct species. Curr Biol. 2021;31(13):2929-2938.e5. doi:10.1016/j.cub.2021.04.033.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8. doi:10.1093/bioinformatics/btr330.
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. Gigascience 2021. doi:10.1093/gigascience/giab008.
Emms DM, Kelly S. OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20(1):238. doi:10.1186/s13059-019-1832-y.
Hecker N, Hiller M. A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. Gigascience 2020. doi:10.1093/gigascience/giz159.
Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–40. doi:10.1093/bioinformatics/btu031.
Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110(1-4):462–7. doi:10.1159/000084979.
Keane M, Semeiks J, Webb AE, Li YI, Quesada V, Craig T, et al. Insights into the evolution of longevity from the bowhead whale genome. Cell Rep. 2015;10(1):112–22. doi:10.1016/j.celrep.2014.12.008.
Keilwagen J, Hartung F, Grau J. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods Mol Biol. 2019;1962:161–77. doi:10.1007/978-1-4939-9173-0_9.
Madeira F, Pearce M, Tivey ARN, Basutkar P, Lee J, Edbali O, et al. Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res 2022. doi:10.1093/nar/gkac240.
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, Haeseler A von, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020;37(5):1530–4. doi:10.1093/molbev/msaa015.
Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34(14):2490–2. doi:10.1093/bioinformatics/bty121.
Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. 2014;47:11.12.1-34. doi:10.1002/0471250953.bi1112s47.
Steenwyk JL, Buida TJ, Li Y, Shen X-X, Rokas A. ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS Biol. 2020;18(12):e3001007. doi:10.1371/journal.pbio.3001007.
Suyama M, Torrents D, Bork P. PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34(Web Server issue):W609-12. doi:10.1093/nar/gkl315.
Wang D, Zhang Y, Zhang Z, Zhu J, Yu J. KaKs_Calculator 2.0: A Toolkit Incorporating Gamma-Series Methods and Sliding Window Strategies. Genomics, Proteomics & Bioinformatics. 2010;8(1):77–80. doi:10.1016/S1672-0229(10)60008-3.
Usage notes
General Usage:
Many files containing sequence data are compressed with gzip. Use “gunzip” to decompress them. Directories containing many sub-files are bundled into tar balls. Use “tar -xzvf” to unpack the directory first.
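As an alternative to unpacking everything on disk, the compressed files can also be read directly. The following minimal Python sketch uses only the standard library; the function names are illustrative, and any path passed to them is whatever location the files were downloaded to.

```python
import gzip
import tarfile


def iter_gzip_lines(path):
    """Yield the lines of a gzip-compressed text file without unpacking it."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield line.rstrip("\n")


def list_tar_members(path):
    """Return the names of all regular files inside a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]
```

For example, `len(list_tar_members("Phylo_SBIK-F_TBG_SCOS_trees.tar.gz"))` counts the files in that tar ball, assuming it sits in the current working directory.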
Usage Annotation Data:
The assembly as well as the cds and amino acid sequences are in standard fasta format and can be viewed with any text editor. The gene IDs within all these files are named after the best hit within one of the reference annotations used for the homology-based annotation.
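For programmatic access, fasta records can be read with a few lines of Python. The sketch below is a generic fasta reader, not part of the deposited code; it takes any iterable of text lines (for example an open file handle) and makes no assumption about how the gene IDs in the headers are structured.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Toy records with made-up headers.
    toy = [">gene1 example", "ACG", "T", ">gene2", "AA"]
    for name, seq in parse_fasta(toy):
        print(name, len(seq))
```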
Usage Phylogenomics Data:
The WGA is in maf format, while the WGA fragments and SCOS alignments are in fasta alignment format; all of them can be opened with any text editor. To better judge their quality, however, we recommend alignment viewing software like AliView (http://genocat.tools/tools/aliview.html). SCOS raw sequences are in regular fasta format and can be opened with any text editor. Within WGA sequences, headers are short, 6-character species identifiers derived from the scientific name. All SCOS gene IDs, and hence headers, are formed by first naming the source species and then the species from the reference annotation, separated by a “-” symbol. SNPs are contained within a vcf file that can be opened in any text editor. Trees, both WGA trees and SCOS trees, are in newick format and can be opened with any phylogenetic program. To view and annotate trees we recommend the ITOL webserver: https://itol.embl.de/.
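The header and tree conventions described above can be handled with simple string operations. The Python sketch below is illustrative only: the example identifiers are made up, the header is assumed to split at the first “-”, and the newick helper assumes unquoted leaf names.

```python
import re


def split_scos_header(header):
    """Split a SCOS header into (source species, reference species).

    Assumes the two parts are separated by the first '-' in the header.
    """
    source, sep, reference = header.lstrip(">").partition("-")
    if not sep:
        raise ValueError("header contains no '-' separator")
    return source, reference


def newick_leaf_names(newick):
    """Return leaf names of a newick string (unquoted labels assumed)."""
    # A leaf label directly follows '(' or ','; internal node labels follow ')'.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


if __name__ == "__main__":
    # Hypothetical identifiers, not taken from the deposited files.
    print(split_scos_header("spcAAA-spcBBB"))
    print(newick_leaf_names("((spcAAA:0.1,spcBBB:0.2)90:0.3,spcCCC:0.4);"))
```

Listing leaf names this way is a quick check of which species are present in each fragment or SCOS tree before loading the trees into a dedicated program.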
Usage Selection Analysis Data:
Raw SCOS sequences are in fasta format and can be opened with any text editor. Headers name the species first, followed by the gene ID, which usually describes the homology-based source organism. Trimmed alignments prepared for Ka/Ks calculation are in ClustalN format and can be opened with any text editor; AliView (http://genocat.tools/tools/aliview.html) is also able to open them. AXT alignment files were created specifically for use with the software KaKs_Calculator v2 (Wang et al, 2010). They can be opened with any text editor but were originally constructed only to identify non-synonymous and synonymous mutations.
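AXT files can also be inspected programmatically. The Python sketch below assumes the simple layout commonly used as KaKs_Calculator input, where each record consists of a name line followed by two aligned sequence lines and records are separated by blank lines; whether every deposited file follows exactly this layout is an assumption and should be checked against the files themselves.

```python
def parse_axt(lines):
    """Return a list of (name, seq1, seq2) records from AXT-formatted lines.

    Assumed layout: name line, two aligned sequences, blank line between records.
    """
    records, block = [], []
    for raw in list(lines) + [""]:  # trailing "" flushes the last record
        line = raw.strip()
        if line:
            block.append(line)
            continue
        if block:
            if len(block) != 3:
                raise ValueError(f"expected 3 lines per record, got {len(block)}")
            name, seq1, seq2 = block
            if len(seq1) != len(seq2):
                raise ValueError(f"sequences differ in length in record {name}")
            records.append((name, seq1, seq2))
            block = []
    return records


if __name__ == "__main__":
    # Toy records with made-up names and sequences.
    toy = ["pair1", "ATGAAA", "ATGAAG", "", "pair2", "ATG", "ATG"]
    for name, seq1, seq2 in parse_axt(toy):
        diffs = sum(a != b for a, b in zip(seq1, seq2))
        print(name, diffs)
```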
References:
Wang D, Zhang Y, Zhang Z, Zhu J, Yu J. KaKs_Calculator 2.0: A Toolkit Incorporating Gamma-Series Methods and Sliding Window Strategies. Genomics, Proteomics & Bioinformatics. 2010;8(1):77–80. doi:10.1016/S1672-0229(10)60008-3.