Novel Megaptera novaeangliae (Humpback whale) haplotype reference genome
Data files
Aug 19, 2024 version files 3.38 GB
-
GIU3625_Humpback_whale.protein.fasta.gz
7.70 MB
-
GIU3625_Humpback_whale.RepeatMasked.fasta.gz
889.90 MB
-
HAP1_final_assembly.agp
134.60 KB
-
HAP2_final_assembly.agp
104.89 KB
-
HumpbackWhale_Final_Genome_forNCBI.fasta
2.48 GB
-
name_chromosomes_both_haplotypes.txt
1.82 KB
-
README.md
5.47 KB
Nov 15, 2024 version files 3.62 GB
-
GIU3625_Humpback_whale.annotation.gff.gz
6.70 MB
-
GIU3625_Humpback_whale.busco_eukaryota_odb10.tar.gz
12.48 MB
-
GIU3625_Humpback_whale.busco_mammalia_odb10.tar.gz
144.21 MB
-
GIU3625_Humpback_whale.protein.fasta.gz
7.70 MB
-
GIU3625_Humpback_whale.RepeatMasked.fasta.gz
889.90 MB
-
GIU3625_Humpback_whale.RepeatMasked.gff.gz
63.98 MB
-
GIU3625_Humpback_whale.transcript.fasta.gz
11.92 MB
-
HAP1_final_assembly.agp
134.60 KB
-
HAP2_final_assembly.agp
104.89 KB
-
HumpbackWhale_Final_Genome_forNCBI.fasta
2.48 GB
-
methods.txt
2.06 KB
-
name_chromosomes_both_haplotypes.txt
1.82 KB
-
README.md
12.79 KB
Abstract
Sequencing of a kidney sample (KW2013002) from a stranded Megaptera novaeangliae (Humpback whale) calf produced the first chromosome-level reference genome for this species. The calf, a 457 cm and 2,500 lbs male, was found stranded in Hawai’i Kai, HI, in 2013 and was marked as abandoned/orphaned. In 2023, 1 g of kidney was sequenced with PacBio long-read DNA sequencing, chromatin conformation capture (Hi-C), RNA sequencing, and mitochondrial sequencing to comprehensively characterize the genome and transcriptome of M. novaeangliae. The reference genome was compared to the preexisting M. novaeangliae scaffold to determine assembly improvements. Data validation includes a synteny analysis, mitochondrial annotation, and a comparison of BUSCO scores (scaffold v. reference genome and Balaenoptera musculus (Blue whale) v. M. novaeangliae). BUSCO analysis was also performed on an M. novaeangliae scaffold-level assembly to gauge the relative completeness of the reference genome: the scaffold-level assembly scored 91.2%, versus 95.4% for the reference genome (Table I). Synteny analysis was performed using the B. musculus genome as comparison to determine chromosome-level coverage and structure. Further, a time-based phylogenetic tree was constructed using the sequenced data and publicly available genomes.
This dataset also contains the results of de novo repeat identification and gene annotation for the Humpback whale (Megaptera novaeangliae) genome. The repeat families were identified and classified using RepeatModeler, and gene prediction was conducted using AUGUSTUS and SNAP, incorporating coding sequences from related cetaceans. The resulting gene models were further refined using the MAKER pipeline, with protein evidence from Swiss-Prot and related species. tRNA genes were identified with tRNAscan-SE. The dataset includes the transcript sequences (GIU3625_Humpback_whale.transcript.fasta.gz), annotation file (GIU3625_Humpback_whale.annotation.gff.gz), and a methods file (methods.txt) detailing the bioinformatic processes.
README: Novel Megaptera novaeangliae (Humpback whale) haplotype reference genome
https://doi.org/10.5061/dryad.dv41ns271
Description of the data and file structure
Sequencing of a kidney sample (KW2013002) from a stranded Megaptera novaeangliae (Humpback whale) calf produced the first chromosome-level reference genome for this species. The calf, a 457 cm and 2,500 lbs male, was found stranded in Hawai’i Kai, HI, in 2013 and was marked as abandoned/orphaned. In 2023, 1 g of kidney was sequenced with PacBio long-read DNA sequencing, chromatin conformation capture (Hi-C), RNA sequencing, and mitochondrial sequencing to comprehensively characterize the genome and transcriptome of M. novaeangliae. The reference genome was compared to the preexisting M. novaeangliae scaffold to determine assembly improvements. Data validation includes a synteny analysis, mitochondrial annotation, and a comparison of BUSCO scores (scaffold v. reference genome and Balaenoptera musculus (Blue whale) v. M. novaeangliae). BUSCO analysis was also performed on an M. novaeangliae scaffold-level assembly to gauge the relative completeness of the reference genome: the scaffold-level assembly scored 91.2%, versus 95.4% for the reference genome (Table I). Synteny analysis was performed using the B. musculus genome as comparison to determine chromosome-level coverage and structure. Further, a time-based phylogenetic tree was constructed using the sequenced data and publicly available genomes.
Files and variables
File: HumpbackWhale_Final_Genome_forNCBI.fasta
Description: This file contains the final assembly of the Humpback whale genome, formatted for submission to the NCBI database. It was generated using hic-hifiasm followed by HiRise scaffolding, with PacBio reads and Omni-C reads as input. It includes the complete genomic sequence representing both haplotypes and has been curated for accuracy and completeness. Use this file for detailed genomic studies and comparisons. The file can be viewed and analyzed using genome browsers such as UCSC Genome Browser, Ensembl, or IGV (Integrative Genomics Viewer). Command-line tools like samtools and bedtools can also be used. Any text editor (e.g., Notepad++, Sublime Text, or VSCode) can be used to view this file, although at 2.48 GB it may be slow to open in an editor.
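Because the assembly is large, streaming it line by line is usually more practical than opening it in an editor. The following is a minimal Python sketch, not part of the original workflow, that reports the length of each sequence in a FASTA stream. The function names `open_text` and `fasta_lengths` and the demo record names are hypothetical and chosen only for illustration.

```python
import gzip
import io

def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def fasta_lengths(handle):
    """Yield (sequence_name, length) pairs from a FASTA stream, one line at a time."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length

# Tiny in-memory example; the record names are made up, not taken from the dataset.
demo = io.StringIO(">scaffold_1 demo\nACGTAC\nGT\n>scaffold_2\nNNNNACGT\n")
demo_lengths = dict(fasta_lengths(demo))
print(demo_lengths)
```

To run it on the real file, one could pass `open_text("HumpbackWhale_Final_Genome_forNCBI.fasta")` to `fasta_lengths`; the same helper also accepts the `.gz` FASTA files in this dataset.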
File: name_chromosomes_both_haplotypes.txt
This text file lists the names and identifiers of the chromosomes included in the Humpback whale genome assembly, covering both haplotypes. It is essential for reference when mapping genomic sequences to specific chromosomal locations in the assembly. Any text editor (e.g., Notepad++, Sublime Text, or VSCode) can be used to view this file. It can also be imported into spreadsheet software like Microsoft Excel for easier viewing and sorting.
File: GIU3625_Humpback_whale.protein.fasta.gz
This compressed FASTA file contains the predicted protein sequences derived from the Humpback whale genome (ID: GIU3625). These protein sequences are useful for functional annotation, comparative genomics, and evolutionary studies. To read and analyze this file, users can use tools like BLAST, EMBOSS, or any sequence alignment software. Decompression can be done with tools like gunzip or similar utilities.
File: HAP1_final_assembly.agp
This file describes the HAP1 haplotype of the Humpback whale genome assembly using the AGP (A Golden Path) format. It provides the layout of the scaffolds and contigs, mapping their positions and orientations. This file is crucial for understanding the structural organization of the HAP1 haplotype. To view and interpret this file, use text editors like Notepad++, or for more advanced processing, use tools like AGPtools or any genome assembly software.
File: HAP2_final_assembly.agp
Similar to the HAP1 file, this AGP file details the HAP2 haplotype of the Humpback whale genome assembly. It defines the scaffold and contig arrangement and should be used alongside the HAP1 file for comprehensive haplotype analysis. Similar to the HAP1 AGP file, this can be viewed with text editors or processed using AGPtools or genome assembly software.
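For programmatic inspection of the two AGP files, the sketch below parses the nine tab-separated columns defined by the AGP format and tallies components and gaps per object. It is only an illustration under stated assumptions: the helper names (`parse_agp`, `summarize`) and the demo lines are hypothetical, and the actual object and component names in HAP1_final_assembly.agp and HAP2_final_assembly.agp will differ.

```python
from collections import namedtuple

# Columns 1-5 are shared by all AGP lines; columns 6-9 depend on the component type.
AgpLine = namedtuple("AgpLine", "obj obj_beg obj_end part_number comp_type fields")

GAP_TYPES = {"N", "U"}  # N = gap of known size, U = gap of unknown size

def parse_agp(lines):
    """Yield one AgpLine per data line, skipping comments and blank lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
        yield AgpLine(cols[0], int(cols[1]), int(cols[2]), int(cols[3]),
                      cols[4], cols[5:9])

def summarize(records):
    """Count sequence components and gaps, and record the end coordinate, per object."""
    summary = {}
    for rec in records:
        entry = summary.setdefault(rec.obj, {"components": 0, "gaps": 0, "length": 0})
        if rec.comp_type in GAP_TYPES:
            entry["gaps"] += 1
        else:
            entry["components"] += 1
        entry["length"] = max(entry["length"], rec.obj_end)
    return summary

# Made-up example lines, not taken from the dataset.
demo_agp = [
    "# demo AGP\n",
    "chr1\t1\t1000\t1\tW\tcontig_1\t1\t1000\t+\n",
    "chr1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tproximity_ligation\n",
    "chr1\t1101\t1600\t3\tW\tcontig_2\t1\t500\t-\n",
]
demo_summary = summarize(parse_agp(demo_agp))
print(demo_summary)
```

With the real files, `summarize(parse_agp(open("HAP1_final_assembly.agp")))` would give a per-scaffold overview that can be compared between the two haplotypes.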
File: GIU3625_Humpback_whale.RepeatMasked.fasta.gz
This compressed FASTA file contains the repeat-masked version of the Humpback whale genome sequence. Repetitive elements have been masked to facilitate easier identification of unique sequences, which is important for downstream analyses such as gene annotation and variant discovery. The file can be viewed with browsers like UCSC Genome Browser, Ensembl, or IGV. Command-line tools such as RepeatMasker, samtools, and bedtools can be used for analysis. Decompression can be done with gunzip.
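The description does not state whether the masking is soft (lowercase bases) or hard (bases replaced by N), so a quick check can be useful before downstream analysis. The following Python sketch counts both; the function name `masked_fraction` and the demo sequence are hypothetical. Note that uppercase N can also represent assembly gaps, so the hard-masked figure is an upper bound.

```python
import io

def masked_fraction(handle):
    """Return total bases plus the fractions that are lowercase (soft-masked) or 'N'."""
    total = soft = hard = 0
    for line in handle:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        seq = line.strip()
        total += len(seq)
        soft += sum(1 for base in seq if base.islower())
        hard += seq.count("N")
    if total == 0:
        return {"total": 0, "soft": 0.0, "hard": 0.0}
    return {"total": total, "soft": soft / total, "hard": hard / total}

# Made-up example sequence, not taken from the dataset.
demo_masked = masked_fraction(io.StringIO(">s1\nACGTacgtNNNN\nacGT\n"))
print(demo_masked)
```

For the real file, the handle could be opened with `gzip.open("GIU3625_Humpback_whale.RepeatMasked.fasta.gz", "rt")` from the standard library.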
File: GIU3625_Humpback_whale.annotation.gff.gz
Description: This file contains gene annotation data in GFF format for the Humpback whale genome. It describes the predicted gene structures, including exons, coding sequences (CDS), and other genomic features. This file is essential for functional genomic studies and can be viewed with genome browsers like IGV, UCSC Genome Browser, or Ensembl. You can also use text editors or command-line tools such as grep for specific queries.
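As an alternative to grep, the sketch below tallies feature types (column 3 of a GFF file) in a single pass. It is a minimal illustration rather than part of the published workflow; the function name `count_feature_types` and the demo records are hypothetical, and the feature types actually present in the annotation file may differ.

```python
import io
from collections import Counter

def count_feature_types(handle):
    """Count how often each feature type (GFF column 3) occurs."""
    counts = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section, if present, ends the feature table
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed lines rather than guessing
        counts[cols[2]] += 1
    return counts

# Made-up example records, not taken from the dataset.
demo_gff = io.StringIO("\n".join([
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\tmaker\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chr1\tmaker\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1",
    "chr1\tmaker\tCDS\t150\t400\t.\t+\t0\tID=cds1;Parent=mrna1",
]) + "\n")
demo_counts = count_feature_types(demo_gff)
print(dict(demo_counts))
```

For the real file, the handle could be opened with `gzip.open("GIU3625_Humpback_whale.annotation.gff.gz", "rt")`.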
File: GIU3625_Humpback_whale.busco_eukaryota_odb10.tar.gz
Description: This file contains the results of BUSCO analysis using the eukaryota_odb10 dataset to assess the completeness of the Humpback whale genome. BUSCO scores indicate how many highly conserved orthologs were found, offering insights into the quality of the genome assembly. The file can be viewed with any text editor and analyzed further using BUSCO-related tools.
File: GIU3625_Humpback_whale.busco_mammalia_odb10.tar.gz
Description: Similar to the above, this file contains BUSCO results for the Humpback whale genome, but it uses the mammalia_odb10 dataset, providing a specific evaluation of the completeness based on mammalian orthologs.
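BUSCO reports its result as a one-line summary string with percentages for complete (C), single-copy (S), duplicated (D), fragmented (F), and missing (M) orthologs, plus the number of orthologs searched (n). The sketch below parses such a line in Python. The layout of files inside the two archives is not described here, so the function works on a string; the function name `parse_busco_summary` and the example line are hypothetical and do not reproduce results from this dataset.

```python
import re

# Matches the classic one-line BUSCO summary, e.g. "C:..%[S:..%,D:..%],F:..%,M:..%,n:..".
BUSCO_LINE = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)

def parse_busco_summary(text):
    """Extract BUSCO percentages from a summary line; return None if no match."""
    match = BUSCO_LINE.search(text)
    if match is None:
        return None
    complete, single, duplicated, fragmented, missing = map(float, match.groups()[:5])
    return {
        "complete": complete,
        "single": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
        "n": int(match.group(6)),
    }

# Illustrative line with made-up numbers, not a result from this dataset.
demo_busco = parse_busco_summary("C:90.0%[S:88.0%,D:2.0%],F:4.0%,M:6.0%,n:1000")
print(demo_busco)
```

The archives themselves can be unpacked with the standard `tarfile` module (`tarfile.open(path, "r:gz")`), after which any summary text files can be passed to this function.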
File: GIU3625_Humpback_whale.transcript.fasta.gz
Description: This compressed FASTA file contains the predicted transcript sequences for the Humpback whale genome, including both coding and non-coding RNA transcripts. The file can be decompressed using gunzip and viewed with text editors, or analyzed further with tools like STAR or BLAST for transcriptomic studies.
File: GIU3625_Humpback_whale.RepeatMasked.gff.gz
Description: This compressed GFF file details the locations and types of the repetitive elements that were masked in the Humpback whale genome sequence (see GIU3625_Humpback_whale.RepeatMasked.fasta.gz for the masked sequence itself). Knowing where repeats lie facilitates downstream analyses such as gene annotation or variant discovery by focusing on unique sequences. Decompression can be done using gunzip, and the file can be viewed in genome browsers like UCSC Genome Browser, IGV, or Ensembl.
Code/software
The data analyses were performed according to the manuals and protocols provided by the developers of the respective bioinformatics tools, and no custom code was used in the execution of the study. All software used in this work is publicly accessible and is described in the file descriptions above and the software list below.
Software Required to View the Data
The data included in this submission can be accessed and analyzed using several free and open-source bioinformatics tools. Below is a description of the key software packages used, including versions, necessary for viewing and working with the files:
JupiterPlot (v1.0)
- Purpose: Used for the synteny analysis.
- Description: JupiterPlot uses Circos-based consistency plots to map a given set of scaffolds to a reference genome.
RepeatModeler (v2.0.1)
- Purpose: Used for de novo repeat identification in the genome.
- Description: RepeatModeler automatically detects repeat families in the genome, which can then be used for repeat masking.
- Dependencies/Packages: RECON (v1.08) and RepeatScout (v1.0.6) are required dependencies.
- Access: Available at https://www.repeatmasker.org/RepeatModeler/.
RepeatMasker (v4.1.0)
- Purpose: Masks repetitive elements in the genome sequence identified by RepeatModeler.
- Description: RepeatMasker screens DNA sequences for interspersed repeats and low complexity regions.
- Access: Available at https://www.repeatmasker.org/.
AUGUSTUS (v2.5.5)
- Purpose: Used for ab initio gene prediction.
- Description: AUGUSTUS is a program that predicts genes in eukaryotic genomes based on coding sequences from related species.
- Access: Available at http://bioinf.uni-greifswald.de/augustus/.
SNAP (v2006-07-28)
- Purpose: Another tool for gene prediction.
- Description: SNAP is used for training gene models based on coding sequences from related species.
- Access: Available at http://korflab.ucdavis.edu/software.html.
STAR Aligner (v2.7)
- Purpose: Aligns RNA-Seq reads to the reference genome.
- Description: STAR is a splice-aware aligner used to map RNA sequences to the genome, essential for intron-exon boundary detection.
- Access: Available at https://github.com/alexdobin/STAR.
MAKER (version 2.31.10)
- Purpose: Combines evidence from multiple sources to generate high-quality gene predictions.
- Description: MAKER is a genome annotation pipeline that integrates ab initio predictions, RNA-Seq evidence, and protein homology to generate gene models.
- Access: Available at https://www.yandell-lab.org/software/maker.html.
BUSCO (v5.2.2)
- Purpose: Evaluates genome completeness.
- Description: BUSCO searches for highly conserved orthologs in the genome to determine the completeness of gene predictions.
- Access: Available at https://busco.ezlab.org/.
tRNAscan-SE (v2.05)
- Purpose: Used for tRNA gene prediction.
- Description: tRNAscan-SE identifies transfer RNA genes within the genome sequence.
- Access: Available at http://trna.ucsc.edu/tRNAscan-SE/.
BLAST
- Purpose: Compares gene and protein sequences to the UniProt database.
- Description: BLAST is a tool for searching nucleotide or protein databases to find regions of similarity.
- Access: Available at https://blast.ncbi.nlm.nih.gov/Blast.cgi.
Workflow and File Relationships
- RepeatModeler and RepeatMasker were used to identify and mask repetitive elements in the genome assembly, as represented in the GIU3625_Humpback_whale.RepeatMasked.fasta.gz and GIU3625_Humpback_whale.RepeatMasked.gff.gz files.
- AUGUSTUS and SNAP were used for gene prediction, with models trained on coding sequences from related species.
- STAR Aligner mapped RNA-seq reads to the genome, which provided hints for the gene prediction models.
- MAKER integrated gene predictions and evidence from protein databases, producing the final gene models.
- BUSCO was used to assess genome completeness, and the results are presented in GIU3625_Humpback_whale.busco_eukaryota_odb10.tar.gz and GIU3625_Humpback_whale.busco_mammalia_odb10.tar.gz.
- tRNAscan-SE identified tRNA genes within the genome.
All software mentioned is freely accessible, and the links provided lead to their respective download or information pages. No custom scripts were used in this analysis.
Access information
Other publicly accessible locations of the data:
- NCBI
Data was derived from the following sources:
- NIST sample
Version Change
November 2024: Additional files were uploaded to provide a more comprehensive dataset for the Humpback whale genome and its analysis. The RepeatMasked.fasta.gz file contains the genome sequence with repetitive elements masked to facilitate downstream analyses such as annotation and variant detection. The RepeatMasked.gff.gz file details the locations and types of these repetitive elements. The annotation.gff.gz file provides genome annotations, including features such as genes, exons, and regulatory elements, while the transcript.fasta.gz file contains RNA sequences derived from the annotated genes. To assess the quality and completeness of the genome assembly, BUSCO analyses were conducted using both eukaryota (busco_eukaryota_odb10.tar.gz) and mammalia-specific (busco_mammalia_odb10.tar.gz) ortholog databases. The results indicate the assembly's performance against universal and lineage-specific benchmarks. A methods.txt file accompanies these datasets, describing the methodologies used for genome assembly, annotation, and analyses.
Methods
Sample Information
A kidney sample (KW2013002) was collected from an M. novaeangliae calf on January 15, 2013, in Hawai’i Kai, HI, and deposited at the National Institute of Standards and Technology (NIST). The sample was not collected by the authors, so information regarding collection is limited to that presented herein. The calf, a 457 cm and 2,500 lbs male at the time of necropsy, was first observed on January 14, 2013, in shallow water and died between January 14 and January 15, 2013, via stranding. The calf was marked as abandoned/orphaned. In 2023, 1 g of KW2013002 was sampled for sequencing by Cantata Bio.
PacBio long-read DNA sequencing
Quantification of DNA samples was performed using the Qubit 2.0 Fluorometer. For the construction of the PacBio SMRTbell library, targeting an insert size of approximately 20kb, the SMRTbell Express Template Prep Kit 2.0 was employed following the manufacturer's recommended protocol and default settings. The library was subsequently prepared for sequencing by binding to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto the PacBio Sequel II system. Sequencing was executed using PacBio Sequel II 8M SMRT cells to ensure comprehensive coverage and high-quality reads.
Quality control of the extracted DNA was performed using NanoDrop and gel electrophoresis. Quality control of the Omni-C library was done using the Hifiasm draft assembly and showed a high amount of long-range linkage reads. The Omni-C sequencing data was also quality controlled to examine Q30%, and the quality score matched the Illumina standard. The scaffolding algorithm HiRise also has a built-in quality control that uses only reads with a map score of over 40.
For the Omni-C library, chromatin was fixed in situ within the nucleus using formaldehyde, followed by digestion with DNase I. The processed chromatin had its ends repaired and was then ligated to a biotinylated bridge adapter, facilitating proximity ligation of adapter-containing ends. After proximity ligation, the crosslinks were reversed and the DNA was purified. The purified DNA was then treated to eliminate any non-internal biotin. The sequencing libraries were prepared using NEBNext Ultra enzymes and Illumina-compatible adapters, with biotin-containing fragments isolated using streptavidin beads before PCR enrichment. Sequencing was performed on an Illumina HiSeqX platform to achieve approximately 30x coverage.
Contig assembling and scaffolding
The de novo assembly process utilized PacBio CCS reads and Omni-C reads as input for HiC-Hifiasm, employing default parameters. This approach facilitated the generation of a separate de novo assembly for each haplotype, enhancing the accuracy and integrity of the genomic reconstruction.
The scaffolding phase involved the integration of the de novo assembly with Dovetail Omni-C library reads through HiRise, a software pipeline tailored for scaffolding genome assemblies using proximity ligation data. Omni-C library sequences were aligned to the draft assembly using bwa, and the mapped read pairs were analyzed by HiRise to construct a likelihood model for genomic distance (see Figure S1). This model, along with additional information from the synteny analysis (see below), informed the identification and correction of misjoins, the scoring of potential joins, and the execution of joins exceeding a defined confidence threshold.
Synteny analysis
The newly assembled M. novaeangliae scaffolds were mapped to the B. musculus whole genome (GenBank GCA_009873245.3) in order to map the synteny between the two species [9,10]. A synteny analysis was performed using JupiterPlot 1.0 [11], a software tool that uses Circos-based consistency plots to map a given set of scaffolds to a reference genome.
RNA sequencing
Total RNA was extracted employing the QIAGEN RNeasy Plus Kit, adhering to the manufacturer's instructions. Quantification of RNA involved the Qubit RNA Assay and the TapeStation 4200 system. Before library preparation, DNase treatment was applied, followed by AMPure bead cleanup and rRNA depletion using QIAGEN FastSelect -HMR. The NEBNext Ultra II RNA Library Prep Kit was used for library preparation per the manufacturer's protocols. Sequencing of the prepared libraries was conducted on the NovaSeq 6000 platform, utilizing a 2 x 150 bp configuration to ensure comprehensive transcriptome coverage.
Repeat Analysis
This dataset was derived from a Humpback whale (Megaptera novaeangliae) genome assembly. The repeat families found in the genome were identified de novo using RepeatModeler (v2.0.1), which relies on RECON (v1.08) and RepeatScout (v1.0.6). The custom repeat library generated from RepeatModeler was then used to discover, identify, and mask the repeats in the assembly using RepeatMasker (v4.1.0).
Gene prediction was performed using the AUGUSTUS software (v2.5.5) with six rounds of optimization. Coding sequences from related cetacean species, including Balaenoptera acutorostrata, Balaenoptera musculus, Balaenoptera ricei, Megaptera novaeangliae, and Orcinus orca, were used to train the ab initio models for gene prediction. Additionally, the SNAP software (v2006-07-28) was trained using the same coding sequences to build a separate gene prediction model.
RNA-seq reads were mapped to the genome using the STAR aligner (v2.7), and intron hints were generated using the bam2hints tool within AUGUSTUS. MAKER was then employed to integrate the predictions from AUGUSTUS and SNAP, combining this information with peptide evidence from the UniProt database and protein sequences from related cetacean species. Only gene models predicted by both AUGUSTUS and SNAP were retained in the final dataset. Annotation Edit Distance (AED) scores were generated for each predicted gene as part of the MAKER pipeline to assess the accuracy of the predictions.
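AED ranges from 0 (the gene model agrees perfectly with the aligned evidence) to 1 (no evidence support). MAKER normally writes the score as an `_AED` attribute on mRNA features in its GFF output. The Python sketch below extracts those values, assuming the annotation file in this dataset retains that attribute, which is not confirmed here. The helper names and demo records are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def aed_scores(lines):
    """Map mRNA ID -> AED score for every mRNA feature carrying an _AED attribute."""
    scores = {}
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs:  # assumption: MAKER-style attribute name
            scores[attrs.get("ID", "")] = float(attrs["_AED"])
    return scores

# Made-up example records, not taken from the dataset.
demo_lines = [
    "##gff-version 3\n",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1;_AED=0.10\n",
    "chr1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=mrna2;Parent=gene2;_AED=0.75\n",
]
demo_aed = aed_scores(demo_lines)
# Fraction of transcripts with AED <= 0.5, a commonly used rule of thumb.
well_supported = sum(1 for v in demo_aed.values() if v <= 0.5) / len(demo_aed)
print(demo_aed, well_supported)
```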
Finally, tRNA genes were identified using the tRNAscan-SE software (v2.05).