We present a comprehensive annotation of the humpback whale (Megaptera novaeangliae) genome, representing the most complete and biologically coherent gene-model resource available to date for this species. Using R (run in RStudio v2025.05.1+513) within a fully reproducible analysis pipeline, we evaluated 1,308,486 genomic features, including 21,833 protein-coding genes and 386,765 annotated exons. The structural features of the annotation are consistent with mammalian expectations (median gene length ≈ 4.3 kb; intron lengths peaking in the 1–10 kb range), and coding-sequence integrity is high (99.7 % of transcripts are in-frame). Gene loci are distributed non-uniformly across scaffolds, with densities ranging from ~7 to ~15 genes per megabase, and gene models cover ~25–30 % of the assembled genome sequence. We further identified ~2,000 multi-copy gene families and numerous tandem duplication clusters (2–12 members each). These results indicate that the M. novaeangliae annotation is both structurally complete and functionally rich, providing a robust foundation for future comparative, physiological, and conservation-genomics investigations.

1. General Information

Submission Title: Genome annotation and structural validation of the humpback whale Megaptera novaeangliae
Contact Information: Maria-Vittoria G. Carminati (m.carminati@deepbiotech.com)
Dataset Overview: This dataset contains the primary Gene Transfer Format (GTF) annotation and derived validation files for the Megaptera novaeangliae genome. These files provide the evidence for the quality and structure of 21,833 protein-coding genes and 386,765 exons.

2. Terminology and Units for New Users

Base pair (bp): The fundamental unit of DNA length. All genomic coordinates and lengths in this dataset are in bp.
Megabase (Mb): One million base pairs.
Scaffold / seqnames: A scaffold is a DNA sequence fragment of the genome assembly; seqnames is the unique identifier of that scaffold in the data files.
Isoform / Transcript: Alternative versions of mRNA produced from a single gene.
Intergenic Space: The DNA sequence located between two adjacent genes.
Coding Sequence (CDS): The portion of a transcript that is translated into protein.

3. Primary Data Record: Megaptera_novaeangliae_annotated.gtf

Description: The primary structural annotation file describing the location of genes and exons.
Format: Tab-separated (GTF 2.2).
Variables (Columns):
1. seqname: Name of the genomic scaffold.
2. source: Prediction software (BRAKER3).
3. feature: Element type (gene, transcript, exon, CDS, start_codon, stop_codon).
4. start: 1-based start coordinate in bp.
5. end: 1-based end coordinate in bp (inclusive).
6. score: Confidence value (represented as "." if not applicable).
7. strand: Orientation on the DNA (+ or -).
8. frame: Reading frame (0, 1, 2, or .).
9. attribute: Semicolon-separated key-value tags including gene_id, transcript_id, and gene_name.
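
For readers who want to load the GTF programmatically, the column layout above can be sketched in Python as follows. This is a minimal illustration and not the authors' code (their analysis was done in R). The example record, including the names scaffold_1, g1, and g1.t1, is hypothetical, and the attribute parser assumes that values contain no embedded semicolons.

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attribute = fields

    # Attributes look like: gene_id "g1"; transcript_id "g1.t1";
    attrs = {}
    for item in attribute.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),          # 1-based
        "end": int(end),              # inclusive
        "score": None if score == "." else float(score),
        "strand": strand,
        "frame": None if frame == "." else int(frame),
        "attributes": attrs,
    }


# Hypothetical record, for illustration only.
example = ('scaffold_1\tBRAKER3\tCDS\t101\t250\t.\t+\t0\t'
           'gene_id "g1"; transcript_id "g1.t1";')
record = parse_gtf_line(example)
# Feature length uses end - start + 1 because both coordinates are inclusive.
print(record["feature"], record["end"] - record["start"] + 1,
      record["attributes"]["gene_id"])   # prints: CDS 150 g1
```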

4. Derived Tabular Data (CSV and TXT)

00_summary.txt

Summary of total gene and transcript counts.

01_multicopy_gene_families.csv

gene_name: (String) Standardized name of the gene family.
n: (Integer) Total count of gene copies identified.

02_transcripts_per_gene_derived.csv

gene_id: (String) Unique gene identifier.
n_tx: (Integer) Number of isoforms per gene.

03_exons_per_transcript.csv

transcript_id: (String) Unique transcript identifier.
n_exons: (Integer) Total exons in the transcript.

04_genes_per_seq_all.csv

seqnames: (String) Scaffold name.
n_genes: (Integer) Total genes on the scaffold.

05_gene_density_per_seq.csv

seqnames: (String) Scaffold name.
approx_len: (Integer) Length of scaffold in bp.
n_genes: (Integer) Total gene count.
gene_density_perMb: (Float) Density expressed as (n_genes / approx_len * 1,000,000).
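
The density formula above can be checked with a few lines of Python. This is an illustrative sketch, not the authors' R code, and the input values are made up. The multiplication is done before the division, which is mathematically the same as the formula in the column description and keeps the intermediate value an exact integer.

```python
def gene_density_per_mb(n_genes, approx_len):
    """Genes per megabase: n_genes / approx_len * 1,000,000."""
    if approx_len <= 0:
        raise ValueError("approx_len must be a positive length in bp")
    return n_genes * 1_000_000 / approx_len


# Hypothetical scaffold: 150 genes on 12.5 Mb.
print(gene_density_per_mb(150, 12_500_000))   # prints: 12.0
```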

06_intergenic_distances_primary.csv

gene_id: (String) Gene ID (or NA for scaffold start regions).
seqnames: (String) Scaffold name.
start / end: (Integer) Coordinates in bp.
intergenic: (Integer) Distance to the preceding gene in bp.
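
As an illustration of how a distance to the preceding gene can be derived from gene coordinates, the Python sketch below sorts genes per scaffold and subtracts the previous end from the current start. It is not the authors' implementation. It assumes that the distance counts only the bases strictly between two genes (start - previous end - 1); the dataset may use a different convention, and its handling of overlapping genes is not documented here. In this sketch overlapping genes yield a negative value, and the first gene on a scaffold gets None.

```python
def intergenic_distances(genes):
    """genes: iterable of (gene_id, seqname, start, end), 1-based inclusive.

    Returns (gene_id, seqname, start, end, intergenic) tuples sorted by
    scaffold and position. intergenic is None for the first gene on a scaffold.
    """
    rows = []
    furthest_end = {}  # largest gene end seen so far, per scaffold
    for gene_id, seqname, start, end in sorted(genes, key=lambda g: (g[1], g[2], g[3])):
        if seqname in furthest_end:
            gap = start - furthest_end[seqname] - 1  # assumed convention
        else:
            gap = None
        rows.append((gene_id, seqname, start, end, gap))
        furthest_end[seqname] = max(end, furthest_end.get(seqname, 0))
    return rows


# Hypothetical genes, for illustration only.
demo = [("g2", "s1", 301, 400), ("g1", "s1", 100, 200), ("g3", "s2", 50, 80)]
for row in intergenic_distances(demo):
    print(row)
```

Tracking the furthest end rather than the immediately preceding end avoids reporting a spurious gap after a short gene nested inside a longer one.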

07_gene_density_1Mb_windows.csv

seqnames: (String) Scaffold name.
bin: (Integer) The starting coordinate of the 1,000,000 bp window.
genes_per_Mb: (Integer) Gene count within that specific window.
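
Window counts of this kind can be reproduced with a short Python sketch. Two assumptions are made here that the file description does not confirm: each gene is assigned to a window by its start coordinate, and windows begin at 0, 1,000,000, 2,000,000, and so on. The authors' R code may differ on both points.

```python
from collections import Counter

WINDOW = 1_000_000  # window size in bp


def genes_per_window(genes, window=WINDOW):
    """genes: iterable of (seqname, start). Returns Counter keyed by (seqname, bin)."""
    counts = Counter()
    for seqname, start in genes:
        bin_start = (start // window) * window  # assumed binning rule
        counts[(seqname, bin_start)] += 1
    return counts


# Hypothetical gene starts, for illustration only.
demo = [("s1", 10), ("s1", 999_999), ("s1", 1_000_000), ("s2", 5)]
for (seqname, bin_start), n in sorted(genes_per_window(demo).items()):
    print(seqname, bin_start, n)
```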

08_tandem_duplication_clusters.csv

gene_name: (String) Name of the duplicated gene family.
cluster: (Integer) Numerical ID of the cluster.
size: (Integer) Number of genes in the cluster.
span_bp: (Integer) Total genomic length covered by the cluster in bp.

09_qc_cds_mod3.txt

Technical report verifying that Coding Sequence (CDS) lengths are multiples of 3.
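
A check of this kind can be sketched in Python as follows. The sketch assumes that the test is applied to the summed length of all CDS segments of a transcript, which is one common way to define an in-frame transcript; the report itself may have been produced differently. Transcript IDs and coordinates are hypothetical.

```python
from collections import defaultdict


def cds_in_frame(cds_records):
    """cds_records: iterable of (transcript_id, start, end), 1-based inclusive.

    Returns {transcript_id: True if total CDS length is a multiple of 3}.
    """
    totals = defaultdict(int)
    for transcript_id, start, end in cds_records:
        totals[transcript_id] += end - start + 1
    return {tx: total % 3 == 0 for tx, total in totals.items()}


# Hypothetical CDS segments: t1 totals 150 bp, t2 totals 100 bp.
demo = [("t1", 1, 90), ("t1", 201, 260), ("t2", 1, 100)]
status = cds_in_frame(demo)
fraction = sum(status.values()) / len(status)
print(status, fraction)
```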

exon_lengths.csv / intron_lengths.csv

exon_len / intron_len: (Integer) Length of the individual feature in bp.
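
For orientation, exon and intron lengths can be derived from the exon coordinates of a single transcript as sketched below in Python. The sketch assumes 1-based inclusive coordinates, as in the GTF, and defines an intron as the bases strictly between two consecutive exons; it is an illustration rather than the authors' R code, and the coordinates are made up.

```python
def exon_lengths(exons):
    """exons: list of (start, end) for one transcript, 1-based inclusive."""
    return [end - start + 1 for start, end in sorted(exons)]


def intron_lengths(exons):
    """Lengths of the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Hypothetical three-exon transcript.
demo = [(100, 200), (301, 400), (1001, 1100)]
print(exon_lengths(demo))     # prints: [101, 100, 100]
print(intron_lengths(demo))   # prints: [100, 600]
```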

overlapping_gene_pairs.csv

geneA / geneB: (String) Overlapping Gene IDs.
pair_type: (Category) same-strand or opposite-strand.
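
The pair_type categories can be illustrated with a small Python function. It treats two genes as overlapping when they lie on the same scaffold and their inclusive coordinate ranges share at least one base, which is an assumption about how overlaps were called in this dataset. Gene tuples are hypothetical.

```python
def pair_type(gene_a, gene_b):
    """Each gene is (seqname, start, end, strand), 1-based inclusive.

    Returns "same-strand", "opposite-strand", or None if the genes do not overlap.
    """
    seq_a, start_a, end_a, strand_a = gene_a
    seq_b, start_b, end_b, strand_b = gene_b
    if seq_a != seq_b:
        return None
    if start_a > end_b or start_b > end_a:  # no shared base
        return None
    return "same-strand" if strand_a == strand_b else "opposite-strand"


# Hypothetical genes, for illustration only.
print(pair_type(("s1", 100, 500, "+"), ("s1", 450, 900, "-")))  # prints: opposite-strand
print(pair_type(("s1", 100, 500, "+"), ("s1", 501, 900, "+")))  # prints: None
```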

genes_near_scaffold_edges.csv

dist_to_end: (Integer) Distance to the nearest scaffold end in bp.
near_edge: (Boolean) TRUE if the gene lies near a scaffold end; FALSE otherwise.
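
The two columns above can be related to gene coordinates with the following Python sketch. The distance is taken as the smaller of the distances from the gene to either scaffold end. The cutoff that turns this distance into near_edge is not stated in this record, so the threshold argument is a hypothetical parameter that the caller must supply; the scaffold length is also assumed to be known.

```python
def dist_to_end(start, end, scaffold_len):
    """Distance in bp from a gene (1-based inclusive) to the nearest scaffold end."""
    return min(start - 1, scaffold_len - end)


def near_edge(start, end, scaffold_len, threshold):
    """True if the gene is within `threshold` bp of a scaffold end.

    `threshold` is a hypothetical parameter; the dataset's cutoff is not documented here.
    """
    return dist_to_end(start, end, scaffold_len) <= threshold


# Hypothetical gene on a 1,000,000 bp scaffold, with an arbitrary 10 kb cutoff.
print(dist_to_end(5_001, 8_000, 1_000_000))          # prints: 5000
print(near_edge(5_001, 8_000, 1_000_000, 10_000))    # prints: True
```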

5. Visualizations (Figures)

The following PNG files provide visual validation of the metrics described in the tabular files:

Fig_ContigEdge_Genes_Clean.png: Distribution of genes relative to scaffold ends.
Fig_ExonLength_Log.png: Log-transformed distribution of exon lengths.
Fig_ExonLength_p995.png: Distribution of exon lengths, shown up to the 99.5th percentile.
Fig_Exons_per_Transcript_Clean.png: Frequency of exons per transcript model.
Fig_Gene_Density_per_Seq_Primary_Clean.png: Global gene density per scaffold.
Fig_GeneDensity_1Mb_Faceted_Clean.png: Faceted view of gene density across windows.
Fig_GeneLength_ByTop10Seq.png: Gene length comparisons across the 10 largest scaffolds.
Fig_GeneLength_Distribution_p995.png: Global distribution of gene lengths.
Fig_GeneOverlaps_Clean.png: Visualization of overlapping gene models.
Fig_Genes_per_Seq_Top50.png: Gene counts for the 50 largest scaffolds.
Fig_Intergenic_Dist_p995.png: Distribution of distances between adjacent genes.
Fig_IntronLength_Log.png: Log-scaled distribution of intron lengths.
Fig_IntronLength_p995.png: Distribution of intron lengths, shown up to the 99.5th percentile.
Fig_Isoform_Classes.png: Classification of transcript isoforms.
Fig_Lorenz_Genes_Clean.png: Lorenz curve showing the distribution of genes across scaffolds.
Fig_Multicopy_Top25_Lollipop.png: Top 25 most frequent multicopy gene families.
Fig_RankSize_GeneCounts_Clean.png: Rank-size distribution of gene counts per scaffold.
Fig_Transcripts_per_Gene_Clipped.png: Count of transcripts per gene (clipped to show distribution).

6. Code/Software

Software Needed to View Data:

Tabular Data: Text editors or LibreOffice Calc.
Annotation (.gtf): Integrative Genomics Viewer (IGV) or JBrowse 2.

Software Used for Analysis:

BRAKER3 (v3.0.8): Gene prediction pipeline, run with RNA-seq evidence (SRA accessions SRR28920143, SRR8388975, SRR2183423) and OrthoDB v12 protein evidence.
R (run in RStudio v2025.05.1+513): Data processing, statistical analysis, and figure generation.
R Packages: tidyverse (dplyr, ggplot2), GenomicRanges.

Workflow: BRAKER3 generated the primary GTF. R was then used to parse the GTF, calculate structural statistics (lengths, densities, distances), and generate the validation figures listed above.
