Annotation of the novel humpback whale (Megaptera novaeangliae) reference genome
Data files
Jan 12, 2026 version, files total 26.34 MB
- 00_summary.txt (94 B)
- 01_multicopy_gene_families.csv (22.15 KB)
- 02_transcripts_per_gene_derived.csv (330.95 KB)
- 03_exons_per_transcript.csv (605.15 KB)
- 04_genes_per_seq_all.csv (14.85 KB)
- 05_gene_density_per_seq.csv (31.55 KB)
- 06_intergenic_distances_primary.csv (1.72 MB)
- 07_gene_density_1Mb_windows.csv (69.03 KB)
- 08_tandem_duplication_clusters.csv (652.43 KB)
- 09_qc_cds_mod3.txt (43 B)
- exon_lengths.csv (15.50 MB)
- genes_near_scaffold_edges.csv (2.38 MB)
- intron_lengths.csv (4.80 MB)
- overlapping_gene_pairs.csv (200.46 KB)
- README.md (6.62 KB)
Abstract
We present a comprehensive annotation of the humpback whale (Megaptera novaeangliae) genome, representing the most complete and biologically coherent gene-model resource available to date for this species. Using R (RStudio v2025.05.1+513) within a fully reproducible analysis pipeline, we evaluated 1,308,486 genomic features, including 21,833 protein-coding genes and 386,765 annotated exons. The structural features of the annotation are consistent with mammalian expectations (median gene length ≈ 4.3 kb; intron lengths peaking in the 1–10 kb range) and show high coding-sequence integrity (99.7% of transcripts are in-frame). Gene loci are distributed non-uniformly across scaffolds, with densities ranging from ~7 to ~15 genes per megabase, and cover ~25–30% of the assembled genome sequence. We further identified ~2,000 multi-copy gene families and numerous tandem duplication clusters (2–12 members each). These results demonstrate that the M. novaeangliae genome is both structurally complete and functionally rich, providing a robust foundation for future comparative, physiological, and conservation-genomics investigations.
1. General Information
- Submission Title: Genome annotation and structural validation of the humpback whale Megaptera novaeangliae
- Contact Information: Maria-Vittoria G. Carminati (m.carminati@deepbiotech.com)
- Dataset Overview: This dataset contains the primary Gene Transfer Format (GTF) annotation and derived validation files for the Megaptera novaeangliae genome. These files provide the evidence for the quality and structure of 21,833 protein-coding genes and 386,765 exons.
2. Terminology and Units for New Users
- Base pair (bp): The fundamental unit of DNA length. All genomic coordinates and lengths in this dataset are in bp.
- Megabase (Mb): One million base pairs.
- Scaffold / seqnames: Unique identifiers for the DNA sequence fragments in the genome assembly.
- Isoform / Transcript: Alternative versions of mRNA produced from a single gene.
- Intergenic Space: The DNA sequence located between two adjacent genes.
- Coding Sequence (CDS): The portion of a transcript that is translated into protein.
3. Primary Data Record: Megaptera_novaeangliae_annotated.gtf
- Description: The primary structural annotation file describing the location of genes and exons.
- Format: Tab-separated (GTF 2.2).
- Variables (Columns):
  - seqname: Name of the genomic scaffold.
  - source: Prediction software (BRAKER3).
  - feature: Element type (gene, transcript, exon, CDS, start_codon, stop_codon).
  - start / end: 1-based coordinates in bp.
  - score: Confidence value (represented as "." if not applicable).
  - strand: Orientation on the DNA ("+" or "-").
  - frame: Reading frame (0, 1, 2, or ".").
  - attribute: Semicolon-separated tags including gene_id, transcript_id, and gene_name.
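For readers who want to inspect the GTF without a genome browser, the nine columns listed above can be read with a few lines of Python. This is a minimal sketch, not the authors' R pipeline: the function name `parse_gtf_line` and the example record (scaffold name, IDs, gene name) are made up for illustration, and the parser assumes every attribute is a `key "value"` pair, which may not hold for every line of a BRAKER3 GTF.

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine columns and parse the attribute tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attribute = fields

    # Attributes are assumed to look like: key "value"; key "value";
    attrs = {}
    for tag in attribute.split(";"):
        tag = tag.strip()
        if not tag:
            continue
        key, _, value = tag.partition(" ")
        attrs[key] = value.strip().strip('"')

    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "frame": frame,
        "attributes": attrs,
    }


# Hypothetical record; the scaffold name and identifiers are illustrative only.
example = (
    "scaffold_1\tBRAKER3\texon\t1001\t1200\t.\t+\t.\t"
    'gene_id "g1"; transcript_id "g1.t1"; gene_name "EXAMPLE";'
)
record = parse_gtf_line(example)
feature_length = record["end"] - record["start"] + 1  # inclusive coordinates
```

Because start and end are 1-based and inclusive, the feature length is end - start + 1.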
4. Derived Tabular Data (CSV and TXT)
00_summary.txt
Summary of total gene and transcript counts.
01_multicopy_gene_families.csv
- gene_name: (String) Standardized name of the gene family.
- n: (Integer) Total count of gene copies identified.
02_transcripts_per_gene_derived.csv
- gene_id: (String) Unique gene identifier.
- n_tx: (Integer) Number of isoforms per gene.
03_exons_per_transcript.csv
- transcript_id: (String) Unique transcript identifier.
- n_exons: (Integer) Total exons in the transcript.
04_genes_per_seq_all.csv
- seqnames: (String) Scaffold name.
- n_genes: (Integer) Total genes on the scaffold.
05_gene_density_per_seq.csv
- seqnames: (String) Scaffold name.
- approx_len: (Integer) Length of scaffold in bp.
- n_genes: (Integer) Total gene count.
- gene_density_perMb: (Float) Density expressed as (n_genes / approx_len * 1,000,000).
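The density formula can be checked directly against any row of the table. The sketch below simply restates it in Python; the function name and the example numbers are illustrative and are not taken from the dataset.

```python
def gene_density_per_mb(n_genes, approx_len):
    """Genes per megabase: n_genes / approx_len * 1,000,000."""
    if approx_len <= 0:
        raise ValueError("approx_len must be a positive length in bp")
    return n_genes / approx_len * 1_000_000


# Illustrative values only: 1,200 genes on a 100 Mb scaffold.
density = gene_density_per_mb(n_genes=1200, approx_len=100_000_000)
print(f"{density:.2f} genes per Mb")
```

With these made-up inputs the result is about 12 genes per Mb, which falls inside the ~7 to ~15 range reported in the abstract.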
06_intergenic_distances_primary.csv
- gene_id: (String) Gene ID (or NA for scaffold start regions).
- seqnames: (String) Scaffold name.
- start / end: (Integer) Coordinates in bp.
- intergenic: (Integer) Distance to the preceding gene in bp.
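One plausible way to derive a distance to the preceding gene is sketched below. It rests on assumptions the README does not state: the distance is counted as the number of bases strictly between two genes (start - previous end - 1), overlapping genes get a distance of 0, and the first gene on a scaffold gets no value. The released table may use different conventions, in particular for scaffold start regions.

```python
def intergenic_distances(genes):
    """Distance in bp from each gene to the preceding gene on one scaffold.

    genes: list of (gene_id, start, end) with 1-based inclusive coordinates.
    Returns a list of (gene_id, distance); the first gene gets None.
    """
    result = []
    prev_end = None
    for gene_id, start, end in sorted(genes, key=lambda g: (g[1], g[2])):
        if prev_end is None:
            distance = None  # no preceding gene on this scaffold
        else:
            # Assumption: count the bases strictly between the two genes,
            # and report 0 when the genes overlap.
            distance = max(0, start - prev_end - 1)
        result.append((gene_id, distance))
        # Track the furthest end seen so nested genes do not shrink the gap.
        prev_end = end if prev_end is None else max(prev_end, end)
    return result


# Hypothetical genes on a single scaffold.
distances = intergenic_distances([
    ("g2", 801, 1200),
    ("g1", 100, 500),
    ("g3", 1100, 1500),
])
```

Here g2 starts 300 bp after g1 ends, and g3 overlaps g2, so its distance is reported as 0.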
07_gene_density_1Mb_windows.csv
- seqnames: (String) Scaffold name.
- bin: (Integer) The starting coordinate of the 1,000,000 bp window.
- genes_per_Mb: (Integer) Gene count within that specific window.
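Windowed counts of this kind can be reproduced by assigning each gene to a fixed-width bin. The sketch below assumes that a gene is assigned by its start coordinate and that bins are labelled 0, 1,000,000, 2,000,000, and so on; the README does not say whether the released bins start at 0 or 1, so treat the labels as illustrative.

```python
from collections import Counter

WINDOW = 1_000_000  # window width in bp


def genes_per_window(gene_starts, window=WINDOW):
    """Count genes per fixed-width window on one scaffold.

    gene_starts: iterable of gene start coordinates in bp.
    Returns {window_start: gene_count}, ordered by window_start.
    """
    counts = Counter((start // window) * window for start in gene_starts)
    return dict(sorted(counts.items()))


# Hypothetical gene start positions on one scaffold.
windows = genes_per_window([10, 999_999, 1_000_000, 2_500_000])
```

Windows that contain no genes are absent from the result; whether the released table lists empty windows explicitly is not stated.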
08_tandem_duplication_clusters.csv
- gene_name: (String) Name of the duplicated gene family.
- cluster: (Integer) Numerical ID of the cluster.
- size: (Integer) Number of genes in the cluster.
- span_bp: (Integer) Total genomic length covered by the cluster in bp.
09_qc_cds_mod3.txt
Technical report verifying that Coding Sequence (CDS) lengths are multiples of 3.
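A check of this kind can be sketched as follows: sum the CDS segment lengths of each transcript and test whether the total is divisible by 3. The function name and the example records are hypothetical, and whether the authors counted stop codons as part of the CDS is not stated.

```python
from collections import defaultdict


def cds_mod3_summary(cds_records):
    """Count transcripts whose total CDS length is a multiple of 3.

    cds_records: iterable of (transcript_id, start, end), 1-based inclusive.
    Returns (n_in_frame, n_transcripts).
    """
    total_length = defaultdict(int)
    for transcript_id, start, end in cds_records:
        total_length[transcript_id] += end - start + 1
    n_in_frame = sum(1 for length in total_length.values() if length % 3 == 0)
    return n_in_frame, len(total_length)


# Hypothetical CDS segments: t1 totals 90 bp (in frame), t2 totals 10 bp (not).
in_frame, total = cds_mod3_summary([
    ("t1", 1, 30),
    ("t1", 101, 160),
    ("t2", 1, 10),
])
```

Dividing the first value by the second gives the in-frame fraction, the quantity the abstract reports as 99.7%.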
exon_lengths.csv / intron_lengths.csv
- exon_len / intron_len: (Integer) Length of the individual feature in bp.
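Exon lengths follow directly from the inclusive GTF coordinates, and intron lengths can be taken as the gaps between consecutive exons of the same transcript. The sketch below shows that derivation under the assumption that introns are inferred from exon coordinates; the README does not say how the authors computed them.

```python
def exon_and_intron_lengths(exons):
    """Lengths of the exons of one transcript and of the introns between them.

    exons: list of (start, end), 1-based inclusive, in any order.
    Returns (exon_lengths, intron_lengths) in genomic order.
    """
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return exon_lengths, intron_lengths


# Hypothetical three-exon transcript.
exon_len, intron_len = exon_and_intron_lengths([(300, 349), (100, 199), (1000, 1100)])
```

A transcript with n exons yields n - 1 introns, so single-exon transcripts contribute nothing to the intron table.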
overlapping_gene_pairs.csv
- geneA / geneB: (String) Overlapping Gene IDs.
- pair_type: (Category) same-strand or opposite-strand.
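The pair classification can be illustrated with a simple interval test: two genes on the same scaffold overlap when each starts no later than the other ends. This is a small, quadratic-time sketch with made-up genes; a genome-scale analysis would normally use an interval library such as GenomicRanges, which the authors list among their R packages.

```python
from itertools import combinations


def overlapping_gene_pairs(genes):
    """Find overlapping gene pairs and label them by relative strand.

    genes: list of (gene_id, seqname, start, end, strand), 1-based inclusive.
    Returns a list of (geneA, geneB, pair_type).
    """
    pairs = []
    for a, b in combinations(genes, 2):
        same_scaffold = a[1] == b[1]
        overlaps = a[2] <= b[3] and b[2] <= a[3]
        if same_scaffold and overlaps:
            pair_type = "same-strand" if a[4] == b[4] else "opposite-strand"
            pairs.append((a[0], b[0], pair_type))
    return pairs


# Hypothetical genes; g4 sits on a different scaffold and never pairs.
pairs = overlapping_gene_pairs([
    ("g1", "scaffold_1", 100, 500, "+"),
    ("g2", "scaffold_1", 450, 900, "-"),
    ("g3", "scaffold_1", 800, 1200, "-"),
    ("g4", "scaffold_2", 100, 500, "+"),
])
```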
genes_near_scaffold_edges.csv
- dist_to_end: (Integer) Distance to the nearest scaffold end in bp.
- near_edge: (Boolean) TRUE if the gene is at the edge; FALSE otherwise.
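The two columns can be related as in the sketch below. The distance is taken as the smaller of the gaps to the scaffold start and the scaffold end. The cutoff that makes a gene count as near the edge is not given in the README, so the 10,000 bp default here is purely a placeholder.

```python
def edge_distance(start, end, scaffold_len, threshold=10_000):
    """Distance from a gene to the nearest scaffold end, plus a near-edge flag.

    start, end: 1-based inclusive gene coordinates in bp.
    scaffold_len: scaffold length in bp.
    threshold: hypothetical cutoff in bp; the dataset's actual value is not stated.
    """
    dist_to_end = min(start - 1, scaffold_len - end)
    near_edge = dist_to_end <= threshold
    return dist_to_end, near_edge


# Hypothetical genes on a 1 Mb scaffold.
near = edge_distance(5_001, 9_000, 1_000_000)
far = edge_distance(400_000, 450_000, 1_000_000)
```

Genes flagged this way are the ones most likely to be truncated by assembly breaks, which is the usual reason for reporting the metric.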
5. Visualizations (Figures)
The following PNG files provide visual validation of the metrics described in the tabular files:
- Fig_ContigEdge_Genes_Clean.png: Distribution of genes relative to scaffold ends.
- Fig_ExonLength_Log.png: Log-transformed distribution of exon lengths.
- Fig_ExonLength_p995.png: Distribution of exon lengths (99.5th percentile).
- Fig_Exons_per_Transcript_Clean.png: Frequency of exons per transcript model.
- Fig_Gene_Density_per_Seq_Primary_Clean.png: Global gene density per scaffold.
- Fig_GeneDensity_1Mb_Faceted_Clean.png: Faceted view of gene density across windows.
- Fig_GeneLength_ByTop10Seq.png: Gene length comparisons across the 10 largest scaffolds.
- Fig_GeneLength_Distribution_p995.png: Global distribution of gene lengths.
- Fig_GeneOverlaps_Clean.png: Visualization of overlapping gene models.
- Fig_Genes_per_Seq_Top50.png: Gene counts for the 50 largest scaffolds.
- Fig_Intergenic_Dist_p995.png: Distribution of distances between adjacent genes.
- Fig_IntronLength_Log.png: Log-scaled distribution of intron lengths.
- Fig_IntronLength_p995.png: Distribution of intron lengths (99.5th percentile).
- Fig_Isoform_Classes.png: Classification of transcript isoforms.
- Fig_Lorenz_Genes_Clean.png: Lorenz curve showing the distribution of genes across scaffolds.
- Fig_Multicopy_Top25_Lollipop.png: Top 25 most frequent multicopy gene families.
- Fig_RankSize_GeneCounts_Clean.png: Rank-size distribution of gene counts per scaffold.
- Fig_Transcripts_per_Gene_Clipped.png: Count of transcripts per gene (clipped to show distribution).
6. Code/Software
Software Needed to View Data:
- Tabular Data: Text editors or LibreOffice Calc.
- Annotation (.gtf): Integrative Genomics Viewer (IGV) or JBrowse 2.
Software Used for Analysis:
- BRAKER3 (v3.0.8): Prediction pipeline using RNA-seq (SRR28920143, SRR8388975, SRR2183423) and OrthoDB v12.
- R (RStudio v2025.05.1+513): Data processing, statistical analysis, and figure generation.
- R Packages: tidyverse (dplyr, ggplot2), GenomicRanges.
Workflow: BRAKER3 generated the primary GTF. R was then used to parse the GTF, calculate structural statistics (lengths, densities, distances), and generate the validation figures listed above.
