Data from: A phased chromosome-level genome of the annelid tubeworm Galeolaria caespitosa

Van Dorssen, Monique; Belcher, Emily K.; Gallegos, Cristóbal; Monro, Keyne; Hodgins, Kathryn A.

Published Sep 11, 2025 on Dryad. https://doi.org/10.5061/dryad.x95x69pwh

Data files

Sep 11, 2025 version files 511.52 MB

Galeolaria_caespitosa_ChromAssemblyHap1_Mito.fasta.gz: 254.10 MB
Galeolaria_caespitosa_ChromAssemblyHap2.fasta.gz: 244.23 MB
Galeolaria_caespitosa_H1_genome_annotation.gff3.gz: 7.24 MB
Galeolaria_caespitosa_H2_genome_annotation.gff3.gz: 5.95 MB
README.md: 5.26 KB

Abstract

Haplotype-resolved (phased) genome assemblies are emerging as important assets for genomic studies of species with high heterozygosity, but remain lacking for key animal lineages. Here, we use PacBio HiFi and Omni-C technologies to assemble the first phased, annotated, chromosome-level genome for any annelid: the reef-building tubeworm Galeolaria caespitosa (Serpulidae). The assembly is 803.5 Mbp long (scaffold N50 = 76.5 Mbp) for haplotype 1 and 789.3 Mbp long (scaffold N50 = 75.4 Mbp) for haplotype 2, which are arranged into 11 pairs of chromosomes showing no sign of sex chromosomes. This compares with cytological analyses reporting 12–13 pairs in G. caespitosa’s closest relatives, including species that are protandrous hermaphrodites. We combined long-read and short-read transcriptome sequencing to annotate both haplotypes, resulting in 30,495 predicted proteins for haplotype 1, 27,423 predicted proteins for haplotype 2, and 79.5% of proteins with at least one functional annotation. We also assembled a mitochondrial genome 23 Kbp long, annotating all genes typically found in mitochondrial DNA apart from those coding the 16S ribosomal subunit (rrnL) and the protein atp8, a short, fast-evolving mitochondrial gene missing in other metazoans. Comparing G. caespitosa’s genome to those of three other annelids reveals limited collinearity despite 36.0% shared orthologous gene clusters (4,238 of 11,763 clusters counted in G. caespitosa), suggesting extensive chromosomal rearrangements among lineages. New high-quality annelid genomes may help resolve the genetic and evolutionary basis of this diversity.

Dataset DOI: 10.5061/dryad.x95x69pwh

Description of the Data

Haplotype-resolved (phased) genome assemblies are emerging as important assets for genomic studies of species with high heterozygosity; however, they remain lacking for key animal lineages. Here, we use PacBio HiFi long-read sequencing and Omni-C scaffolding to assemble the first phased, annotated, chromosome-level genome for any annelid: the reef-building tubeworm Galeolaria caespitosa (Serpulidae).

Assembly statistics:

Haplotype 1: 803.5 Mbp, scaffold N50 = 76.5 Mbp
Haplotype 2: 789.3 Mbp, scaffold N50 = 75.4 Mbp

Both haplotypes are arranged into 11 pairs of chromosomes, with no evidence of sex chromosomes.
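
For readers unfamiliar with the scaffold N50 values quoted above, the following is a minimal sketch of how N50 is conventionally computed from a list of sequence lengths: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative assumptions; the reported statistics were produced by the QC tools listed under Methods Summary, not by this code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; N50 is the length of the
    sequence at which the cumulative sum first reaches half the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Toy lengths (hypothetical, not taken from the assembly).
toy_lengths = [80, 70, 50, 30, 20, 10]
print(n50(toy_lengths))  # 70: 80 + 70 = 150 reaches half of 260
```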

Gene models were annotated using funannotate, integrating ab initio predictors, transcriptome alignments, and homology-based evidence (PFAM, InterPro, EggNog, COG).

Files and variables

Genome Assemblies

File: Galeolaria_caespitosa_ChromAssemblyHap1_Mito.fasta.gz

Description: Haplotype 1 assembly. This FASTA file contains 11 chromosomes (labeled chromosome_1 to chromosome_11), the mitochondrial genome (labeled mitochondrion), and 570 unplaced scaffolds (labeled scaffold_13 to scaffold_583).

File: Galeolaria_caespitosa_ChromAssemblyHap2.fasta.gz

Description: Haplotype 2 assembly. This FASTA file contains 11 chromosomes (labeled chromosome_1 to chromosome_11) and 278 unplaced scaffolds (labeled scaffold_12 to scaffold_289).

Repeat regions are soft-masked in both assemblies.
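
Soft-masking conventionally means that repeat regions are written in lowercase while the rest of the sequence stays uppercase. As a rough sketch of how the assemblies might be inspected, the helper below streams a gzipped FASTA file and reports each record's name, length, and number of lowercase (soft-masked) bases. It uses only the Python standard library; the function name `fasta_summary` is an assumption for illustration and is not part of the dataset or its code repository.

```python
import gzip


def fasta_summary(path):
    """Yield (name, length, soft_masked_bases) for each record in a gzipped FASTA."""
    name, length, masked = None, 0, 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    yield name, length, masked
                fields = line[1:].split()
                name = fields[0] if fields else ""
                length, masked = 0, 0
            elif line:
                length += len(line)
                # Lowercase letters are counted as soft-masked bases.
                masked += sum(1 for base in line if base.islower())
    # Emit the final record once the file ends.
    if name is not None:
        yield name, length, masked
```

Iterating over `fasta_summary("Galeolaria_caespitosa_ChromAssemblyHap2.fasta.gz")` should then give one tuple per chromosome or scaffold, from which a per-sequence masked fraction can be taken as `masked / length`.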

These data provide a chromosome-level, haplotype-resolved genomic resource for Galeolaria caespitosa, enabling studies in molecular ecology, adaptation, and genome evolution in annelids.

Gene Annotations

File: Galeolaria_caespitosa_H1_genome_annotation.gff3.gz

Description: Annotation file for haplotype 1.

File: Galeolaria_caespitosa_H2_genome_annotation.gff3.gz

Description: Annotation file for haplotype 2.

Annotation files include:

gene features (coordinates, identifiers, locus tags).
mRNA transcripts (with predicted products, GO terms, and database cross-references).
CDS coding sequence regions linked to predicted proteins.
protein_id, product, note, and Dbxref fields for functional interpretation.
Genes annotated as “hypothetical protein” lack strong functional assignments but were predicted based on gene structure evidence.

Both files are in GFF3 (General Feature Format version 3), a standardized format for describing genomic features. Each feature line has 9 tab-separated columns: seqid, source, type, start, end, score, strand, phase, and attributes.

Example Entry (from Hap1 annotation)

chromosome_1  funannotate  gene  310900  311955  .  +  .  ID=galc_g2; locus_tag=ABL822_00002; gene=NACHRB1;

chromosome_1  funannotate  mRNA  310900  311955  .  +  .  ID=galc_g2.t1; Parent=galc_g2; product=Neurotransmitter-gated ion-channel transmembrane region; Ontology_term=GO:0007268,GO:0042391; Dbxref=PFAM:PF02932,InterPro:IPR006029;

chromosome_1  funannotate  CDS   310900  311556  .  +  0  ID=galc_g2.t1.cds1; Parent=galc_g2.t1; protein_id=t1.ABL822_00002;
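
The entries above can be read programmatically. The sketch below splits one GFF3 feature line into its nine columns and turns the attributes column into a dictionary. It assumes tab-separated columns as in the GFF3 specification, tolerates spaces after the semicolons as shown in the example entry, and does not percent-decode attribute values. The names `GFF3_COLUMNS` and `parse_gff3_line` are illustrative, not part of the dataset.

```python
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with a nested attributes dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip the empty piece left by a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = value
    record["attributes"] = attributes
    return record


# The mRNA entry from the example above, rebuilt with tab separators.
example = "\t".join([
    "chromosome_1", "funannotate", "mRNA", "310900", "311955", ".", "+", ".",
    "ID=galc_g2.t1; Parent=galc_g2; "
    "product=Neurotransmitter-gated ion-channel transmembrane region; "
    "Ontology_term=GO:0007268,GO:0042391; "
    "Dbxref=PFAM:PF02932,InterPro:IPR006029;",
])
record = parse_gff3_line(example)
print(record["attributes"]["Parent"])                    # galc_g2
print(record["attributes"]["Ontology_term"].split(","))  # the two GO terms
```

Multi-valued attributes such as Ontology_term and Dbxref are left as comma-separated strings so that callers can decide how to split them.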

Methods Summary

We assembled the Galeolaria caespitosa genome from a wild male collected in Victoria, Australia. High-molecular-weight DNA was extracted from sperm and sequenced with PacBio HiFi (Sequel II) for contig generation, and an Omni-C library from muscle tissue was sequenced on the Illumina HiSeq X for scaffolding.

Assembly

Assembler: hifiasm v0.15.4 (Hi-C mode, phased assembly)
Scaffolding: YaHS v1.1 with Omni-C reads
Quality control: QUAST v5.0.2, BUSCO v5.1.3 (eukaryota_odb10), Merqury v1.3
Chromosome assignment: chromatin contact maps (Juicebox v1.9.8), k-means clustering of scaffold lengths, and minimap2 alignments between haplotypes
Final assembly: 11 chromosome pairs; haplotype 1 = 803.5 Mbp (scaffold N50 = 76.5 Mbp), haplotype 2 = 789.3 Mbp (scaffold N50 = 75.4 Mbp)

Mitochondrial Genome

Extracted from haplotype 1 with MitoHiFi v3.2.1, annotated with MITOS v2.1.0 and MitoFinder v1.4.0, and validated by comparison with other serpulid mitogenomes.

Transcriptome Data

Long-read Iso-Seq: pooled adult and developmental samples, sequenced on PacBio Sequel IIe
Short-read RNA-Seq: 91 embryonic and larval samples sequenced on Illumina NovaSeq X Plus
Pre-processing: Iso-Seq reads processed with lima v2.9.0 and isoseq3 v4.0.0; RNA-Seq reads trimmed with fastp v0.20.0

Annotation

Repeat modeling/masking: RepeatModeler2 v2.0.5, RepeatMasker v4.1.1, ProtExcluder v1.2
Gene prediction: BRAKER3 pipeline integrating RNA-Seq, Iso-Seq, and UniProtKB annelid proteins, plus GeneMarkS-T and TSEBRA for evidence integration
Post-processing: Conversion and cleaning with AGAT v1.0.0 and GFFtk v24.2.4; low-expression models filtered with edgeR
Functional annotation: InterProScan v5.61, EggNog-mapper v2.1.10, funannotate v1.8.14, Diamond-BLASTp v1.19, merged with gffutils v0.13

Code/software

Code is available on GitHub: https://github.com/moniquevdor/GaleolariaReferenceGenome