Data from: A phased chromosome-level genome of the annelid tubeworm Galeolaria caespitosa
Data files
Sep 11, 2025 version; files total 511.52 MB
Abstract
Haplotype-resolved (phased) genome assemblies are emerging as important assets for genomic studies of species with high heterozygosity, but remain lacking for key animal lineages. Here, we use PacBio HiFi and Omni-C technologies to assemble the first phased, annotated, chromosome-level genome for any annelid: the reef-building tubeworm Galeolaria caespitosa (Serpulidae). The assembly is 803.5 Mbp long (scaffold N50 = 76.5 Mbp) for haplotype 1 and 789.3 Mbp long (scaffold N50 = 75.4 Mbp) for haplotype 2, which are arranged into 11 pairs of chromosomes showing no sign of sex chromosomes. This compares with cytological analyses reporting 12–13 pairs in G. caespitosa’s closest relatives, including species that are protandrous hermaphrodites. We combined long-read and short-read transcriptome sequencing to annotate both haplotypes, resulting in 30,495 predicted proteins for haplotype 1, 27,423 predicted proteins for haplotype 2, and 79.5% of proteins with at least one functional annotation. We also assembled a mitochondrial genome 23 Kbp long, annotating all genes typically found in mitochondrial DNA apart from those coding for the 16S ribosomal subunit (rrnL) and the protein atp8 (a short, fast-evolving mitochondrial gene missing in other metazoans). Comparing G. caespitosa’s genome to those of three other annelids reveals limited collinearity despite 36.0% of orthologous gene clusters being shared (4,238 of 11,763 clusters counted in G. caespitosa), suggesting extensive chromosomal rearrangements among lineages. New high-quality annelid genomes may help resolve the genetic and evolutionary basis of this diversity.
Dataset DOI: 10.5061/dryad.x95x69pwh
Description of the Data
Haplotype-resolved (phased) genome assemblies are emerging as important assets for genomic studies of species with high heterozygosity; however, they remain lacking for key animal lineages. Here, we use PacBio HiFi long-read sequencing and Omni-C scaffolding to assemble the first phased, annotated, chromosome-level genome for any annelid: the reef-building tubeworm Galeolaria caespitosa (Serpulidae).
Assembly statistics:
- Haplotype 1: 803.5 Mbp, scaffold N50 = 76.5 Mbp
- Haplotype 2: 789.3 Mbp, scaffold N50 = 75.4 Mbp
Both haplotypes are arranged into 11 pairs of chromosomes, with no evidence of sex chromosomes.
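For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from this dataset or from the QUAST implementation.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to at least
        # half of the assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy lengths (not real scaffold lengths from this assembly).
toy_lengths = [80, 70, 50, 30, 20, 10]
print(n50(toy_lengths))  # 70: 80 + 70 = 150 reaches half of the 260 total
```

Applied to the real scaffold lengths of each haplotype, a calculation of this kind should yield values close to the scaffold N50 figures listed above, although tools may differ slightly in how ties and thresholds are handled.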
Gene models were annotated using funannotate, integrating ab initio predictors, transcriptome alignments, and homology-based evidence (PFAM, InterPro, EggNog, COG).
Files and variables
Genome Assemblies
File: Galeolaria_caespitosa_ChromAssemblyHap1_Mito.fasta.gz
Description: Haplotype 1 assembly. This FASTA file contains 11 chromosomes (labeled chromosome_1 to chromosome_11), the mitochondrial genome (labeled mitochondrion), and 570 unplaced scaffolds (labeled scaffold_13 to scaffold_583).
File: Galeolaria_caespitosa_ChromAssemblyHap2.fasta.gz
Description: Haplotype 2 assembly. This FASTA file contains 11 chromosomes (labeled chromosome_1 to chromosome_11) and 278 unplaced scaffolds (labeled scaffold_12 to scaffold_289).
Repeat regions are soft-masked in both assemblies.
These data provide a chromosome-level, haplotype-resolved genomic resource for Galeolaria caespitosa, enabling studies in molecular ecology, adaptation, and genome evolution in annelids.
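As a quick sanity check after downloading, the assembly files described above can be summarized with the Python standard library alone. The sketch below counts record lengths, groups records by the naming scheme given in the file descriptions (chromosome_N, mitochondrion, scaffold_N), and estimates the soft-masked fraction by counting lowercase bases. The helper names (`open_fasta`, `summarize_fasta`, `classify`) are assumptions made for illustration, and the demo uses a tiny in-memory FASTA rather than the real files.

```python
import gzip
import io
from collections import Counter


def open_fasta(path):
    """Open a gzip-compressed FASTA file as text."""
    return gzip.open(path, "rt")


def summarize_fasta(handle):
    """Return ({record_name: length}, soft_masked_fraction) for a FASTA stream."""
    lengths = {}
    masked = 0
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
            # Soft-masked repeats are written in lowercase.
            masked += sum(1 for base in line if base.islower())
    total = sum(lengths.values())
    return lengths, (masked / total if total else 0.0)


def classify(name):
    """Group a record by the labels used in the file descriptions."""
    if name.startswith("chromosome_"):
        return "chromosome"
    if name == "mitochondrion":
        return "mitochondrion"
    return "unplaced"


# Tiny in-memory example; swap in open_fasta("<downloaded file>.fasta.gz").
demo = io.StringIO(">chromosome_1\nACGTacgt\nAC\n>scaffold_13\nnnnnAC\n")
lengths, masked_fraction = summarize_fasta(demo)
print(Counter(classify(n) for n in lengths))
print(masked_fraction)  # 0.5 (8 lowercase bases out of 16)
```

For the haplotype 1 file, the class counts would be expected to match the description (11 chromosomes, one mitochondrion, and the unplaced scaffolds); this has not been run against the deposited files here.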
Gene Annotations
File: Galeolaria_caespitosa_H1_genome_annotation.gff3.gz
Description: Annotation file for haplotype 1.
File: Galeolaria_caespitosa_H2_genome_annotation.gff3.gz
Description: Annotation file for haplotype 2.
Annotation files include:
- gene features (coordinates, identifiers, locus tags).
- mRNA transcripts (with predicted products, GO terms, and database cross-references).
- CDS coding sequence regions linked to predicted proteins.
- protein_id, product, note, and Dbxref fields for functional interpretation.
- Genes annotated as “hypothetical protein” lack strong functional assignments but were predicted based on gene structure evidence.
Both files are in GFF3 (General Feature Format version 3), which is a standardized format for describing genomic features. These files have 9 columns: seqid, source, type, start, end, score, strand, phase, and attributes.
Example Entry (from Hap1 annotation)
chromosome_1 funannotate gene 310900 311955 . + . ID=galc_g2; locus_tag=ABL822_00002; gene=NACHRB1;
chromosome_1 funannotate mRNA 310900 311955 . + . ID=galc_g2.t1; Parent=galc_g2; product=Neurotransmitter-gated ion-channel transmembrane region; Ontology_term=GO:0007268,GO:0042391; Dbxref=PFAM:PF02932,InterPro:IPR006029;
chromosome_1 funannotate CDS 310900 311556 . + 0 ID=galc_g2.t1.cds1; Parent=galc_g2.t1; protein_id=t1.ABL822_00002;
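The example entries above can be read programmatically with a few lines of Python. This is a minimal sketch that assumes the deposited files use tab-separated columns as the GFF3 specification requires (the columns appear space-separated in this README), and that attribute pairs are separated by semicolons with optional surrounding spaces. The function name `parse_gff3_line` is hypothetical, and URL-escaped characters in attribute values are not decoded.

```python
# Attributes that GFF3 allows to carry several comma-separated values.
MULTI_VALUED = {"Parent", "Dbxref", "Ontology_term"}


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; column 9 becomes a nested dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    keys = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")
    record = dict(zip(keys, columns[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in columns[8].split(";"):
        pair = pair.strip()
        if not pair:
            continue  # skip the empty piece after a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = value.split(",") if key in MULTI_VALUED else value
    record["attributes"] = attributes
    return record


# The mRNA example entry from this README, rebuilt with tab separators.
example = "\t".join([
    "chromosome_1", "funannotate", "mRNA", "310900", "311955", ".", "+", ".",
    "ID=galc_g2.t1; Parent=galc_g2; "
    "product=Neurotransmitter-gated ion-channel transmembrane region; "
    "Ontology_term=GO:0007268,GO:0042391; "
    "Dbxref=PFAM:PF02932,InterPro:IPR006029;",
])
record = parse_gff3_line(example)
print(record["attributes"]["Dbxref"])      # ['PFAM:PF02932', 'InterPro:IPR006029']
print(record["end"] - record["start"] + 1)  # 1056 (GFF3 coordinates are 1-based, inclusive)
```

For whole-file work, a dedicated library such as gffutils (listed in the Methods Summary) is likely a better choice than a hand-written parser.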
Methods Summary
We assembled the Galeolaria caespitosa genome from a wild male collected in Victoria, Australia. High-molecular-weight DNA was extracted from sperm and sequenced with PacBio HiFi (Sequel II) for contig generation, and an Omni-C library from muscle tissue was sequenced on the Illumina HiSeq X for scaffolding.
Assembly
- Assembler: HiFiasm v0.15.4 (Hi-C mode, phased assembly)
- Scaffolding: YaHS v1.1 with Omni-C reads
- Quality control: QUAST v5.0.2, BUSCO v5.1.3 (eukaryota_odb10), Merqury v1.3
- Chromosome assignment: chromatin contact maps (Juicebox v1.9.8), k-means clustering of scaffold lengths, and minimap2 alignments between haplotypes
- Final assembly: 11 chromosome pairs; haplotype 1 = 803.5 Mbp (scaffold N50 = 76.5 Mbp), haplotype 2 = 789.3 Mbp (scaffold N50 = 75.4 Mbp)
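To illustrate the idea behind clustering scaffold lengths, the sketch below runs a one-dimensional k-means (Lloyd's algorithm) with two clusters on log10-transformed lengths, separating chromosome-sized scaffolds from small unplaced ones. This README does not state how the clustering was configured, so the choice of k = 2, the log transform, the function name `two_means_1d`, and the toy lengths are all assumptions made for illustration and should not be read as the authors' procedure.

```python
import math


def two_means_1d(values, iterations=100):
    """Lloyd's algorithm with k=2 on 1-D data.

    Returns one label per value: 1 for the higher-valued cluster, 0 otherwise.
    """
    low, high = min(values), max(values)  # initialize centroids at the extremes
    labels = []
    for _ in range(iterations):
        new_labels = [1 if abs(v - high) < abs(v - low) else 0 for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        lows = [v for v, lab in zip(values, labels) if lab == 0]
        highs = [v for v, lab in zip(values, labels) if lab == 1]
        if lows:
            low = sum(lows) / len(lows)
        if highs:
            high = sum(highs) / len(highs)
    return labels


# Hypothetical scaffold lengths in bp (not values from this assembly).
toy_lengths = [80_000_000, 76_000_000, 70_000_000, 65_000_000,
               200_000, 90_000, 50_000, 10_000]
labels = two_means_1d([math.log10(x) for x in toy_lengths])
print(labels)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

In practice, a length-based split like this would only be a first pass; the list above indicates that contact maps and alignments between haplotypes were also used for chromosome assignment.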
Mitochondrial Genome
Extracted from haplotype 1 with MitoHiFi v3.2.1, annotated with MITOS v2.1.0 and MitoFinder v1.4.0, and validated by comparison with other serpulid mitogenomes.
Transcriptome Data
- Long-read Iso-Seq: pooled adult and developmental samples, sequenced on PacBio Sequel IIe
- Short-read RNA-Seq: 91 embryonic and larval samples sequenced on Illumina NovaSeq X Plus
- Pre-processing: Iso-Seq reads processed with lima v2.9.0 and isoseq3 v4.0.0; RNA-Seq reads trimmed with fastp v0.20.0
Annotation
- Repeat modeling/masking: RepeatModeler2 v2.0.5, RepeatMasker v4.1.1, ProtExcluder v1.2
- Gene prediction: BRAKER3 pipeline integrating RNA-Seq, Iso-Seq, and UniProtKB annelid proteins, plus GeneMarkS-T and TSEBRA for evidence integration
- Post-processing: Conversion and cleaning with AGAT v1.0.0 and GFFtk v24.2.4; low-expression models filtered with edgeR
- Functional annotation: InterProScan v5.61, EggNog-mapper v2.1.10, funannotate v1.8.14, Diamond-BLASTp v1.19, merged with gffutils v0.13
Code/software
Code is available on GitHub: https://github.com/moniquevdor/GaleolariaReferenceGenome
