Annelid comparative genomics and the evolution of massive lineage-specific genome rearrangement in bilaterians

Cite this dataset

Lewin, Thomas; Liao, Isabel Jiah-Yih; Luo, Yi-Jyun (2024). Annelid comparative genomics and the evolution of massive lineage-specific genome rearrangement in bilaterians [Dataset]. Dryad.


The organization of genomes into chromosomes is critical for processes such as genetic recombination, environmental adaptation, and speciation. All animals with bilateral symmetry inherited a genome structure from their last common ancestor that has been highly conserved in some taxa but seemingly unconstrained in others. However, the evolutionary forces driving these differences and the processes by which they emerge have remained largely uncharacterized. Here we analyze genome organization across the phylum Annelida using 23 chromosome-level annelid genomes. We find that while most annelids have maintained the conserved bilaterian genome structure, a group containing leeches and earthworms possesses completely scrambled genomes. We develop a rearrangement index to quantify the extent of genome structure evolution and show leeches and earthworms to have the most highly rearranged genomes of any currently sampled bilaterian. We further show that bilaterian genomes can be classified into two distinct categories, high and low rearrangement, largely influenced by the presence or absence, respectively, of chromosome fission events. Our findings demonstrate that animal genome structure can be highly variable within a phylum and reveal that genome rearrangement can occur either in a gradual, stepwise fashion or as rapid, all-encompassing changes over short evolutionary timescales.

README: Annelid comparative genomics and the evolution of massive lineage-specific genome rearrangement in bilaterians

Description of the data and file structure

Gene models for 23 annelid species are available in GTF (.gtf), coding sequence (.fasta), and amino acid (.faa) formats. These 23 species are listed below:

  • Acholoe squamosa
  • Alentia gelatinosa
  • Alitta virens
  • Amphiduros pacificus
  • Aporrectodea icterica
  • Bimastos eiseni
  • Branchipolynoe longqiensis
  • Branchellion lobata
  • Harmothoe impar
  • Hirudinaria manillensis
  • Lamellibrachia columna
  • Lepidonotus clava
  • Lumbricus rubellus
  • Lumbricus terrestris
  • Metaphire vulgaris
  • Paraescarpia echinospica
  • Piscicola geometra
  • Protula sp. h YS2021
  • Sipunculus nudus
  • Sthenelais limicola
  • Streblospio benedicti
  • Terebella lapidaria
  • Urechis unicinctus
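
As a quick sanity check on any of the GTF gene model files, the number of predicted genes per chromosome can be tallied with a few lines of Python. This is a minimal sketch, not part of the dataset: it assumes the GTF files are tab-delimited with nine columns and contain rows whose feature type (third column) is `gene`. The function name and the demo records are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_genes_per_sequence(gtf_lines: Iterable[str]) -> Counter:
    """Count rows with feature type 'gene' for each sequence (column 1)."""
    counts: Counter = Counter()
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has nine tab-separated columns; ignore malformed rows.
        if len(fields) < 9:
            continue
        seqid, feature = fields[0], fields[2]
        if feature == "gene":
            counts[seqid] += 1
    return counts


# Hypothetical records for illustration only (not taken from the dataset).
demo_gtf = [
    "# demo\n",
    "chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tg1\n",
    "chr1\tAUGUSTUS\ttranscript\t100\t900\t.\t+\t.\tg1.t1\n",
    "chr1\tAUGUSTUS\tgene\t1500\t2400\t.\t-\t.\tg2\n",
    "chr2\tAUGUSTUS\tgene\t50\t700\t.\t+\t.\tg3\n",
]
demo_counts = count_genes_per_sequence(demo_gtf)
print(dict(demo_counts))
```

To apply it to a real file, pass an open file handle instead of the demo list, for example `count_genes_per_sequence(open(path))`, where `path` points to one of the .gtf files.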


This study aimed to characterize interchromosomal rearrangements within the phylum Annelida. All available chromosome-level assemblies of annelid species (n = 24) were obtained from the National Center for Biotechnology Information (NCBI) using NCBI Datasets on February 1st, 2024. Of the 24 genomes, 16 were produced by the Darwin Tree of Life (DToL) sequencing project (The Darwin Tree of Life Project Consortium et al. 2022). The genome assemblies from the DToL project have been made publicly available to the community for further analysis. Those with an accompanying publication are Acholoe squamosa (Adkins et al. 2023), Alitta virens (Fletcher et al. 2023), Lepidonotus clava (Darbyshire et al. 2022), Piscicola geometra (Doe et al. 2023), and Sthenelais limicola (Darbyshire et al. 2023). Genomes from other sources with accompanying publications are: Branchipolynoe longqiensis (He et al. 2023), Hirudinaria manillensis (Liu et al. 2023), Metaphire vulgaris (Jin et al. 2020), Owenia fusiformis (Martín-Zamora et al. 2023), Paraescarpia echinospica (Sun et al. 2021), Streblospio benedicti (Zakas et al. 2022), Sipunculus nudus (Zheng et al. 2023), and Urechis unicinctus (Cheng et al. 2024).

One species, O. fusiformis, had available GenBank gene annotations. Gene prediction for the remaining 23 species was performed using RepeatModeler2 (v2.0.4) (Flynn et al. 2020), RepeatMasker (v4.1.5) (Smit et al. 2015), and the BRAKER3 pipeline (v3.0.3) (Stanke et al. 2006; Stanke et al. 2008; Li et al. 2009; Barnett et al. 2011; Lomsadze et al. 2014; Buchfink et al. 2015; Hoff et al. 2016; Hoff et al. 2019; Brůna et al. 2021) as reported previously (Lewin et al. 2024). For species with available RNA-seq data (supplementary table S10), reads were trimmed with fastp (v0.23.4) (Chen et al. 2018) and mapped with STAR (v2.7.10b) (Dobin et al. 2013) before BRAKER3 was run in RNA-seq mode. For species with no RNA-seq data, BRAKER3 was run in protein mode using the supplied Metazoa.fa protein file. Gene prediction quality was assessed using BUSCO (v5.4.7) (Simão et al. 2015). 
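
The choice between the two BRAKER3 modes described above can be sketched as a small helper that assembles the command line. This is an illustrative sketch only, not the authors' actual invocation: the function name and file names are hypothetical, and only the `--genome`, `--bam`, and `--prot_seq` options of `braker.pl` are used. The command is built as a list and is not executed here.

```python
from typing import List, Optional


def braker_command(
    genome: str,
    rnaseq_bam: Optional[str] = None,
    protein_fasta: str = "Metazoa.fa",
) -> List[str]:
    """Build a braker.pl command: RNA-seq mode if a BAM is given, else protein mode."""
    cmd = ["braker.pl", f"--genome={genome}"]
    if rnaseq_bam:
        # RNA-seq mode: aligned reads (e.g. from STAR) are passed as a BAM file.
        cmd.append(f"--bam={rnaseq_bam}")
    else:
        # Protein mode: fall back to a protein database file.
        cmd.append(f"--prot_seq={protein_fasta}")
    return cmd


# Hypothetical file names for illustration.
print(" ".join(braker_command("genome.masked.fa", rnaseq_bam="rnaseq.bam")))
print(" ".join(braker_command("genome.masked.fa")))
```

The returned list can be passed to `subprocess.run` once the paths point to real files and BRAKER3 is installed.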


Funding

Royal Society, Award: NIF\R1\201315

Academia Sinica, Award: AS-CDA-112-L06