Genomic insights into the chromosomal elongation in a family of Collembola
Data files
Jul 27, 2023 version files 227.02 MB
Abstract
Collembola is a highly diverse and abundant group of soil arthropods with chromosome numbers ranging from 5 to 11. Previous karyotype studies indicated that the family Tomoceridae possesses an exceptionally long chromosome. To better understand chromosome size evolution in Collembola, we obtained a chromosome-level genome of Yoshiicerus persimilis with a size of 334.44 Mb and BUSCO completeness of 97.0% (n = 1,013). The genomes of both Y. persimilis and the recently published Tomocerus qinae contain an exceptionally large chromosome (ElChr, >100 Mb) that accounts for nearly one-third of the genome. Comparative genomic analyses suggest that chromosomal elongation occurred independently in the two species approximately 10 million years ago, rather than in the ancestor of Tomoceridae. The ElChr elongation was caused by large tandem and segmental duplications, as well as transposon proliferation, and genes in these regions experienced weaker purifying selection (higher dN/dS) than genes in conserved regions. Moreover, inter-genomic synteny analyses indicated that chromosomal fission/fusion events played a crucial role in the evolution of chromosome numbers (ranging from 5 to 7) within Entomobryomorpha. This study provides a valuable resource for investigating chromosome evolution in Collembola.
README
genome.fa.masked.gz Repeat-masked genome assembly
repeats.gff.gz Repeat annotation
iprscan.tsv.gz InterProScan results
eggnog.emapper.annotations.gz eggNOG annotation results
gene.maker.gff.gz Annotation file of MAKER-annotated protein-coding genes
cds.maker.fasta.gz Coding sequences of MAKER-annotated protein-coding genes
proteins.maker.fasta.gz Amino-acid sequences of MAKER-annotated protein-coding genes
transcripts.maker.fasta.gz Transcripts of MAKER-annotated protein-coding genes
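As a rough sketch of how the assembly file might be inspected, the Python snippet below tallies sequence lengths from a FASTA stream and reports the longest sequence's share of the total. The function names are illustrative and not part of the dataset. It assumes genome.fa.masked.gz is a standard gzip-compressed FASTA file, and the in-memory demo uses made-up sequences rather than real data.

```python
import gzip
import io


def sequence_lengths(handle):
    """Return {sequence_name: length} for a FASTA text stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Masked bases (lower case or N) still count towards the length.
            lengths[name] += len(line)
    return lengths


def largest_fraction(lengths):
    """Return (name, fraction of total length) for the longest sequence."""
    total = sum(lengths.values())
    name = max(lengths, key=lengths.get)
    return name, lengths[name] / total


def lengths_from_gzip(path):
    """Read a gzip-compressed FASTA file (assumed format of genome.fa.masked.gz)."""
    with gzip.open(path, "rt") as handle:
        return sequence_lengths(handle)


# Toy in-memory example with invented sequences.
demo_lengths = sequence_lengths(io.StringIO(">chr1 demo\nACGT\nacgt\n>chr2\nNNNN\n"))
demo_name, demo_share = largest_fraction(demo_lengths)
```

On the real file, calling largest_fraction(lengths_from_gzip("genome.fa.masked.gz")) would be one way to check how large a share of the assembly the longest sequence represents.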
Methods
Genome assembly
De novo assembly of PacBio long reads was performed by Raven v. 1.6.0. The assembly was then polished with one round of long reads using Flye v. 2.8.3 and two rounds of Illumina short reads using NextPolish v. 1.3.1. Primary contigs were anchored into chromosomes using 3D-DNA v. 180922.
Genome annotation
We used MAKER v. 3.01.03, which integrates ab initio, RNA-seq, and protein homology evidence, to predict protein-coding genes (PCGs). BRAKER v. 2.1.6 and GeMoMa v. 1.7.1 predictions, each combining protein and transcriptome evidence, were passed to MAKER as the ab initio input. BRAKER trained Augustus v. 3.3.4 and GeneMark-ES/ET/EP 4.68_lic, integrating evidence from the OrthoDB10 v1 database. GeMoMa, run with the parameters “GeMoMa.c = 0.3 GeMoMa.p = 12”, used eight species (Daphnia magna, Cloeon dipterum, Zootermopsis nevadensis, Drosophila melanogaster, Rhopalosiphum maidis, Tribolium castaneum, Sinella curviseta, and FCSH) as the protein homology-based reference. RNA-seq alignments were produced using HISAT2 v. 2.2.0 and were further assembled into transcripts with the genome-guided assembler StringTie v. 2.1.6. MAKER used the protein sequences from the same eight species as protein homology evidence.
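The gene models produced by this step are distributed as gene.maker.gff.gz. A minimal sketch of counting gene features per sequence is given below, assuming the file follows the usual nine-column, tab-delimited GFF3 layout with "gene" in the third column. The function name and the demo records are hypothetical.

```python
import gzip
import io
from collections import Counter


def genes_per_sequence(handle):
    """Count features of type 'gene' per sequence ID in a GFF3 text stream."""
    counts = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            # Some GFF3 files embed sequences after this directive; stop here.
            break
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "gene":
            counts[fields[0]] += 1
    return counts


def genes_from_gzip(path):
    """Read a gzip-compressed GFF3 file (assumed format of gene.maker.gff.gz)."""
    with gzip.open(path, "rt") as handle:
        return genes_per_sequence(handle)


# Toy in-memory example with invented records.
demo_gff = io.StringIO(
    "##gff-version 3\n"
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1\n"
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1\n"
    "chr1\tmaker\tgene\t2000\t2500\t.\t-\t.\tID=g2\n"
    "chr2\tmaker\tgene\t50\t400\t.\t+\t.\tID=g3\n"
)
demo_counts = genes_per_sequence(demo_gff)
```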
PCG functions were annotated by aligning protein sequences to the UniProtKB database using Diamond v. 2.0.8 with an e-value threshold of 1e-5. Protein domains were predicted by InterProScan 5.48-83.0 based on five public databases: Pfam, SMART, Superfamily, Gene3D, and CDD. EggNOG-mapper v. 2.1.5 was also employed for functional category annotation based on the eggNOG v. 5.0.2 database.
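To illustrate the e-value cutoff, the sketch below keeps, for each query protein, the highest-bitscore hit with an e-value no greater than 1e-5. It assumes the default 12-column BLAST-style tabular output (as written by Diamond with --outfmt 6). The Diamond output itself is not among the deposited files, so the demo rows are invented, and choosing one best hit per query is an assumption rather than the procedure described above.

```python
import csv
import io

# Column order of BLAST-style tabular output (default for --outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Map each query to (subject, evalue, bitscore) of its best passing hit."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            # Hit is weaker than the chosen threshold; discard it.
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], evalue, bitscore)
    return best


# Toy in-memory example with invented hits.
demo_hits = best_hits(io.StringIO(
    "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "g1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t250\n"
    "g2\tP3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30\n"
))
```

In the demo, the second hit for g1 replaces the first because its bitscore is higher, and g2 is dropped because its only hit exceeds the threshold.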
Usage notes
The sequencing reads are deposited at NCBI (SRR13480398–SRR13480401 and SRR25299242) under BioProject PRJNA630033. The genome assembly is deposited at GenBank under accession JABJWA000000000. Additionally, the annotation results for repeat sequences, gene structure, and functional prediction have been deposited in Figshare (https://doi.org/10.6084/m9.figshare.23722086).