A high-quality genome of the dobsonfly Neoneuromus ignobilis reveals molecular convergences in aquatic insects

Ma, Xing-Zhou1; Wang, Zi-Qi1; Ye, Xi-Qian1; Liu, Xing-Yue2; Tang, Pu1; Shen, Xingxing1; Chen, Xue-Xin1

Published Aug 11, 2022 on Dryad. https://doi.org/10.5061/dryad.j9kd51cg8

Data files

Aug 11, 2022 version files 446.02 MB

NIG_curated_v1.5.1.zip (141.41 MB)
NIG_miRNA.gff3 (8.24 KB)
NIG_mtDNA.fa (15.78 KB)
NIG_mtDNA.gff (2.20 KB)
NIG_OGS_V1.8.gff3 (37.69 MB)
NIG_TE.gff (175.21 MB)
NIG_trf.gff (91.69 MB)
README.md (1.49 KB)

Abstract

Neoneuromus ignobilis is an archaic holometabolous aquatic predatory insect. However, a lack of genomic resources has hindered the use of whole-genome sequencing to explore the genetic basis and molecular mechanisms of its adaptive evolution. Here, we provide a high-contiguity, chromosome-level genome assembly of N. ignobilis using high-coverage nanopore reads and the Hi-C technique. The final assembly is 481.43 Mb in size, containing 12 telomere-ended pseudochromosomes with only 23 gaps. We then compared the genomes of 42 hexapod species, including 11 aquatic insects from six independent lineages, and found convergent expansions of long wavelength-sensitive and blue-sensitive opsins, thermal stress response TRP channels, and sulfotransferases in aquatic insects, which may be related to their aquatic adaptation. We also detected strong non-random signals of convergent amino acid substitutions in aquatic insects. Collectively, our comparative genomic analysis reveals evidence of molecular convergence in aquatic insects in both gene family evolution and amino acid substitutions.

Sample collection and sequencing

N. ignobilis samples used for genome sequencing were collected in Gutian Mountain (Zhejiang, China) from 2019 to 2021. One adult female was used for PacBio CLR sequencing. We removed the abdomen before DNA extraction to reduce microbial contamination. DNA extraction, library construction, and sequencing were conducted by Novogene (Beijing, China). A total of 42.3 Gb of PacBio CLR subreads (87x, read length N50 = 16,964 nt) were produced. Short reads (42.4 Gb, 87x) were also sequenced from the same individual on the Illumina HiSeq platform for polishing and genome survey. Another female was kept in the laboratory until egg laying. The eggs were incubated at 25°C, and the newly hatched larvae were sent to NextOmics Biosciences (Wuhan, China) for nanopore, Illumina, and Hi-C sequencing. A total of 54.3 Gb of ultra-long Oxford Nanopore (ONT) reads (112x, read length N50 = 43,487 nt) were generated on a PromethION sequencer with R9.4 flow cells. Base-calling was performed with Guppy. The Hi-C library (digested with DpnII) was sequenced on the Illumina NovaSeq, and a total of 39.9 Gb (82x) of Hi-C reads were produced. Short reads (35.2 Gb, 75x) from newly hatched larvae were also generated on an Illumina NovaSeq platform for polishing.
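The read length N50 and fold-coverage figures reported above follow standard definitions, which the short Python sketch below illustrates. It is not the authors' code: the function names `n50` and `depth` are hypothetical, and the genome size used as the denominator is an assumption (the 481.43 Mb final assembly size is used here for illustration; the authors may have used a genome survey estimate instead).

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def depth(total_bases, genome_size):
    """Fold coverage: sequenced bases divided by the genome size."""
    return total_bases / genome_size


# Toy example: total = 20 bases, and 6 + 5 = 11 >= 10, so N50 is 5.
toy_n50 = n50([2, 3, 4, 5, 6])

# Assumption: the 481.43 Mb assembly size stands in for the genome size.
pacbio_depth = depth(42.3e9, 481.43e6)
```

With that assumed denominator, the 42.3 Gb of PacBio subreads corresponds to roughly 88x, close to the reported 87x.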

Female adults collected in Tianmu Mountain (Zhejiang, China) during 2020 were used for transcriptome sequencing. One full-length cDNA library was constructed and sequenced on a PacBio Sequel platform, producing a total of 20.87 Gb of long polymerase reads. Whole-body samples were used for RNA-seq. One RNA-seq library and one small RNA-seq library were sequenced on the Illumina platform, yielding 6.56 Gb of short RNA-seq reads and 11,324,628 small RNA reads. Novogene (Beijing, China) carried out the RNA extraction, library construction, and sequencing of these libraries. In addition, two RNA-seq libraries, one from newly hatched larvae and one from a large larva (5 cm in length, collected in Gutian Mountain and identified using COI), were sequenced on the MGI-SEQ 2000 platform, producing 19.76 Gb and 17.08 Gb of short RNA-seq reads, respectively. RNA extraction, library construction, and sequencing were conducted by NextOmics Biosciences (Wuhan, China).

Genome assemblies

We first obtained the mitochondrial genome of N. ignobilis. We downloaded the mitochondrial genome of Tribolium castaneum from NCBI (NC_003081.2) as the reference. The PacBio CLR reads were aligned to the T. castaneum mitochondrial genome using Minimap2 v2.1 (Li 2018). The aligned reads were self-corrected using Canu v2.1.1 (Koren et al. 2017). The corrected mitochondrial reads were manually assembled into a circular mitochondrial genome and polished with Illumina short reads using Nextpolish v1.3.1 (Hu et al. 2020).

To assemble the nuclear genome of N. ignobilis, we started with two types of long reads. The PacBio and ONT reads were assembled separately. Five assemblers, Wtdbg2 v2.5 (Ruan and Li 2020), Flye v2.8.3 (Kolmogorov et al. 2019), NextDenovo v2.3 (https://github.com/Nextomics/NextDenovo), Raven v1.5 (Vaser and Šikić 2021), and HASLR v0.8 (Haghshenas et al. 2020), were run with default parameters, yielding ten assemblies. Because mitochondrial reads can cause over-polishing of nuclear mitochondrial DNA (NUMT) regions, it is important to include the mitochondrial genome before genome polishing (Howe et al. 2021; Rhie et al. 2021). The mitochondrial sequence was tandemly replicated twice and added as a single contig to the end of each assembly. We performed two rounds of long-read correction using Racon v1.4.2 (Vaser et al. 2017) and one round of short-read correction using Nextpolish. The mitochondrial contig was discarded after polishing. One round of haplotig removal was carried out using Purge_dups v1.2.5 (Guan et al. 2020). Next, we followed the 3D de novo assembly (3D-DNA) pipeline (Dudchenko et al. 2017) to scaffold each draft into a candidate chromosome-length assembly. Juicebox v1.22 was used to view the Hi-C map of each assembly (Durand et al. 2016). Benchmarking Universal Single-Copy Orthologs (BUSCO v4.1.4) (Simão et al. 2015) with Insecta_Odb10.2020-09-10 was used to evaluate the completeness of the assemblies.
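The mitochondrial decoy step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' script: the helper names and the contig name `mito_tandem` are hypothetical, and "tandemly replicated twice" is interpreted here as two back-to-back copies of the linearized mitochondrial sequence. Presumably the duplication lets reads that span the arbitrary linearization point of the circular genome align end to end.

```python
def add_mito_contig(contigs, mito_seq, copies=2, name="mito_tandem"):
    """Return a new assembly with a tandem mitochondrial contig appended.

    contigs: dict mapping contig name -> sequence (insertion order kept).
    copies:  number of tandem copies (assumption: two copies).
    """
    if name in contigs:
        raise ValueError(f"contig name {name!r} already present")
    with_mito = dict(contigs)
    with_mito[name] = mito_seq * copies  # appended as the last contig
    return with_mito


def drop_mito_contig(contigs, name="mito_tandem"):
    """Remove the decoy contig once polishing has finished."""
    return {k: v for k, v in contigs.items() if k != name}


# Toy assembly with two nuclear contigs and a short stand-in mitogenome.
draft = {"ctg1": "ACGTACGT", "ctg2": "GGCCTTAA"}
for_polishing = add_mito_contig(draft, "ATTA")
after_polishing = drop_mito_contig(for_polishing)
```

In a real run, the polishing tools would operate on the FASTA written from `for_polishing`, and the decoy would be removed from their output rather than from the input dictionary.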

We selected the NextDenovo assemblies of the PacBio and ONT reads for subsequent manual curation because they had the highest contig N50, mounting size, and BUSCO completeness, and the lowest gap number, for both read types (Table S1). The manual curation pipeline included: (1) fixing mis-assemblies and mis-orientations in Juicebox based on the Hi-C map, (2) adjusting locally mis-oriented contigs by comparison with the raw draft contigs, (3) filling gaps in the Hi-C scaffolds by aligning raw draft contigs and corrected long reads to the ends of each gap and finding a single contig or read, or overlapping contigs or reads, that cover the gap, and (4) aligning corrected long reads to the ends of each chromosome to discover telomeres.

For gap-filling and telomere discovery, the error-prone long reads were first corrected with short reads using Fmlrc2 (Wang et al. 2018) and then self-corrected using Canu v2.1.1. Corrected PacBio reads > 30 kb and ONT reads > 50 kb were selected. The non-repetitive sequence flanking each side of a gap was used as a query to search the raw draft assembly and the corrected long reads using blastn and MUMmer v3.5 (Kurtz et al. 2004). The ends of the super-scaffolds were used as queries against the corrected reads to find reads containing telomeric repeats. We used mummerplot v3.5 (Kurtz et al. 2004) to perform a whole-genome alignment of the two curated assemblies to check their consistency. Finally, we used the curated nanopore assembly (ONT_curated) as a template and filled the remaining gaps using the PacBio assembly. The Hi-C heatmap of our final assembly (NIG_v1.0) was plotted and visualized with hicPlotMatrix v3.6 (Wolff et al. 2020).
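The read selection and telomere screening described above can be approximated as follows. This is a minimal sketch, not the authors' procedure (which relied on blastn and MUMmer alignments). The function names are hypothetical, and the telomeric motif `TTAGG`, the minimum copy number, and the end window are all assumptions chosen for illustration; the source does not state which repeat was searched for.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def select_long_reads(reads, min_len):
    """Keep reads strictly longer than min_len (e.g. 30,000 or 50,000)."""
    return {name: seq for name, seq in reads.items() if len(seq) > min_len}


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_telomeric_run(seq, motif="TTAGG", min_copies=10, window=2000):
    """Report whether either read end carries a tandem run of the motif.

    motif, min_copies and window are illustrative assumptions.
    Both strands are checked by also searching the reverse complement.
    """
    forward = "(?:" + re.escape(motif) + "){" + str(min_copies) + ",}"
    reverse = "(?:" + re.escape(revcomp(motif)) + "){" + str(min_copies) + ",}"
    pattern = re.compile(forward + "|" + reverse)
    # The "N" separator prevents a run from spanning the two end windows.
    ends = seq[:window] + "N" + seq[-window:]
    return pattern.search(ends) is not None


# Toy reads: one ends in a motif run, one starts with the reverse
# complement of the motif, and one has no telomeric repeat at all.
telomeric_read = "ACGT" * 100 + "TTAGG" * 12
revcomp_read = "CCTAA" * 12 + "ACGT" * 100
plain_read = "ACGT" * 500
```

A regular expression only finds perfect repeats, so error-containing reads would need an alignment-based or mismatch-tolerant search; restricting the screen to corrected reads, as in the text, reduces that problem.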

Genome annotation

We first masked tandem repeats using Tandem Repeats Finder (Benson 1999). Extensive de-novo TE Annotator (EDTA) v1.9.4 (Ou et al. 2019) and RepeatModeler v2.02 (Flynn et al. 2020) were used for de novo prediction of transposable elements (TEs). The results were combined with the insect transposons in Dfam3.2 and Petersen's insect TE repertoire (Petersen et al. 2019), and cd-hit v4.8.1 (Fu et al. 2012) was used to remove redundancy. This data set was then used as a library to search for TEs in the N. ignobilis genome using RepeatMasker v4.1.2 (Tempel 2012).

We integrated gene models generated from transcript alignment, homology searching, and ab initio prediction to annotate protein-coding genes (PCGs). For transcript evidence, we ran the Iso-Seq analysis in Smrtlink v10.1 on the PacBio Iso-Seq subreads to generate full-length cDNAs. The cDNAs were mapped to the genome using minimap2, and transcript structures were predicted with Stringtie v2.7.1 (Pertea et al. 2015). The RNA-seq short reads were used to predict transcript structures with hisat2 v2.2.1 (Kim et al. 2019) and Stringtie. The transcript structures from long and short reads were integrated using Pasa v2.3.3 (Haas et al. 2003). For homology searching, the annotations of D. melanogaster, A. pisum, A. mellifera, T. castaneum, and B. mori from the NCBI RefSeq database were used as queries. Exonerate v2.4.0 (Slater and Birney 2005), GenomeThreader v1.7.3 (Gremme et al. 2013), and GeMoMa v1.7 (Keilwagen et al. 2016) were used to find homologous gene models in N. ignobilis. For ab initio prediction, the high-quality transcripts were used to generate gene models for training and testing, following Hoff's pipeline with the scripts in PASA and Augustus (Hoff and Stanke 2019). After training on the high-quality gene models, we used Augustus v3.3.2 (Stanke et al. 2006), GlimmerHMM (Majoros et al. 2004), Genemark-ES v4.57 (Lomsadze et al. 2005), and Snap v2006-07-28 (Korf 2004) to predict ab initio gene models. Finally, the gene models from transcript alignment, homology searching, and ab initio prediction were integrated to generate the final PCG annotation using EVM v1.1.1 (Haas et al. 2008). Isoforms, 5'-UTRs, and 3'-UTRs were annotated using Pasa. The completeness of the annotation was assessed using BUSCO. PCG functions were annotated using Eggnog v5.0.2 (Huerta-Cepas et al. 2019). Peptide domains were annotated using InterProScan v5.52 (Jones et al. 2014). Gene expression levels in N. ignobilis adults and larvae were estimated using RSEM v1.3.3 (Li and Dewey 2011).

For non-coding RNA annotation, tRNAs were predicted with tRNAscan-SE v2.0.5 (Chan et al. 2021), and rRNAs, snRNAs, and snoRNAs were searched for using cmscan v1.1.3 (Nawrocki and Eddy 2013). miRNAs were identified from the small RNA-seq library using miRdeep2 v0.1.3 (Friedländer et al. 2012) and combined with miRNAs identified by homology searching with blastn, using insect miRNA repertoires as queries (Ma et al. 2021).
