Genome sequence and silkomics of the spindle ermine moth, Yponomeuta cagnagella, representing the early diverging lineage of the ditrysian Lepidoptera
Data files
Nov 29, 2022 version files 966.41 MB
Abstract
Many lepidopteran caterpillars produce silk, cocoons, feeding tubes, or nests for protection from predators and parasites. Yet the number of lepidopteran species whose silk composition has been studied in detail is very small, because the genes encoding the major structural silk proteins tend to be large and repetitive, which makes their assembly and sequence analysis difficult. Here we have analyzed the silk of Yponomeuta cagnagella, which represents one of the early diverging lineages of the ditrysian Lepidoptera, thus improving the coverage of the order. To obtain a comprehensive list of the Y. cagnagella silk genes, we sequenced, assembled, and annotated the draft genome using Oxford Nanopore and Illumina technologies. The 626 Mb assembly, with an N50 of 96.5 kb, contained 96.9% of the insect orthologs assessed by BUSCO and 30,003 predicted gene models. We then used a silk-gland transcriptome and a silk proteome to identify major silk components and verified the tissue specificity of the expression of individual genes.
Methods
To assemble the genome of Y. cagnagella, Oxford Nanopore reads were sequenced on the Nanopore PromethION platform. In addition, an Illumina library with a 700 bp insert size was prepared and sequenced on the Illumina HiSeq 2500 with 250 bp paired-end reads. The raw reads were deposited in NCBI under SRA accession numbers SRR15714088 and SRR15714089.
For genome annotation, RNA from heads, thoraces, and gonads of three male and female imagoes was extracted with TRI-Reagent. Biological replicates were pooled prior to isolation, resulting in three tissue-specific samples per sex.
First, adaptor sequences and low-quality bases were filtered out of the Illumina data using Trimmomatic. Similarly, Nanopore reads shorter than 500 bp or with a quality score lower than 7 were removed from the dataset with NanoFilt. Next, the FM-index Long Read Corrector (FMLRC) was used with default settings to correct the long reads using the filtered Illumina sequences. As recommended, ropebwt2 and fmlrc-convert were used to construct the multi-string BWT data structure required by the FMLRC pipeline. The preprocessed long reads were then assembled with Flye.
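The long-read filtering criteria above can be illustrated with a minimal Python sketch. It is not NanoFilt's code. It assumes that the per-read quality is the mean Phred score obtained by averaging error probabilities, that records are plain four-line FASTQ, and that qualities use the Phred+33 encoding. The function names are hypothetical.

```python
import math


def mean_phred(quality_string, offset=33):
    """Mean read quality, averaged over error probabilities (assumed definition)."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence, quality_string, min_length=500, min_quality=7):
    """A read is kept only if it passes both the length and the quality threshold."""
    return len(sequence) >= min_length and mean_phred(quality_string) >= min_quality


def filter_fastq(lines, min_length=500, min_quality=7):
    """Yield (header, sequence, quality) for passing reads from four-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, _, qual = records[i:i + 4]
        if keep_read(seq, qual, min_length, min_quality):
            yield header, seq, qual
```

A read is therefore discarded if it fails either threshold, which corresponds to removing reads that are too short or of too low quality.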
To eliminate haplotypic duplications from the primary assembly, the purge_dups pipeline was applied, followed by polishing with POLCA. Repeat composition and average GC content were analysed with the RepeatModeler and RepeatMasker software packages. To achieve more accurate masking, major satellites were identified with TAREAN from Illumina data subsampled to 0.25× genome coverage. A custom repeat library, built from the genome sequence with RepeatModeler and supplemented with satellite dimers, was used in the RepeatMasker pipeline to survey the landscape of repetitive elements and to generate a masked version of the Y. cagnagella assembly.
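As a rough worked example of the 0.25× subsampling, the number of reads needed follows from coverage × genome size / read length. The sketch below assumes that the 626 Mb assembly size stands in for the genome size and that reads are the full 250 bp. The genome size estimate actually used for subsampling is not stated here, so the resulting numbers are illustrative only.

```python
def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Number of reads giving the requested coverage: coverage * G / L."""
    return round(coverage * genome_size_bp / read_length_bp)


genome_size_bp = 626_000_000  # assumption: assembly size used as genome size
read_length_bp = 250          # assumption: untrimmed HiSeq 2500 read length

n_reads = reads_for_coverage(genome_size_bp, 0.25, read_length_bp)
n_pairs = n_reads // 2        # paired-end data: two reads per fragment

print(n_reads, n_pairs)
```

Under these assumptions, about 626,000 reads (313,000 read pairs) correspond to 0.25× coverage.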
To annotate the assembly, all RNA-seq data, including the silk gland RNA-seq (see below), were concatenated into a single dataset, and the quality of the sequencing was verified using FastQC. The resulting data were aligned to the masked genome assembly using STAR. The genome index was generated with the following parameter scaled down to the size of the Y. cagnagella genome: “--genomeSAindexNbases 13”. Genes were predicted with BRAKER and annotated using BLASTp against the NCBI RefSeq invertebrate protein database, all implemented in the GenSAS platform.
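The value 13 is consistent with the scaling rule given in the STAR manual, min(14, log2(GenomeLength)/2 - 1). The sketch below reproduces this arithmetic. It assumes that the 626 Mb assembly length is used as the genome length and that the result is rounded down to an integer. The function name is hypothetical.

```python
import math


def genome_sa_index_nbases(genome_length_bp):
    """STAR manual rule of thumb: min(14, log2(GenomeLength)/2 - 1), rounded down."""
    return int(min(14, math.log2(genome_length_bp) / 2 - 1))


# assumption: assembly length (626 Mb) taken as the genome length
print(genome_sa_index_nbases(626_000_000))
```

For 626 Mb, log2(GenomeLength)/2 - 1 is about 13.6, which rounds down to 13.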
For analysis of silk genes, total RNA from the silk glands of last-instar larvae was isolated using TRIzol reagent, mRNA was then isolated using the Dynabeads Oligo (dT)25 mRNA Purification Kit, and cDNA was prepared using the NEXTflex Rapid RNA-Seq Kit. The cDNA library was sequenced on an Illumina MiSeq with 2×150 bp paired-end reads. Read quality was inspected visually using FastQC. Adaptor removal and trimming were performed using BBDuk. A further rRNA decontamination step was conducted using BBDuk with the associated ribokmers.fa file to eliminate rRNA remaining after the mRNA enrichment step of library preparation. Cleaned reads were assembled into a transcriptome using the multi k-mer rnaSPAdes assembler. K-mer sizes of 25, 35, 45, 55, 65, and 75 were chosen for de novo assembly to increase the likelihood of maximum transcript recovery.
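The k-mer series used for the transcriptome assembly can be generated and sanity-checked with a short sketch. The helper name is hypothetical, and the constraints it checks (odd values, below 128, ascending order, passed as a comma-separated list) are assumptions about SPAdes-family assemblers that should be verified against the documentation of the installed version.

```python
def spades_kmer_argument(kmer_sizes, max_k=127):
    """Validate k-mer sizes and join them into a comma-separated string."""
    ks = list(kmer_sizes)
    for k in ks:
        if k % 2 == 0:
            raise ValueError(f"k-mer size {k} is even; odd values are expected")
        if k > max_k:
            raise ValueError(f"k-mer size {k} exceeds the assumed maximum {max_k}")
    if ks != sorted(ks):
        raise ValueError("k-mer sizes are expected in ascending order")
    return ",".join(str(k) for k in ks)


# 25, 35, 45, 55, 65, 75: steps of 10 starting at 25
kmers = range(25, 76, 10)
print(spades_kmer_argument(kmers))
```

All six values are shorter than the 150 bp reads, so every k-mer size can in principle be sampled from a full-length read.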