Genes for major ribosomal RNAs (rDNA) are present in multiple copies organized in tandem arrays. The number and position of rDNA loci can change dynamically, and their re-patterning is presumably driven by repetitive sequences. We explored a peculiar rDNA organization in several representatives of Lepidoptera with either extremely large or numerous rDNA clusters. We combined molecular cytogenetics with analyses of second- and third-generation sequencing data to show that rDNA spreads as a transcription unit and to reveal an association between rDNA and various repeats. Furthermore, we performed comparative long-read analyses between the species with derived rDNA distribution and moths with a single rDNA locus, which is considered ancestral. Our results suggest that satellite arrays, rather than mobile elements, facilitate homology-mediated spread of rDNA via either integration of extrachromosomal rDNA circles or ectopic recombination. The latter arguably better explains the preferential spread of rDNA into terminal regions of lepidopteran chromosomes, as the efficiency of ectopic recombination depends on the proximity of homologous sequences to telomeres.

Repeat Explorer analysis

For analysis of repetitive DNA content, whole gDNA was sequenced on the Illumina platform, generating either 150 bp paired-end (PE) reads from a library with a mean insert size of 450 bp (Novogene Co., Ltd., Beijing, China) or, in the case of C. ohridella, 250 bp PE reads with a mean insert size of 700 bp (Genomics Core Facility, EMBL Heidelberg, Germany). The raw reads were quality filtered and trimmed to a uniform length of 120 bp (230 bp for C. ohridella) by Trimmomatic 3.2 (Bolger et al., 2014). A random sample of two million (one million for C. ohridella) trimmed PE reads was analysed by the Repeat Explorer (RE) pipeline (version cerit-v0.3.1-2706) implemented in the Galaxy environment (https://repeatexplorer-elixir.cerit-sc.cz/galaxy/) with automatic annotation via blastn and blastx using the Metazoan 3 Repeat Explorer database. The resulting html files were searched for clusters annotated as major rDNA and for their connections to other clusters.
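The trimming and subsampling step can be illustrated with a minimal Python sketch. It is not the Trimmomatic or RE implementation; the function names, the in-memory representation of read pairs as tuples of strings, and the fixed random seed are assumptions made for illustration only.

```python
import random

def trim_pair(read1, read2, length):
    """Cut both mates to `length`; return None if either mate is shorter."""
    if len(read1) < length or len(read2) < length:
        return None
    return read1[:length], read2[:length]

def sample_trimmed_pairs(pairs, length, n, seed=0):
    """Trim all pairs to a uniform length, then draw n pairs without replacement.

    `pairs` is an iterable of (mate1, mate2) sequence strings (hypothetical input
    format). Pairs with a mate shorter than `length` are discarded, so every
    sampled read has exactly the same length.
    """
    trimmed = []
    for read1, read2 in pairs:
        result = trim_pair(read1, read2, length)
        if result is not None:
            trimmed.append(result)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(trimmed, min(n, len(trimmed)))
```

For the data described above, `length` would be 120 (230 for C. ohridella) and `n` two million (one million for C. ohridella). Sampling whole pairs rather than single reads keeps mates together for the downstream clustering.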

Long read sequencing and analysis

High molecular weight DNA from H. humuli was enriched for fragments longer than 10 kbp by Short Read Eliminator (Circulomics Inc). The library was prepared by Ligation Sequencing Kit SQK-LSK110 (Oxford Nanopore Technologies, Oxford, UK) according to the manufacturer’s protocol, using the third-party consumables recommended therein. The library was snap-frozen, stored overnight at -70°C, and then sequenced using flowcell R10.3 and MinION Mk1B (Oxford Nanopore Technologies). Reads were basecalled by Guppy 4.4.1 with the high accuracy flip-flop algorithm. The data were filtered for reads 15 kbp and longer with a quality score over 10 using NanoFilt (De Coster et al., 2018).
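The length and quality filter can be sketched in plain Python as follows. This is an illustration of the filtering criteria, not NanoFilt itself. It assumes four-line FASTQ records with Phred+33 encoding and averages quality through error probabilities rather than raw Q values; all function names are hypothetical.

```python
import math

def mean_quality(phred_scores):
    """Mean read quality computed from per-base error probabilities."""
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)

def parse_fastq(lines):
    """Yield (name, sequence, phred_scores) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip stray blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        name = header.strip()[1:].split()[0]
        yield name, seq, [ord(c) - 33 for c in qual]

def filter_reads(records, min_length=15_000, min_quality=10):
    """Keep reads at least min_length long with mean quality above min_quality."""
    for name, seq, quals in records:
        if len(seq) >= min_length and mean_quality(quals) > min_quality:
            yield name, seq, quals
```

Averaging error probabilities penalises reads with a few very poor bases more strongly than an arithmetic mean of Q values would: a read with scores 20 and 0 has an arithmetic mean of 10 but a probability-based mean of about 3.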

Quality- and length-filtered reads were searched for the presence of major rDNA using blastn. Reads containing at least 1000 bp of the H. humuli major rDNA unit were assembled by Flye 2.8 (Kolmogorov et al., 2019) using a minimal overlap of 8 kbp. Mobile elements (MEs) were annotated by RepeatMasker 4.1.2-p1 (Smit et al., 2013) protein-based masking. Tandem repeats were identified based on the self Dotplot implemented in Geneious 11.1.5. Consensus sequences of all identified ME fragments, together with the major rDNA unit, were mapped to individual rDNA-bearing nanopore reads using minimap2 (Li, 2018) with the appropriate preset. The presence and relative localization of individual elements were evaluated via an R script (R version 4.0.3 in Rstudio version 1.4.1103); only regions with mapping quality of at least 20 were considered.
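The read selection and the mapping-quality filter can be sketched in Python as below. The authors' own analysis used an R script (R_rDNA_long_reads.html), so this is only an illustration under stated assumptions: blastn results are in tabular format (-outfmt 6) with reads as the query, minimap2 output is in PAF format with consensus sequences as the query and reads as the target, and all function names are hypothetical.

```python
from collections import defaultdict

def covered_bp(intervals):
    """Total length of the union of 1-based inclusive intervals."""
    total, current_end = 0, 0
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if start > current_end:      # new, non-overlapping stretch
            total += end - start + 1
            current_end = end
        elif end > current_end:      # extends the current stretch
            total += end - current_end
            current_end = end
    return total

def reads_with_rdna(blast_lines, min_bp=1000):
    """Read IDs whose merged blastn hits cover at least min_bp of the read.

    Columns 7 and 8 of -outfmt 6 (qstart, qend) are taken as read coordinates.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue
        hits[fields[0]].append((int(fields[6]), int(fields[7])))
    return {read for read, iv in hits.items() if covered_bp(iv) >= min_bp}

def elements_on_read(paf_lines, min_mapq=20):
    """Per read, (start, end, strand, element) tuples sorted by position.

    PAF columns used: 1 query name, 5 strand, 6 target name, 8-9 target
    start and end, 12 mapping quality. Hits below min_mapq are dropped.
    """
    placed = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or int(fields[11]) < min_mapq:
            continue
        placed[fields[5]].append(
            (int(fields[7]), int(fields[8]), fields[4], fields[0])
        )
    return {read: sorted(v) for read, v in placed.items()}
```

Merging overlapping hits before summing avoids counting the same stretch of a read twice when blastn reports several partially overlapping alignments. Sorting elements by their start on the read gives their relative order and orientation along that read.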

Phymatopus californicus gDNA was sequenced on the Oxford Nanopore platform by Novogene Co., Ltd. PacBio HiFi reads of I. io (project PRJEB42130) and A. urticae (project PRJEB42112) were obtained through the Darwin Tree of Life project (http://www.darwintreeoflife.org). PacBio CLR data were obtained from the Sequence Read Archive (SRA) database (S. frugiperda SRR12642577; L. dispar SRR13505170-6, SRR13505182-3, and SRR13505187; P. xylostella SRR13530960). The reads were then processed in the same way as for H. humuli, except for the HiFi reads, which were not quality filtered.

A similar approach was also used to detect rDNA and associated repetitive DNA in the A. urticae chromosome-level genome assembly (Bishop et al., 2021) (ENA acc. No. PRJEB41896).

Coverage analysis

Coverage analysis was done by aligning genomic Illumina sequencing reads from H. humuli, I. io, and A. urticae to consensus sequences using the Bowtie2 aligner (Langmead et al., 2019; Langmead and Salzberg, 2012). The consensus sequences were generated either by overlapping the contigs from RE in Geneious 11.1.5 or by the Flye 2.8 assembler. Coverage values were obtained using samtools depth (v 1.10) (Li et al., 2009) and plotted using a script in R (R version 4.1.0 in Rstudio Workbench Version 1.4.1717-3). Mean coverage of the defined annotation blocks shown in Figure 3 was computed using R and is given in Suppl. Tables 3.
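The per-block mean coverage can be sketched in Python as follows. The original computation was done in R (R_cov_analysis.html), so this is an illustration only. It assumes the three-column, tab-separated output that samtools depth produces for a single BAM file (reference name, 1-based position, depth); the block names and coordinates in the usage example are hypothetical.

```python
def mean_block_coverage(depth_lines, blocks):
    """Mean depth per annotation block.

    `blocks` maps a block name to (reference, start, end), with 1-based
    inclusive coordinates. The depth sum is divided by the block length,
    not by the number of reported positions, because samtools depth omits
    zero-coverage positions unless it is run with -a.
    """
    sums = {name: 0 for name in blocks}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        ref, pos, depth = fields[0], int(fields[1]), int(fields[2])
        for name, (block_ref, start, end) in blocks.items():
            if ref == block_ref and start <= pos <= end:
                sums[name] += depth
    return {
        name: sums[name] / (end - start + 1)
        for name, (_, start, end) in blocks.items()
    }

# Usage with made-up values: position 3 is missing from the depth output,
# so it counts as zero coverage in the first block.
depth = ["unit\t1\t10", "unit\t2\t20", "unit\t4\t30", "unit\t5\t40"]
blocks = {"block_A": ("unit", 1, 3), "block_B": ("unit", 4, 5)}
coverage = mean_block_coverage(depth, blocks)
```

Comparing the mean coverage of different blocks of the same consensus sequence indicates which parts of the rDNA unit or its flanking repeats are present in more copies than others.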

Au_assembly_rDNA.fasta: A. urticae rDNA region assembly

Au_rDNA_reads.fasta: A. urticae PacBio reads containing at least 200 bp rDNA based on blast results

Au_sat_only_reads.fasta: A. urticae PacBio reads containing at least 200 bp AuSat but no rDNA or other IGS parts

Hh_assembly_rDNA.fasta: H. humuli rDNA region assembly

Hh_rDNA_reads.fasta: H. humuli Oxford Nanopore reads containing at least 200 bp rDNA based on blast results

Ii_assembly_rDNA.fasta: I. io rDNA region assembly

Ii_rDNA_reads.fasta: I. io PacBio reads containing at least 200 bp rDNA based on blast results

Ii_sat_only_reads.fasta: I. io PacBio reads containing at least 200 bp IiSat but no rDNA or other IGS parts

Ld_assembly_rDNA.fasta: L. dispar rDNA region assembly

Ld_rDNA_reads.fasta: L. dispar PacBio reads containing at least 200 bp rDNA based on blast results

Pc_assembly_rDNA.fasta: P. californicus rDNA region assembly

Pc_rDNA_reads.fasta: P. californicus Oxford Nanopore reads containing at least 200 bp rDNA based on blast results

Px_assembly_rDNA.fasta: P. xylostella rDNA region assembly

Px_rDNA_reads.fasta: P. xylostella PacBio reads containing at least 200 bp rDNA based on blast results

R_cov_analysis.html: R code used for Illumina rDNA unit coverage analysis

R_rDNA_long_reads.html: R code used for long-read analysis of rDNA and associated repeats

Sf_assembly_rDNA.fasta: S. frugiperda rDNA region assembly

Sf_rDNA_reads.fasta: S. frugiperda PacBio reads containing at least 200 bp rDNA based on blast results

Data for: The role of repetitive DNA in re-patterning of major rDNA clusters in Lepidoptera
