Genome sequencing of Pachypeltis micranthus Mu et Liu (Hemiptera: Miridae), a potential biological control agent for Mikania micrantha
Yang, Bin (2021), Genome sequencing of Pachypeltis micranthus Mu et Liu (Hemiptera: Miridae), a potential biological control agent for Mikania micrantha, Dryad, Dataset, https://doi.org/10.5061/dryad.0k6djhb18
The plant bug, Pachypeltis micranthus Mu et Liu (Hemiptera: Miridae), is a potential biological control agent for Mikania micrantha H.B.K. (Asteraceae), one of the most invasive weeds worldwide. To date, only a few studies have investigated plant bugs. Here, we performed a chromosome-level genome assembly of P. micranthus using MGISEQ-2000 short-read, Nanopore and PacBio long-read, and high-throughput chromosome conformation capture (Hi-C) techniques. The assembled genome was 712.72 Mb in size, with a contig N50 of 16.84 Mb. Using the Hi-C technique, 71 scaffolds were assembled into 15 chromosomes, accounting for 99.96%. We predicted 11,746 protein-coding genes in P. micranthus with 96.20% complete benchmarking universal single-copy orthologs. Phylogenomic analysis showed that P. micranthus and two other Miridae bugs (Apolygus lucorum and Nesidiocoris tenuis) diverged from their common ancestor approximately 200.01 million years ago. Chromosome synteny analysis between P. micranthus and A. lucorum indicated high-level synteny. Many gene families, including chemosensory genes and digestive and detoxification enzyme genes, were significantly expanded in the P. micranthus genome. These expansions may reflect the bug's adaptation to a single host plant. This high-quality chromosome-level genome assembly provides an invaluable resource for further molecular and evolutionary research on mirid bugs and also provides a basis for further research on biological control mechanisms for M. micrantha.
Male adults of Pachypeltis micranthus were collected five days after emergence and used for genome sequencing. DNA was extracted from these male adult bugs, and its quality and purity were determined. The qualified DNA was used to construct the sequencing libraries. Next-generation sequencing was performed using the MGISEQ-2000 (MGI, China) platform. Oxford Nanopore GridION X5 (Oxford Nanopore Technologies, UK) and PacBio Sequel Ⅱ (Pacific Biosciences, CA, USA) platforms were used for third-generation sequencing. The high-throughput chromosome conformation capture (Hi-C) library was sequenced using the Illumina NovaSeq (Illumina, CA, USA) platform. Three RNA-seq libraries of male adult bugs were constructed to annotate the genomic structure of P. micranthus; these were sequenced using an Illumina HiSeq 2000 (Illumina, CA, USA) platform. All raw sequencing data were filtered using fastp v0.20.0 (Chen, Zhou, Chen, & Gu, 2018).
We obtained 113.67 Gb of MGISEQ-2000 short reads, 75.52 Gb of Nanopore long reads, and 12.36 Gb of PacBio long reads, with 154.80X, 98.87X, and 13.75X genome coverage, respectively. The filtered Nanopore reads were used for genome assembly with the NextDenovo package (https://github.com/Nextomics/NextDenovo). The original subreads were self-corrected using the NextCorrect module. In total, 13.6 Gb of consensus sequences was obtained and used for preliminary genome assembly with the NextGraph module. The preliminary assembly had a total size of 710.81 Mb and a contig N50 of 16.82 Mb. Finally, the contigs were corrected three times with NextPolish v1.3.0 (Hu, Fan, Sun, & Liu, 2020) using Nanopore long reads and refined four times with the Racon module using the MGISEQ-2000 short reads. The final polished genome size was 712.72 Mb with a contig N50 of 16.84 Mb. Assessments with Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.0.5 (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015), BWA v0.7.12 (Li & Durbin, 2010), and CEGMA v2 (Parra, Bradnam, & Korf, 2007) indicated that the genome assembly was complete and accurate. Minimap2 v41 (Li, 2016) results indicated that the genome was not contaminated.
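For readers unfamiliar with the statistic, the contig N50 is the length at which contigs of that length or longer contain at least half of the assembled bases. The following is a minimal Python sketch of that definition; the function name `n50` and the toy contig lengths are hypothetical illustrations, not values from this dataset or code from the original pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy contig lengths (hypothetical, not taken from the assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70, whereas the median length is 40
```

Because N50 weights each contig by its length, it is usually larger than the median contig length, as in the toy example.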
The clean RNA-seq reads were mapped to the P. micranthus genome using Hisat2 v2.1.0 (Kim, Langmead, & Salzberg, 2015). The fragments per kilobase of transcript per million mapped fragments (FPKM) values were calculated using StringTie (Pertea et al., 2015). Combining three gene prediction approaches (RNA-seq evidence, de novo prediction, and homology detection), we predicted 11,746 protein-coding genes with an average transcript length of 32,170.81 bp, an average CDS length of 1,515.18 bp, an average exon length of 193.96 bp, an average of 7.82 exons per gene, and an average intron length of 4,496.87 bp.
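As a reminder of what the FPKM unit expresses, the textbook definition can be written in a few lines of Python. This is only a sketch of the standard formula with hypothetical input values; StringTie estimates fragment counts per transcript internally, and its calculation is more involved than this.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```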
We annotated the tandem repeats (TRs) and transposable elements (TEs) in the P. micranthus genome. First, TRs were detected using Tandem Repeats Finder (TRF, v4.07) (Benson, 1999) and GMATA v2.2 (Wang & Wang, 2016), both with default parameters. TRF was used to search for TRs in general, and GMATA was mainly used to search for simple sequence repeats (SSRs), a subclass of TRs. For TE identification, we constructed a de novo repeat library using RepeatModeler v1.0.11 with default parameters (Bedell, Korf, & Gish, 2000). Known and novel TEs in the bug genome were identified by aligning the proteins against the de novo repeat library and the Repbase TE library with RepeatMasker v1.331 (Bedell et al., 2000). In total, we identified 375.82 Mb (52.73% of the genome) of repeat sequences, which were classified into 12 types. Among these, DNA transposons (19.19% of the genome) and long interspersed nuclear elements (18.16% of the genome) accounted for the majority of the TEs, which together made up 48.39% of the genome. A total of 20,783 SSRs (0.03% of the genome) were identified in the TRs (0.02% of the genome).
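To illustrate what an SSR is, the sketch below finds perfect repeats of a short motif with a regular expression. It is not the algorithm used by GMATA or TRF; the function name `find_ssrs`, its thresholds (motifs of 2 to 6 bp repeated at least 5 times), and the toy sequence are all assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_motif=2, max_motif=6, min_repeats=5):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    A motif of min_motif..max_motif bases must occur at least
    min_repeats times in a row. The lazy quantifier makes the regex
    try the shortest motif first at each position.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Toy sequence (hypothetical): six copies of "AT" flanked by other bases.
print(find_ssrs("GGCATATATATATATCCG"))  # [(3, 'AT', 6)]
```

Dedicated tools additionally handle imperfect repeats, mononucleotide runs, and genome-scale input, which this sketch does not attempt.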
For chromosome-level genome assembly, we obtained 112 Gb of raw Hi-C data. After filtering, 110.96 Gb of clean data was generated. The high-quality reads were mapped to the assembled genome using Bowtie2 v2.3.2 (sensitive end-to-end mode) (Langmead & Salzberg, 2012). The uniquely mapped paired-end reads were used to identify and retain valid interaction pairs using HiC-Pro v2.8.1 (Servant et al., 2015). Based on the valid interaction pairs, the scaffolds were ordered, oriented, and joined into chromosomes using LACHESIS (https://github.com/shendurelab/LACHESIS) (Burton et al., 2013). Finally, we manually corrected position and orientation errors that showed significant chromatin interactions. A total of 71 scaffolds (total length: 712.46 Mb) were anchored, ordered, and oriented into 15 chromosomes (total length: 707.51 Mb), accounting for 99.27% of the assembled P. micranthus genome.
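The anchoring rate can be checked with simple arithmetic. The sketch below assumes the denominator is the final polished assembly size of 712.72 Mb reported for this genome; the helper name `percent_of` is hypothetical. Under that assumption the reported 99.27% is reproduced, whereas dividing by the total scaffold length gives a slightly different figure.

```python
def percent_of(part_mb, whole_mb):
    """Return part as a percentage of whole, rounded to two decimals."""
    if whole_mb <= 0:
        raise ValueError("whole_mb must be positive")
    return round(100 * part_mb / whole_mb, 2)


chromosome_total_mb = 707.51   # total length of the 15 chromosomes
polished_assembly_mb = 712.72  # final polished assembly size (assumed denominator)
scaffold_total_mb = 712.46     # total length of the 71 scaffolds

print(percent_of(chromosome_total_mb, polished_assembly_mb))  # 99.27
print(percent_of(chromosome_total_mb, scaffold_total_mb))     # 99.31
```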
Bedell, J. A., Korf, I., & Gish, W. (2000). MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics, 16(11), 1040-1041. doi:10.1093/bioinformatics/16.11.1040
Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2), 573-580. doi:10.1093/nar/27.2.573
Burton, J. N., Adey, A., Patwardhan, R. P., Qiu, R., Kitzman, J. O., & Shendure, J. (2013). Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology, 31(12), 1119-1125. doi:10.1038/nbt.2727
Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890. doi:10.1093/bioinformatics/bty560
Hu, J., Fan, J., Sun, Z., & Liu, S. (2020). NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics, 36(7), 2253-2255. doi:10.1093/bioinformatics/btz891
Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357-360. doi:10.1038/nmeth.3317
Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357-359. doi:10.1038/nmeth.1923
Li, H. (2016). Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), 2103-2110. doi:10.1093/bioinformatics/btw152
Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5), 589-595. doi:10.1093/bioinformatics/btp698
Parra, G., Bradnam, K., & Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23(9), 1061-1067. doi:10.1093/bioinformatics/btm071
Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T.-C., Mendell, J. T., & Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology, 33(3), 290-295. doi:10.1038/nbt.3122
Servant, N., Varoquaux, N., Lajoie, B. R., Viara, E., Chen, C. J., Vert, J. P., . . . Barillot, E. (2015). HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology, 16, 259. doi:10.1186/s13059-015-0831-x
Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210-3212. doi:10.1093/bioinformatics/btv351
Wang, X., & Wang, L. (2016). GMATA: an integrated software package for genome-scale SSR mining, marker development and viewing. Frontiers in Plant Science, 7, 1350. doi:10.3389/fpls.2016.01350
Notes on the files:
1Pachypeltis_micranthus.contigs.fasta: The contig sequences of the assembled Pachypeltis micranthus genome.
2Pachypeltis_micranthus.cds.fasta: The predicted coding sequences (CDS) of Pachypeltis micranthus.
3Pachypeltis_micranthus.pep.fasta: The predicted protein sequences of Pachypeltis micranthus.
4Pachypeltis_micranthus.chromoses.fasta: The chromosome sequences of Pachypeltis micranthus.
5Pachypeltis_micranthus.genome.gff: The feature annotations of the Pachypeltis micranthus genome.
6Pachypeltis_micranthus.repeat.masked.fasta: The annotated tandem repeats (TRs) and transposable elements (TEs) in the P. micranthus genome.
7Pachypeltis_micranthus.repeat.gff: The feature annotations of the tandem repeats (TRs) and transposable elements (TEs) in the P. micranthus genome.
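The FASTA files listed above can be summarised with a few lines of Python. The sketch below is a minimal reader written for illustration, not part of the original pipeline; the function name `fasta_lengths` and the toy records are hypothetical.

```python
def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, length
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name, length = (header[0] if header else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


# Toy input (hypothetical records, not taken from the dataset).
toy = [">contig1 example", "ACGT", "AC", "", ">contig2", "GGG"]
records = list(fasta_lengths(toy))
print(records)                         # [('contig1', 6), ('contig2', 3)]
print(sum(n for _, n in records))      # 9 bases in total
```

To summarise one of the downloaded files, pass an open file handle instead of the toy list, for example `with open(path) as handle: records = list(fasta_lengths(handle))`, where `path` points to the local copy of the file.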
Province Key R&D Program of Yunnan, Award: 2018BB009