
Genome sequencing of Pachypeltis micranthus Mu et Liu (Hemiptera: Miridae), a potential biological control agent for Mikania micrantha

Cite this dataset

Yang, Bin (2021). Genome sequencing of Pachypeltis micranthus Mu et Liu (Hemiptera: Miridae), a potential biological control agent for Mikania micrantha [Dataset]. Dryad.


The plant bug Pachypeltis micranthus Mu et Liu (Hemiptera: Miridae) is a potential biological control agent for Mikania micrantha H.B.K. (Asteraceae), one of the most invasive weeds worldwide. To date, only a few studies have investigated plant bugs. Here, we generated a chromosome-level genome assembly of P. micranthus using MGISEQ-2000 short-read, Nanopore and PacBio long-read, and high-throughput chromosome conformation capture (Hi-C) techniques. The assembled genome was 712.72 Mb in size, with a contig N50 of 16.84 Mb. Using the Hi-C technique, 71 scaffolds were assembled into 15 chromosomes, accounting for 99.96%. We predicted 11,746 protein-coding genes in P. micranthus, with 96.20% complete benchmarking universal single-copy orthologs. Phylogenomic analysis showed that P. micranthus and two other Miridae bugs (Apolygus lucorum and Nesidiocoris tenuis) diverged from their common ancestor approximately 200.01 million years ago. Chromosome synteny analysis between P. micranthus and A. lucorum indicated high-level synteny. Many gene families, including chemosensory genes and digestive and detoxification enzyme genes, were significantly expanded in the P. micranthus genome. These expansions may reflect the bug's adaptation to a single host plant. This high-quality chromosome-level genome assembly provides an invaluable resource for further molecular and evolutionary research on mirid bugs and a basis for further research on biological control mechanisms for M. micrantha.


Male adults of Pachypeltis micranthus were collected five days after emergence and used for genome sequencing. DNA was extracted from these male adult bugs, and its quality and purity were assessed. DNA that passed quality control was used to construct the sequencing library. Next-generation sequencing was performed using the MGISEQ-2000 (MGI, China) platform. Oxford Nanopore GridION X5 (Oxford Nanopore Technologies, UK) and PacBio Sequel Ⅱ (Pacific Biosciences, CA, USA) platforms were used for third-generation sequencing. The high-throughput chromosome conformation capture (Hi-C) library was sequenced using the Illumina Novaseq (Illumina, FL, USA) platform. Three RNA-seq libraries of male adult bugs were constructed to annotate the genomic structure of P. micranthus; these libraries were sequenced on an Illumina HiSeq 2000 (Illumina, FL, USA) platform. All raw sequencing data were filtered using Fastp v0.20.0 (Chen, Zhou, Chen, & Gu, 2018).

We obtained 113.67 Gb of MGISEQ-2000 short reads, 75.52 Gb of Nanopore long reads, and 12.36 Gb of PacBio long reads, with 154.80X, 98.87X, and 13.75X genome coverage, respectively. The filtered Nanopore reads were used for genome assembly with the NextDenovo package. The original subreads were self-corrected using the NextCorrect module. In total, 13.6 Gb of consensus sequences were obtained and used for preliminary genome assembly with the NextGraph module. The preliminary assembly had a size of 710.81 Mb and a contig N50 of 16.82 Mb. Finally, the contigs were corrected three times with NextPolish v1.3.0 (Hu, Fan, Sun, & Liu, 2020) using Nanopore long reads and refined four times with the Racon module using the MGISEQ-2000 short reads. The final polished genome size was 712.72 Mb, with a contig N50 of 16.84 Mb. Results from Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.0.5 (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015), BWA v0.7.12 (Li & Durbin, 2010), and CEGMA v2 (Parra, Bradnam, & Korf, 2007) showed that the genome assembly was complete and accurate. The Minimap2 v41 (Li, 2016) result indicated that the genome was not contaminated.
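As a reminder of what the contig N50 values above measure, the following is a minimal Python sketch of the usual N50 definition: the length L such that contigs of length L or longer together cover at least half of the total assembly. The function name and the toy contig lengths are hypothetical and are not taken from the dataset; assembly tools may report N50 with slightly different conventions.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, not from the dataset).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 70
```

Because N50 weights each contig by its length, it is usually much larger than the plain median contig length of the same assembly.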

Clean RNA-seq reads were mapped to the P. micranthus genome using Hisat2 v2.1.0 (Kim, Langmead, & Salzberg, 2015). Fragments per kilobase of transcript per million mapped fragments (FPKM) values were calculated using StringTie (Pertea et al., 2015). Combining three gene prediction methods (RNA-seq data, de novo prediction, and homology detection), we predicted 11,746 protein-coding genes with an average transcript length of 32,170.81 bp, an average CDS length of 1,515.18 bp, an average exon length of 193.96 bp, an average of 7.82 exons per gene, and an average intron length of 4,496.87 bp.
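For readers unfamiliar with the FPKM unit, the sketch below shows the textbook normalization only. It is not StringTie's internal abundance estimation, and the function name and example counts are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of transcript
    per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 500 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

The normalization divides out both transcript length and sequencing depth, so values are comparable across transcripts and, approximately, across libraries.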

We annotated the tandem repeats (TRs) and transposable elements (TEs) in the P. micranthus genome. First, TRs were detected using Tandem Repeats Finder (TRF, v4.07) (Benson, 1999) and GMATA v2.2 (Wang & Wang, 2016), both with default parameters. TRF was used to search for TRs in general, and GMATA was mainly used to search for simple sequence repeats (SSRs) among the TRs. For TE identification, we constructed a de novo repeat library using RepeatModeler v1.0.11 with default parameters (Bedell, Korf, & Gish, 2000). Known and novel TEs in the bug genome were identified by aligning the proteins against the de novo repeat library and the Repbase TE library with RepeatMasker v1.331 (Bedell et al., 2000). In total, we identified 375.82 Mb (52.73% of the genome) of repeat sequences, which were classified into 12 types. Among these, DNA transposons (19.19% of the genome) and long interspersed nuclear elements (18.16% of the genome) accounted for the majority of the TEs (48.39% of the genome). A total of 20,783 SSRs (0.03% of the genome) were identified in the TRs (0.02% of the genome).
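To illustrate what a simple sequence repeat is, here is a small Python sketch that finds perfect SSRs with a regular expression. It is an illustration only and does not reproduce the GMATA or TRF algorithms; the function name, the motif-length range (2 to 6 bp), the minimum of 5 copies, and the toy sequence are all assumptions made for this example.

```python
import re


def find_ssrs(sequence, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, copies) for each perfect SSR in `sequence`.

    Thresholds are illustrative assumptions, not GMATA defaults.
    """
    # Lazy unit match so the shortest repeating motif is tried first.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs matched as e.g. "AA" repeats
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Toy sequence (hypothetical): (AT)x6 at offset 3 and (AGC)x5 at offset 20.
toy = "GGC" + "AT" * 6 + "CCGTT" + "AGC" * 5 + "TT"
print(find_ssrs(toy))  # [(3, 'AT', 6), (20, 'AGC', 5)]
```

Dedicated tools additionally handle imperfect and compound repeats, which this sketch ignores.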

For chromosome-level genome assembly, we obtained 112 Gb of raw Hi-C data. After filtering, 110.96 Gb of clean data were retained. The high-quality reads were mapped to the assembled genome using Bowtie2 v2.3.2 (sensitive end-to-end mode) (Langmead & Salzberg, 2012). Uniquely mapped paired-end reads were processed with HiC-Pro v2.8.1 (Servant et al., 2015) to identify and retain valid interaction pairs. Based on the valid interaction pairs, the scaffolds were ordered, oriented, and joined into chromosomes using LACHESIS (Burton et al., 2013). Finally, we manually corrected position and orientation errors that were evident from the chromatin interaction patterns. A total of 71 scaffolds (total length: 712.46 Mb) were anchored, ordered, and oriented into 15 chromosomes (total length: 707.51 Mb), accounting for 99.27% of the assembled P. micranthus genome.
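The anchoring percentage can be checked with simple arithmetic. The sketch below, with a hypothetical helper name, shows that the reported 99.27% matches the chromosome length divided by the final polished genome size (712.72 Mb); dividing by the 712.46 Mb scaffold total instead would give about 99.31%. Which denominator was actually used is an inference from the numbers, not something the text states.

```python
def anchored_fraction(anchored_mb, assembly_mb):
    """Fraction of the assembly placed on chromosomes."""
    if assembly_mb <= 0:
        raise ValueError("assembly size must be positive")
    return anchored_mb / assembly_mb


chromosomes_mb = 707.51      # total length of the 15 chromosomes
polished_genome_mb = 712.72  # final polished genome size
scaffolds_mb = 712.46        # total length of the 71 scaffolds

print(f"{anchored_fraction(chromosomes_mb, polished_genome_mb):.2%}")  # 99.27%
print(f"{anchored_fraction(chromosomes_mb, scaffolds_mb):.2%}")        # 99.31%
```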


Bedell, J. A., Korf, I., & Gish, W. (2000). MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics, 16(11), 1040-1041. doi:10.1093/bioinformatics/16.11.1040

Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2), 573-580. doi:10.1093/nar/27.2.573

Burton, J. N., Adey, A., Patwardhan, R. P., Qiu, R., Kitzman, J. O., & Shendure, J. (2013). Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology, 31(12), 1119-1125. doi:10.1038/nbt.2727

Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890. doi:10.1093/bioinformatics/bty560

Hu, J., Fan, J., Sun, Z., & Liu, S. (2020). NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics, 36(7), 2253-2255. doi:10.1093/bioinformatics/btz891

Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357-360. doi:10.1038/nmeth.3317

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357-359. doi:10.1038/nmeth.1923

Li, H. (2016). Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), 2103-2110. doi:10.1093/bioinformatics/btw152

Li, H., & Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5), 589-595. doi:10.1093/bioinformatics/btp698

Parra, G., Bradnam, K., & Korf, I. (2007). CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23(9), 1061-1067. doi:10.1093/bioinformatics/btm071

Pertea, M., Pertea, G. M., Antonescu, C. M., Chang, T.-C., Mendell, J. T., & Salzberg, S. L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology, 33(3), 290-295. doi:10.1038/nbt.3122

Servant, N., Varoquaux, N., Lajoie, B. R., Viara, E., Chen, C. J., Vert, J. P., . . . Barillot, E. (2015). HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology, 16, 259. doi:10.1186/s13059-015-0831-x

Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31(19), 3210-3212. doi:10.1093/bioinformatics/btv351

Wang, X., & Wang, L. (2016). GMATA: an integrated software package for genome-scale SSR mining, marker development and viewing. Frontiers in Plant Science, 7, 1350. doi:10.3389/fpls.2016.01350

Usage notes

File descriptions:

1. Pachypeltis_micranthus.contigs.fasta: The contig sequences of the assembled Pachypeltis micranthus genome.

2. Pachypeltis_micranthus.cds.fasta: The predicted coding sequences (CDS) of Pachypeltis micranthus.

3. Pachypeltis_micranthus.pep.fasta: The predicted protein sequences of Pachypeltis micranthus.

4. Pachypeltis_micranthus.chromoses.fasta: The chromosome sequences of Pachypeltis micranthus.

5. Pachypeltis_micranthus.genome.gff: Feature annotations for the Pachypeltis micranthus genome.

6. Pachypeltis_micranthus.repeat.masked.fasta: The annotated tandem repeats (TRs) and transposable elements (TEs) in the P. micranthus genome.

7. Pachypeltis_micranthus.repeat.gff: Feature annotations for the tandem repeats (TRs) and transposable elements (TEs) in the P. micranthus genome.
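As a starting point for working with the FASTA files listed above, the following is a minimal Python sketch that reads a FASTA file and reports the number of records and their lengths. The function name is hypothetical, the demo uses an in-memory toy file, and the path in the usage comment is taken from the listing and may need to be adjusted to the actual downloaded file name.

```python
from io import StringIO


def fasta_lengths(handle):
    """Map each record ID to its sequence length from an open FASTA handle."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Toy in-memory FASTA (hypothetical records, not from the dataset).
demo = StringIO(">chr1 example\nACGTAC\nGT\n>chr2\nGGCC\n")
sizes = fasta_lengths(demo)
print(len(sizes), sum(sizes.values()))  # 2 12

# Usage with a downloaded file (path is an assumption):
# with open("Pachypeltis_micranthus.chromoses.fasta") as fh:
#     sizes = fasta_lengths(fh)
```

For the chromosome file, the number of records and their summed length could then be compared with the 15 chromosomes and 707.51 Mb reported in the methods.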


Province Key R&D Program of Yunnan, Award: 2018BB009
