The genome assembly of Bemisia tabaci MED
Data files
May 16, 2023 version files 637.58 MB
Abstract
Aim: The sweet potato whitefly, Bemisia tabaci MED, is a globally invasive species that causes serious economic damage to agroecosystems. Despite the significant threat it poses to agricultural and economic crops worldwide, the global invasion patterns and the genetic mechanisms contributing to the success of this notorious pest are still poorly understood. The objective of this research was to enhance genome and population genetic analyses to better understand the intricate invasion patterns of B. tabaci MED.
Location: Samples were collected in native (Spain, Croatia, Bosnia and Herzegovina, Cyprus, and Israel) and invaded regions (China, South Korea and North America).
Methods: We first assembled a chromosome-scale reference genome of B. tabaci MED, and then employed the restriction site-associated 2b-RAD method to genotype over 20,000 high-quality single nucleotide polymorphisms from 29 geographical populations.
Results: A reference genome of B. tabaci MED, with a size of 637.47 Mb, was obtained. The majority of the assembled sequences (99%) were anchored onto ten linkage groups, with an N50 size of 58.76 Mb, representing a significant improvement over previous whitefly genome assemblies. We identified rapidly expanded gene families and positively selected genes that probably contribute to successful invasion and rapid adaptation to new environments. Population genomics analysis showed that three highly differentiated genetic groups had formed, and that complex and extensive gene flow occurred across the Mediterranean populations. The genetic admixture patterns in East Asian populations were distinct from those in North America, indicating that they had different source populations.
Conclusions: The high-quality, chromosome-scale genome of B. tabaci MED offers opportunities for more comprehensive genome-wide studies, and provides a solid foundation for understanding the complex introduction events and the differential invasiveness of B. tabaci MED worldwide.
Methods
Genome survey and sequencing
To evaluate the genome size of B. tabaci MED, an Illumina high-throughput sequencing library with an insert size of 350 bp was constructed from genomic DNA and paired-end sequenced on the Illumina NovaSeq-6000 platform. We performed k-mer analysis to determine the genome size using Jellyfish v2.1.3 with a k-mer size of 17 (Figure S1) (Marcais and Kingsford, 2011). Afterwards, the repeat content and heterozygosity rate were calculated from the k-mer frequency distribution with GenomeScope 2.0 (Ranallo-Benavidez et al., 2020).
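The idea behind a k-mer based genome size estimate can be illustrated with a minimal Python sketch. It assumes a two-column histogram (k-mer depth, number of distinct k-mers) such as the one written by `jellyfish histo`, and uses the simple peak-based estimate (total k-mer occurrences divided by the depth of the main peak). This is not the model fitted by GenomeScope 2.0. The function names and the `min_depth` error cutoff are hypothetical choices made for illustration only.

```python
def parse_histogram(lines):
    """Parse 'depth count' lines into a dict {depth: distinct_kmer_count}."""
    histogram = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        histogram[int(fields[0])] = int(fields[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Return (estimated_size, peak_depth) from a k-mer depth histogram.

    min_depth is an assumed cutoff: k-mers seen fewer times are treated
    as sequencing errors and ignored.
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above the error cutoff")
    # Main peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences: each distinct k-mer counted 'depth' times.
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth, peak_depth
```

In practice the histogram would be read from the file produced for k = 17. For a heterozygous genome the tallest peak may be the heterozygous one, which is one reason a model-based tool is preferable to this simple estimate.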
Short paired-end libraries (350-bp insert size) were generated and sequenced on the Illumina NovaSeq-6000 platform for Illumina sequencing. A SMRTbell library (20-kb insert size) was constructed and run on the PacBio Sequel II platform for Pacific Biosciences (PacBio) sequencing. We filtered the Illumina raw sequencing data by trimming adapters and removing reads in which more than 10% of nucleotides were unknown (N) or 50% of bases were of low quality (Q-value <= 10), generating a total of 105.92 Gb of clean, high-quality reads. For the raw data from PacBio, short subreads (<5 kb) were removed and only a single representative subread was kept for each PacBio read.
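The filtering criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline actually used: adapter trimming is omitted, the boundary cases (exactly 10% N, exactly 50% low-quality bases) are assumed to pass, and the representative PacBio subread is assumed to be the longest one per read, which the text does not specify. All function and parameter names are hypothetical.

```python
def passes_illumina_filter(seq, quals, max_n_frac=0.10,
                           max_lowq_frac=0.50, lowq_threshold=10):
    """Return True if a read passes the N-content and base-quality filters.

    seq: read sequence; quals: per-base Phred scores (same length as seq).
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q <= lowq_threshold for q in quals) / len(quals)
    # Assumption: reads exactly at either threshold are retained.
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


def select_pacbio_subreads(subreads, min_len=5000):
    """Keep one subread per read (ZMW) after dropping subreads < min_len.

    subreads: iterable of (read_id, sequence) pairs.
    Assumption: the longest surviving subread is the representative.
    """
    best = {}
    for read_id, seq in subreads:
        if len(seq) < min_len:
            continue  # short subreads (<5 kb) are removed
        if read_id not in best or len(seq) > len(best[read_id]):
            best[read_id] = seq
    return best
```

Both functions work on in-memory values so the thresholds are easy to inspect; reading FASTQ or BAM input is left out deliberately.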
Approximately 300 pairs of fresh B. tabaci MED adult samples were prepared for Hi-C library construction with a slight modification of the procedure described by Putnam et al. (2016). Cells were cross-linked in a 2% formaldehyde solution, and the reaction was terminated by the addition of 2.5 M glycine. Afterwards, the cells were quickly frozen in liquid nitrogen and preserved at -80℃. The chromatin was then digested with the HindIII restriction enzyme. The Hi-C library was prepared by incubation with biotin, followed by the addition of T4 DNA ligase to facilitate DNA ligation. After Proteinase K was added to reverse the cross-linking, the ligated DNA was sheared into fragments of 300 to 700 bp. The fragments were blunt-end repaired and A-tailed, and then purified by biotin-streptavidin pull-down. After quality assessment, the libraries were sequenced on the Illumina NovaSeq 6000 platform.
Genome assembly
We constructed the draft genome from the raw reads produced by both the Illumina and PacBio sequencing platforms. PacBio long reads established the genome framework and Illumina short reads improved the genome assembly. The PacBio long reads were first corrected with Canu v1.7 (Koren et al., 2017), and the corrected reads were then assembled into contigs with SmartDeNovo (available at https://github.com/ruanjue/smartdenovo). The PacBio reads were used to polish the primary contigs with Quiver (SMRT Link v5.0.1), and the gaps were filled with PBJelly (English et al., 2012). Finally, the Illumina short reads were used to correct errors in the genome assembly with Pilon v1.22 (Walker et al., 2014).
The Hi-C reads were aligned to the B. tabaci MED draft genome, and only uniquely mapped reads were used for assembly with the LACHESIS software (Langmead et al., 2009; Burton et al., 2013). After duplicate reads were removed, HiC-Pro (version 2.8.1) was applied to assess data quality (Servant et al., 2015). Afterwards, the corrected scaffolds were assembled using LACHESIS, and the interaction matrix of all chromosomes was generated and visualized as heatmaps. The quality and completeness of the assembly were assessed by Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis using BUSCO version 3.0 (Li, 2013; Simão et al., 2015) and Core Eukaryotic Genes Mapping Approach (CEGMA) analysis using CEGMA v2.5 (Parra et al., 2007).
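Besides BUSCO and CEGMA completeness, the abstract summarizes the scaffolded assembly by its N50 and by the fraction of sequence anchored to linkage groups. The Python sketch below shows how these two statistics are conventionally computed from a list of sequence lengths. The function names are hypothetical and the sketch is not taken from the tools cited above.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def anchored_fraction(anchored_lengths, unanchored_lengths):
    """Fraction of assembled bases placed on linkage groups."""
    anchored = sum(anchored_lengths)
    total = anchored + sum(unanchored_lengths)
    if total == 0:
        raise ValueError("assembly has zero total length")
    return anchored / total
```

For example, with lengths 6, 5, 4, 3 and 2 (total 20), the two longest sequences already cover 11 of the 20 bases, so the N50 is 5.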
Genome annotation
We first constructed a de novo repeat library using RepeatModeler (http://www.repeatmasker.org/RepeatModeler/), LTR_Finder and RepeatScout (Price et al., 2005; Xu et al., 2007; Hoede et al., 2014). Then, the de novo library was merged with the Repbase database (Jurka et al., 2005) and used to search the B. tabaci MED genome for repeat sequences with RepeatMasker (http://www.repeatmasker.org/) (Tarailo-Graovac & Chen, 2009). Tandem repeats in the genome were identified using Tandem Repeats Finder v4.07b. With these procedures, we annotated the repeat sequences of the B. tabaci MED genome.
We used multiple methods to identify the conventional small non-coding RNAs, such as small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), transfer RNAs (tRNAs), and ribosomal RNAs (rRNAs). In detail, rRNAs were located through a BLAST search against a database of invertebrate rRNA, with a threshold E-value of 1e-10. The tRNAscan-SE software was used for the identification of tRNAs (Lowe & Eddy, 1997). The identification of miRNAs, snRNAs, and snoRNAs was performed using INFERNAL (v1.1rc4) to search against the Rfam database (Nawrocki & Eddy, 2013).
We applied three different gene annotation methods: transcriptome-based annotation, homology-based annotation, and ab initio prediction. Specifically, the homology-based approach was carried out with GeMoMa v1.3.1, using as references the protein sequences of Drosophila melanogaster, Halyomorpha halys, and Nilaparvata lugens from GenBank and those of Bemisia tabaci from the GigaScience database (Chen et al., 2016, 2019; Keilwagen et al., 2016; Xie et al., 2017).
The functions of the genes were determined through alignment with several widely used databases, including eukaryotic orthologous groups of proteins (KOG), the National Center for Biotechnology Information (NCBI) non-redundant protein database (NR), TrEMBL, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Swiss-Prot. The alignment was performed using BLAST v2.2.31 with an E-value threshold of 1e-5 (Altschul et al., 1990). Gene Ontology (GO) terms were assigned with the Blast2GO software based on the alignment results against the NR database (Conesa et al., 2005; Marchler-Bauer et al., 2010).
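The E-value thresholding used for functional assignment can be illustrated with a short Python sketch. It assumes BLAST results in the standard 12-column tabular layout (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). Whether the original analysis used this output format is not stated. Keeping the highest-bitscore hit per query, and treating hits exactly at the threshold as passing, are further assumptions, and the function name is hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    lines: iterable of tab-separated BLAST tabular records (12 columns).
    Hits with an E-value above max_evalue are discarded.
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column record
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        # Assumption: the hit with the highest bitscore represents the query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

The same function could be applied with `max_evalue=1e-10` to mirror the stricter threshold described for the rRNA search.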