Population genomic signatures of the oriental fruit moth related to the Pleistocene climates
Data files
Jan 11, 2022 version files 1.24 GB
-
data_for_graphs.zip
10.07 MB
-
gm_31sample_all_mac1_hwee7.recode.vcf.gz
1.23 GB
Abstract
The Quaternary climatic oscillations are expected to have had strong impacts on the evolution of species. Although legacies of the Quaternary climates on population processes have been widely identified in diverse groups of species, adaptive genetic changes shaped during the Quaternary have been harder to decipher. Here, we assembled a chromosome-level genome of the oriental fruit moth and compared genomic variation among refugial and colonized populations of this species that diverged in the Pleistocene. High genomic diversity was maintained in refugial populations. Demographic analysis showed that the effective population size of refugial populations declined during the penultimate glacial maximum (PGM) but remained stable during the last glacial maximum (LGM), indicating a strong impact of the PGM rather than the LGM on this pest species. Genome scans identified one chromosomal inversion and a mutation of the circadian gene Clk on the neo-Z chromosome potentially related to the endemicity of a refugial population. In the colonized populations, genes in pathways of energy metabolism and wing development showed signatures of selection. These different genomic signatures of refugial and colonized populations point to multiple impacts of Quaternary climates on adaptation in an extant species.
Samples
A laboratory-reared strain of the OFM was used for de novo genome sequencing. This strain derived from three male and female pairs and had been maintained for ten generations on apples in laboratory conditions. We collected 263 individuals from 15 geographical populations across the native range of the OFM in China, of which three populations were collected from Sichuan basin, three populations were from the Yunnan region, and six populations were from regions where OFM subsequently dispersed (Table S1, Fig. 2). We chose three representative populations (31 individuals in total) for whole-genome resequencing; one population (YNHH) was collected from the original and refugial area in Yunnan, one (SCCD) was from another refugial area in Sichuan, while the last population (BJPG) was from colonized areas of northern China (Table S1 and Fig. 2). The other 12 populations (232 individuals) were genotyped by the Kompetitive Allele-Specific PCR (KASP) method for 22 representative SNP outliers (see below).
De novo genome assembly
We constructed and sequenced an Illumina library, a NanoPore library, a Hi-C proximity ligation library, and four RNA-seq libraries (eggs, larvae, pupae, and adults) for assembly and annotation of the OFM genome. The raw reads generated from the Illumina platform were filtered by Trimmomatic v0.38 (Bolger et al. 2014) and then used to estimate genome size, heterozygosity, and duplication rate using GenomeScope v1.0 (Vurture et al. 2017).
Long reads generated from the NanoPore platform were corrected and assembled using CANU version v1.8 (Koren et al. 2017) with default parameters. The assembled contigs were polished based on Illumina short reads using Pilon v1.22 (Walker et al. 2014). To remove the possible secondary alleles, the assembled contigs were filtered using the pipeline Purge Haplotigs (Roach et al. 2018), resulting in a contig-level genome. The Illumina short reads sequenced from the Hi-C library were used to assemble these contigs into a chromosome-level genome using the Juicer v1.5 (Neva C. et al. 2016) and 3D de novo assembly (3D-DNA) pipelines (Dudchenko et al. 2017).
The completeness of each assembled version of the genome was assessed using a Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 (Simao et al. 2015) analysis, based on the insecta_odb9 database (1,658 genes). We conducted a synteny analysis between OFM and the codling moth Cydia pomonella (Lepidoptera: Tortricidae) (Assembly accession: GCA_003425675.2) (Wan et al. 2019) and Spodoptera litura (Lepidoptera: Noctuidae) (Assembly accession: GCF_002706865.1) (Cheng et al. 2017) using MCSCAN (Wang et al. 2012).
Genome annotation
Repeats and transposable element families in the OFM genome were detected by RepeatMasker pipeline v4.0.7 (Tarailo-Graovac & Chen 2009) against the Insecta repeats within RepBase Update (http://www.girinst.org) and Dfam database (20170127), with RMBlast v2.10.0 as a search engine. tRNAs were annotated by tRNAscan-SE (Lowe & Eddy 1997) with default parameters; rRNAs were annotated by RNAmmer prediction (Lagesen et al. 2007).
The protein-coding gene in the OFM genome was annotated using ab initio, RNA-seq-based, and homolog-based methods in the MAKER version 3.01.03 genome annotation pipeline (Cantarel et al. 2008). For the RNA-seq-based method, the RNA-seq reads were first mapped to the genome of OFM with Hisat v2.2.0, and then the transcripts were assembled using StringTie v2.1.2. For ab initio methods, parameters of SNAP v2013-02-16 (Korf 2004) and Augustus v3.2.3 (Stanke & Waack 2003) were estimated or trained before using them to predict genes in MAKER. The SNAP parameters were estimated from high-quality transcripts obtained by improvement and filtering using PASA v2.4.1 (Brian J et al. 2003). The gene model of Augustus was directly obtained from the above BUSCO analysis of the genome assembly. For the homolog-based method, protein-coding genes of Drosophila melanogaster and Bombyx mori were used. Then, fragments per kilobase per million (FPKM) values of each gene predicted by the MAKER pipeline were calculated using cufflinks version 2.2.1 (Kim et al. 2013); the gene set was filtered by keeping those with an FPKM value > 0 in any RNA-seq library. Finally, PASA was used to update the annotation based on transcripts; all predictions were further filtered using GffRead v0.11.7 implemented in Cufflinks v2.2.1 (Kim et al. 2013) to remove genes having in-frame stop codons. Functions of the protein-coding genes were annotated using eggNOG-Mapper v1.0.3 (Huerta-Cepas et al. 2017) against the database EggNOG v5.0 (Huerta-Cepas et al. 2019). Genes that can be functionally annotated by EGGNOG analysis were retained in gene structure annotation and used for further analysis.
Gene family annotation
Protein-coding genes from available genomes of two Coleoptera, two Diptera and another 11 Lepidoptera were retrieved from the NCBI genome database for comparative analysis. Orthologs were identified using OrthoFinder v2.2.7 (Emms & Kelly 2015) under default parameters. MAFFT v7.450 (Katoh & Standley 2013) was used to align amino acid sequences of 1:1:1 orthologous gene with the G-INS-I algorithm. The phylogenetic tree was inferred using an approximately-maximum-likelihood method implemented in FastTree v2.1.10 (Price et al. 2009). We used r8s (Sanderson 2003) to estimate the divergent times among species with divergence times of two nodes, i.e. Tribolium castaneum and Anoplophora glabripennis (Wang et al. 2019), Trichoplusia ni and S. litura (Wan et al. 2019) as calibrations. The Computational Analysis module of gene Family Evolution (CAFE) version 3.1 (Bie et al. 2006) was used to analyze gene family expansion and contraction.
To explore possible genomic components related to environmental adaption in the OFM as well as the other tortricid moth, C. pomonella, we manually annotated detoxification genes, chemosensory genes, and heat shock proteins (HSP) genes and compared these gene families among 13 representative genomes of Lepidoptera. The detoxification genes include five families of cytochrome P450 monooxygenases (P450s), glutathione-s transferases (GSTs), ATP-binding cassette transporters (ABCs), UDP-glycosyltransferases (UGTs), and carboxyl/cholinesterases (CCEs). The chemosensory genes include four families of olfactory receptors (ORs), gustatory receptors (GRs), Ionotropic receptors (IRs), and odorant-binding proteins (OBPs).
We used both model-based and similarity-based methods to annotate these gene families. For model-based identification, the Hidden Markov models (HMMs) were downloaded from Pfam 32.0 database (September 2018; (El-Gebali et al. 2018)) and run with HMMER v3.3 (Finn et al. 2011). The corresponding HMM model not found in the Pfam database was manually trained using HMMER under the default parameters. For similarity-based identification, we used orthologs from D. melanogaster, B. mori, Aedes aegypti, Anopheles gambiae, and C. pomonella to search against target genomes using BLAST v2.2.31 (Altschul et al. 1990) with an e-value cutoff of 1e-5. An automatic BITACORA v1.0 (Vizueta et al. 2020) pipeline (full mode) was used to conduct the HMMER and BLAST analyses. The annotated genes were filtered manually based on gene length and the presence of conserved domains by removing genes shorter than 80 amino acids and those lacking conserved domains.
Genotyping populations across the native range
For genome resequencing of three representative populations, individual DNA libraries with an insert size of 400 bp were constructed and sequenced on the Illumina HiSeq X Ten platform to obtain 2×150 bp paired-end reads. A sequence depth of approximately 36-fold was obtained for each sample. After filtering out raw sequencing reads containing adapters and reads of low quality, the remaining clean reads were mapped to the reference assembly using BWA v0.7.17 with default parameters (Li & Durbin 2009). SAMtools v1.9 (Li et al. 2009) was used to sort reads and remove mapping quality lower than 30. Single-nucleotide polymorphism (SNP) calling was performed using the Genome Analysis Toolkit (GATK) v3.5 (McKenna et al. 2010). The criteria used to filter the raw SNPs were “QD < 2.0, FS > 60, SOR > 4.0, MQ < 40”. SNPs were further filtered using the R package vcfR (Knaus & Grünwald 2017) and VCFtools v0.1.16 (Danecek et al. 2011) with the following criteria: SNPs with a sequencing depth lower than four and higher than 500 were removed; SNPs with a missing rate higher than 10% were removed. All the SNPs were annotated with SnpEff v4.3 (Cingolani et al. 2012).
For genotyping the other 12 populations of OFM, we used the KASP, a fluorescence-based method. This method is suitable for detecting flexible numbers of SNPs with low cost and high accuracy. We selected 22 SNPs on 9 outlier genes for genotyping (see below for outlier selection). Primers for each SNP were designed based on the genome sequence using SNP_Primer_Pipeline2 (https://github.com/pinbo/SNP_Primer_Pipeline2). The KASP reactions were conducted in 1,536 microplates with 1 μl final reaction volumes containing 1-5 ng genomic DNA, 0.5 μl of KASP 2x Master Mix (KBS-1016-011, LGC Genomics Ltd. Hoddesdon, UK), 0.5 μl of ultrapure water, 0.168 pmol of each allele-specific forward primer and 0.42 pmol reverse primer. The amplifications were conducted on an LGC Hydrocycler high‐throughput thermal cycler (LGC Genomics Ltd. Hoddesdon, UK) using the following program: 95 °C for 15 min; 10 cycles of 94 °C for 20s and touchdown starting at 61 °C for 60s until to 55 °C, then 26-42 cycles of 94 °C for 20s and 55 °C for 60s. The fluorescence was scanned by Pherastar (LGC Genomics Ltd. Hoddesdon, UK), and the genotyping results were visualized and output files generated using Kraken software (LGC Genomics Ltd. Hoddesdon, UK). Due to unavailable KASP amplication of the Clk gene on the neo-Z chromosome, a pair of PCR primers (Clk_24750581_F, GCYAAWTATGCCAATTCCAA; Clk_24751090_R, ACTGTTTGCAGCGACCTACC) was designed for Sanger sequencing of the target region. PCR amplication and sequencing followed the method for mitochondrial genes (Song et al. 2018; Wei et al. 2015).
Demographic and population genetic diversity analyses
We used SMC++ v1.15.4 (Terhorst et al. 2017) to analyze the historical effective population size spanning 4 X 106 generations based on SNPs of autosomes. We assumed a mutation rate of 2.9 × 10−9 per site per generation as estimated for Heliconius (Lepidoptera: Hesperiidae) (Keightley et al. 2015), a generation time of 0.25 years.
K-mers were counted by jellyfish v 2.2.10 with a 17-base oligonucleotide. Genome size, heterozygosity, and rate of duplication were estimated for each individual using GenomeScope v1.0 (Vurture et al. 2017). Pairwise genetic differentiation (FST) and nucleotide diversity (π) of the OFM populations were calculated using VCFtools v0.1.16 (Danecek et al. 2011) with a window size of 50 kb and widow step of 10 kb. FST and nucleotide diversity (π) were compared. The decay of LD against physical distance for the different populations was calculated using PopLDdecay (Zhang et al. 2019a) with default parameters.
Genome structrual variation analysis
The program lostruct (local PCA/population structure, v.0.0.0.9) was used to detect chromosomal inversions from genome resequencing data (Li & Ralph 2019) . Lostruct divides the genome into non-overlapping windows and calculates a PCA for each window. It then compares the PCAs derived from each window and calculates a similarity score. The matrix of similarity scores is then visualized using a multidimensional scaling (MDS) transformation. Lostruct was run with 100 SNP-wide windows and independently for each chromosome. Each MDS axis was then visualized by plotting the MDS score against the position of each window in the chromosome. We identified recombination suppression regions following (Todesco et al. 2020). Briefly, we manually selected the potential regions based on an MDS axis and minimum or maximum value that included windows within the region, but excluded the rest of the chromosome. All SNPs within the regions defined by MDS scores were used to calculate PCAs using R to test whether they could divide samples into three groups representing 0/0, 0/1 and 1/1 genotypes. Average heterozygosity and LD within the regions were also calculated to test whether they were higher than the rest of the chromosome.
Genome scanning of genes potentially under selection
We conducted genome scans to identify outliers potentially related to the refugial population. The genome-wide distribution of the fixation index (FST) and π between populations were calculated using VCFtools v0.1.16 (Danecek et al. 2011) with a window size of 50 kb and window step of 10 kb. The combination of the top 5% of FST values and the top 5% of π ratios between population pairs were considered as selective sweep regions. Since π and FST values varied among chromosomes (see Results), each chromosome was analyzed independently. To detect outliers related to northward dispersal of OFM, we used population pairs of refuge / north, i.e. SCCD / BJPG and YNHH / BJPG. The intersection of these two combination were considered as regions involving spatial sorting during northward dispersal. Missense SNPs with maf > 0.05 in these regions were extracted based on annotation results of SnpEff v4.3 analysis. These missense SNPs with higher FST values between BJPG and other two refuge population than between YNHH and SCCD populations were retained as outlier SNPs. To detect outliers related to endemicity of SCCD population, we used π ratios and FST values of BJPG / SCCD and YNHH / SCCD, and filtered outlier regions, SNPs and genes as above.
We used Bedtools v2.2.80 (https://bedtools.readthedocs.io/en/latest/) to identify genes near candidate SNPs (± 1 bp). We performed functional enrichment analyses using the clusterProfiler toolkit (Yu et al. 2012), where the significance level was set at 0.05 and the P-value was corrected using the Benjamini-Hochberg FDR.