Skip to main content

Migration trajectories of the diamondback moth Plutella xylostella in China inferred from population genomic variation

Cite this dataset

Chen, Ming‐Zhu et al. (2023). Migration trajectories of the diamondback moth Plutella xylostella in China inferred from population genomic variation [Dataset]. Dryad.



The diamondback moth (DBM), Plutella xylostella (Lepidoptera: Plutellidae), is a notorious pest of cruciferous plants. In temperate areas, annual populations of DBM originate from adult migrants. However, the source populations and migration trajectories of immigrants remain unclear. Here, we investigated migration trajectories of DBM in China with genome-wide single nucleotide polymorphisms (SNPs) genotyped using double-digest RAD (ddRAD) sequencing. We first analyzed patterns of spatial and temporal genetic structure among southern source and northern recipient populations, then inferred migration trajectories into northern regions using discriminant analysis of principal components (DAPC), assignment tests and spatial kinship patterns.


Temporal genetic differentiation among populations was low, indicating sources of recipient populations and migration trajectories are stable. Spatial genetic structure indicated three genetic clusters in the southern source populations. Assignment tests linked northern populations to the Sichuan cluster, and central-eastern populations to the South and Yunnan clusters, indicating that Sichuan populations are sources of northern immigrants and South and Yunnan populations are sources of central-eastern populations. First-order (full-sib) and second-order (half-sib) kin pairs were always found within populations, but about 35-40% of third-order (cousin) pairs were found in different populations. Closely related individuals in different populations were in about 35-40% of cases found at distances of 900 to 1500 km, while some were separated by over 2000 km.


This study unravels seasonal migration patterns in the DBM. We demonstrate how careful sampling and population genomic analyses can be combined to help understand cryptic migration patterns in insects.


Specimen collection and DNA extraction

DBM were sampled from potential source population locations in the annual breeding area of southern China. DBM were collected from cabbage and oilseed rape fields, and all sampling was completed before the first observations of DBM in northern China between March and May 1, 2. In order to reduce the likelihood of sampling siblings within populations, third- and fourth-instar larvae of DBM were collected from about 20 sites at each sampling location, each at least 10 m apart. Putative immigrant male adults were collected in northern China by sex pheromone trapping before the presence of first-generation larvae. Trapping of male DBM was conducted in unplanted fields with no greenhouses within 500 m, to reduce the likelihood of trapping individuals overwintering in protected conditions. The distance between traps was at least 50 m. The development of one generation of DBM takes about 30 days in early spring 3. This strategy therefore restricted sampling of genetically related individuals to within three generations between source and recipient populations, and reduced the influence of genomic admixture between immigrants from different sources. This sampling was conducted in 2017 and again in 2018, to examine annual variation in migratory trajectories and temporal variation in population genetic structure. In total, samples were collected from 16 locations in 2017 and 17 locations in 2018, and in 2018 four locations were sampled across multiple months (Fig. 1, Table 1). Twenty individuals from each population (specimens collected at different times from the same location were considered as different populations) were used for genotyping. Genomic DNA for library preparation was extracted from individual specimens using DNeasy Blood and Tissue Kit (Qiagen, Germany).

SNP genotyping

The ddRAD libraries were prepared following a published protocol 4 for identifying SNPs. Briefly, 120 ng of extracted genomic DNA from each sample was digested by the restriction enzymes NlaIII and AciI (New England Biolabs, USA) 5. The 50 μL digestion reaction was run for 3 hours at 37 °C, followed by DNA cleaning using 1.5× volume of AMPure XP beads (Beckman Coulter, USA) instead of a heat kill step. Next, we ligated each sample to adapters barcoded with a combinatorial index at 16 °C overnight in a 40 μL ligation reaction, labeling each population with a 6-bp index and each individual with a unique 9-bp barcode. After ligation, we pooled uniquely barcoded samples into multiplexed libraries. Fragments between 380-540 bp were selected using BluePippin and a 2% gel cassette (Sage Sciences, USA). Finally, the pooled libraries were enriched with 12 amplification cycles on a Mastercycler Nexus Thermal Cycler (Eppendorf, Germany). PCR products were cleaned with 0.8× volume of beads. We used Qubit 3.0 (Life Invitrogen, USA) and Agilent 2100 Bioanalyzer (Agilent Technology, USA) to check the concentration and size distribution of enriched libraries, respectively. Pooled libraries were sequenced on an Illumina HiSeq 2500 platform to obtain 150-bp paired-end reads, at BerryGenomics Company (Beijing, China).

The Stacks v2.3 pipeline 6 was used to call SNPs, linking to the DBM genome (GenBank assembly accession: GCA_000330985.1) as reference 7. FastQC v 0.11.5 was employed to assess read quality and check for adapter contamination 8. Sequence data was demultiplexed and trimmed using process_radtags in Stacks v2.3 6, 9. Low quality reads with a Phred score below 20 were removed as well as any reads with an uncalled base. Reads were trimmed to 140 bp in length. The remaining paired-end reads were aligned to the DBM genome 7 using Bowtie v2.3.5 10.

Output reads for all individuals were imported into Stacks pipeline to call SNPs, requiring a minimum of three identical reads to create a stack. SNPs were called using a maximum likelihood statistical model. Finally, we obtained a catalog with all possible loci and alleles. The exported loci were present in all populations, and in at least 75% of individuals per population. The exported SNPs for populations that were collected in both years were further filtered using the R package vcfR 11 and VCFtools v0.1.16 12 with the following criteria: SNPs with sequencing depth ≤ 3 and in the highest 0.1% depth were removed, as were SNPs with missingness in all samples ≥ 0.05 and those with minimum minor allele count ≤ 20. An additional data matrix was generated by retaining only SNPs separated by at least 500 bp, to reduce linkage among SNPs.

Genetic diversity, population structure and assignment tests

Global population differentiation was estimated using Weir and Cockerham’s FST with 99% confidence intervals (1000 bootstraps) in diveRsity version 1.9.90. Pairwise FST for all population pairs was estimated using GenePop version 4.7.2 13. Discriminant analysis of principal components (DAPC) was performed in the R package adegenet v2.1.1 14, with the optimal number of clusters determined by the Akaike information criterion (AIC).

Assignment tests were performed in assignPOP v1.1.7 15. Source groups of ST (south) and SW (southwest, this group was divided into YN and SC groups in 2018) (see Table 1 and Fig. 1 for locations) were trained using the support vector machine algorithm to build predictive models. For training, we used either 25, 28, or 32 random individuals (2017 samples) or 13, 15 or 17 random individuals (2018 samples) from each group, and loci with the highest 60%, 80% or 100% FST values. Monte-Carlo cross-validation was performed by resampling each training set combination 1000 times. The ratio of assignment probability between the most-likely and second most-likely assigned groups was calculated for each individual 16. When an individual showed an assignment ratio smaller than 2 in more than 30% of the resampling analysis, it was considered unstable and removed in subsequent training. This allowed us to remove individuals from source populations that are not similar enough to other individuals in that source population, thus leaving a set of source populations each comprised of individuals distinctive from those in other populations. Immigrants from the CE (central) and NT (north) regions (see Table 1 and Fig. 1 for locations) were assigned to the trained groups using the support vector machine algorithm.

Kinship analysis

As a complement to assignment tests (but focusing on the individual level rather than the population level), we investigated spatial patterns of kinship within and between populations. Related individuals were identified following the method of Jasper, Schmidt, Ahmad, Sinkins and Hoffmann 17. First, Loiselle’s K was calculated for all individual pairs using SPAGeDi 18 . Kinship coefficients represent the probability that any allele scored in both individuals is identical by descent, with theoretical mean K values for each kinship category as follows: full‐siblings = 0.25, half-siblings = 0.125, full‐cousins = 0.0625, half‐cousins = 0.0313, second-cousins = 0.0156 and unrelated = 0. To allocate pairs of individuals to relatedness categories across three orders of kinship, maximum‐likelihood estimation in the program ML‐Relate 19 was used to identify first‐order (full‐sibling) and second‐order (half‐sibling) pairs. The K scores of pairs within the full‐sibling and half-sibling data sets were used to calculate standard deviations for these categories. Using the theoretical means and standard deviations of K, we randomly sampled 100,000 simulated K scores from each kinship category. In the initial pool of 40755 pairings (2017) and 89676 pairings (2018), ML‐Relate identified 33 (2017) and 36 (2018) full‐sibling and half‐sibling pairs. Assuming that the data contained twice as many first cousin (full and half) pairings as sibling (full and half) pairings, and twice as many second cousin pairings as first cousin pairings, final sampling distributions were developed as follows: 100,000 unrelated, 320 second-cousins, 80 full‐cousins, 80 half‐cousins, 40 half‐siblings, 40 full‐siblings (2017) and 100,000 unrelated, 160 second-cousins, 40 full‐cousins, 40 half‐cousins, 20 half‐siblings, 20 full‐siblings (2018). To analyze how closely this distribution approximates the field data, we randomly sampled 40,000 (2017) and 80,000 (2018) simulated K scores from the above sampling distribution and plotted a histogram of this combined distribution and a histogram of the unrelated distribution against a histogram of 40,000 (2017) and 80,000 (2018) K scores from the empirical data. As the combined distribution matched the empirical distribution much more closely than the unrelated distribution, we adopted it for kinship inference and used it to determine appropriate kinship coefficient cutoffs for third-order (cousin) relationships, following Jasper, Schmidt, Ahmad, Sinkins and Hoffmann 17.

1. Zhu LH, Li ZY, Zhang SF, Xu BY, Zhang YJ, Zalucki MP, Wu QJ and Yin XH, Population dynamics of the diamondback moth, Plutella xylostella (L.), in northern China: the effects of migration, cropping patterns and climate. Pest Manag Sci 74(8): 1845-1853 (2018).

2. Fu X, Xing Z, Liu Z, Ali A and Wu K, Migration of diamondback moth, Plutella xylostella, across the Bohai Sea in northern China. Crop Prot 64(143-149 (2014).

3. Furlong MJ, Wright DJ and Dosdall LM, Diamondback moth ecology and management: problems, progress and prospects. Annu Rev Entomol 58(517-541 (2013).

4. Peterson BK, Weber JN, Kay EH, Fisher HS and Hoekstra HE, Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One 7(5): e37135 (2012).

5. Li BY, Gao Q, Cao LJ, Hoffmann AA, Yang Q, Zhu JY and Wei SJ, Conserved profiles of digestion by double restriction endonucleases in insect genomes facilitate the design of ddRAD. Zoological Systematics 43(4): 341-355 (2018).

6. Catchen J, Hohenlohe PA, Bassham S, Amores A and Cresko WA, Stacks: an analysis tool set for population genomics. Molecular Ecology 22(11): 3124-3140 (2013).

7. You MS, Yue Z, He WY, Yang XH, Yang G, Xie M, Zhan DL, Baxter SW, Vasseur L, Gurr GM, Douglas CJ, Bai JL, Wang P, Cui K, Huang SG, Li XC, Zhou Q, Wu ZY, Chen QL, Liu CH, Wang B, Li XJ, Xu XF, Lu CX, Hu M, Davey JW, Smith SM, Chen MS, Xia XF, Tang WQ, Ke FS, Zheng DD, Hu YL, Song FQ, You YC, Ma XL, Peng L, Zheng YK, Liang Y, Chen YQ, Yu LY, Zhang YN, Liu YY, Li GQ, Fang L, Li JX, Zhou X, Luo YD, Gou CY, Wang JY, Wang J, Yang HM and Wang J, A heterozygous moth genome provides insights into herbivory and detoxification. Nature Genetics 45(2): 220-225 (2013).

8. Andrews S. FastQC: a quality control tool for high throughput sequence data (2010).

9. Catchen JM, Amores A, Hohenlohe P, Cresko W and Postlethwait JH, Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences. G3 1(3): 171-182 (2011).

10. Langmead B and Salzberg SL, Fast gapped-read alignment with Bowtie 2. Nature Methods 9(4): 357 (2012).

11. Knaus BJ and Grünwald NJ, vcfr: a package to manipulate and visualize variant call format data in R. Molecular Ecology Resources 17(1): 44-53 (2017).

12. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R and Group GPA, The variant call format and VCFtools. Bioinformatics 27(15): 2156-2158 (2011).

13. Rousset F, genepop'007: a complete re-implementation of the genepop software for Windows and Linux. Mol Ecol Resour 8(1): 103-106 (2008).

14. Jombart T, Devillard S and Balloux F, Discriminant analysis of principal components: A new method for the analysis of genetically structured populations. BMC Genetics 11(94-108 (2010).

15. Chen K-Y, Marschall EA, Sovic MG, Fries AC, Gibbs HL, Ludsin SA and Poisot T, assignPOP: An r package for population assignment using genetic, non-genetic, or integrated data in a machine-learning framework. Methods Ecol Evol 9(2): 439-446 (2018).

16. Schmidt TL, van Rooyen AR, Chung J, Endersby-Harshman NM, Griffin PC, Sly A, Hoffmann AA and Weeks AR, Tracking genetic invasions: Genome-wide single nucleotide polymorphisms reveal the source of pyrethroid-resistant Aedes aegypti (yellow fever mosquito) incursions at international ports. Evol Appl 12(6): 1136-1146 (2019).

17. Jasper M, Schmidt TL, Ahmad NW, Sinkins SP and Hoffmann AA, A genomic approach to inferring kinship reveals limited intergenerational dispersal in the yellow fever mosquito. Mol Ecol Resour 19(5): 1254-1264 (2019).

18. Hardy OJ and Vekemans X, SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2(618-620 (2002).

19. Kalinowski ST, Wagner AP and Taper ML, ML-Relate: a computer program for maximum likelihood estimation of relatedness and relationship. Mol Ecol Notes 6(2): 576-579 (2006).

Usage notes

VCF and genepop format data of SNPs generated in this study;

Scripts used in this study.


Ministry of Science and Technology of the People's Republic of China, Award: 2018YFD0201207