Skip to main content
Dryad logo

Faster‐haplodiploid evolution under divergence‐with‐gene‐flow: Simulations and empirical data from pine‐feeding hymenopterans


Bendall, Emily; Linnen, Catherine; Bagley, Robin K.; Sousa, Vitor C. (2022), Faster‐haplodiploid evolution under divergence‐with‐gene‐flow: Simulations and empirical data from pine‐feeding hymenopterans, Dryad, Dataset,


Although haplodiploidy is widespread in nature, the evolutionary consequences of this mode of reproduction are not well characterized. Here, we examine how genome-wide hemizygosity and a lack of recombination in haploid males affects genomic differentiation in populations that diverge via natural selection while experiencing gene flow. First, we simulated diploid and haplodiploid “genomes” (500-kb loci) evolving under an isolation-with-migration model with mutation, drift, selection, migration, and recombination; and examined differentiation at neutral sites both tightly and loosely linked to a divergently selected site. So long as there is divergent selection and migration, sex-limited hemizygosity and recombination cause elevated differentiation (i.e., produce a “faster-haplodiploid effect”) in haplodiploid populations relative to otherwise equivalent diploid populations, for both recessive and codominant mutations. Second, we used genome-wide SNP data to model divergence history and describe patterns of genomic differentiation between sympatric populations of Neodiprion lecontei and N. pinetum, a pair of pine sawfly species (order: Hymenoptera; family: Diprionidae) that are specialized on different pine hosts. These analyses support a history of continuous gene exchange throughout divergence and reveal a pattern of heterogeneous genomic differentiation that is consistent with divergent selection on many unlinked loci. Third, using simulations of haplodiploid and diploid populations evolving according to the estimated divergence history of N. lecontei and N. pinetum, we found that divergent selection would lead to higher differentiation in haplodiploids. Based on these results, we hypothesize that haplodiploids undergo divergence-with-gene-flow and sympatric speciation more readily than diploids.


DNA was extracted and ddRAD libraries were prepared for 23 Neodiprion pinetum larvae and 44 N. lecontei larvae collected from multiple locations in Kentucky, as well as 4 interspecific hybrids and an additional 18 N. lecontei samples from an allopatric population in Michigan. These libraries were sequenced using 150-bp paired end reads on an Illumina HiSeq 4000. We also collected whole-genome resequncing data from  a N. virginiana sample to be used as an outgroup in demographic analyses. Neodiprion sequencing reads are available via the NCBI SRA, accession numbers: SAMN23893940-SAMN23893944, SAMN23893948, SAMN23893960-SAMN23893963, SAMN23893965, and SAMN25157024-SAMN25157101.

We aligned demultiplexed ddRAD reads to the N. lecontei reference genome (Nlec1.1 GenBank assembly accession number- GCA_001263575.2) using the very sensitive setting in bowtie2. We only retained reads that aligned to one locus in the reference genome and had a Phred score greater than 30. For the ddRAD dataset, we removed PCR duplicates using a custom script. We called SNPs in samtools. We required all sites to have a minimum of 7x coverage and 50% missing data or less. We also removed SNPs with significantly more heterozygotes than expected under Hardy-Weinberg equilibrium (an indicator of genotyping/mapping error). We removed any individual that was missing more than 70% of the data. We performed all filtering in VCFtools v0.1.13.

We created several datasets with subsets of individuals and additional filtering for each of the population genetic analyses. We generated three data sets with minor allele filtering (MAF, SNPs <0.01 removed): 1) sympatric N. pinetum and N. lecontei for genome-wide patterns of divergence (36,935 SNPs), 2) sympatric N. pinetum, N. lecontei, and hybrids for admixture analysis (35,649 SNPs), and 3) sympatric N. pinetum, N. lecontei, allopatric N. lecontei, and outgroup N. virginiana for ABBA- BABA tests (12,905 SNPs). We also generated a down-sampled dataset (described below) without a MAF filter for estimating site-frequency spectra (SFS) that included sympatric N. pinetum, N. lecontei, and N. virginiana for demographic analyses.

Usage Notes

ZIP files contains all input files (dryad) and scripts (zenodo) needed to run a set of population genetic analyses (plus a README file). 


Data and scripts used to perform the ABBA-BABA tests (D-statistic), including the following files:

- data_files folder: VCF and individual info files

-  bash script to filter the VCF file based on DP and HWE and obtain a genotype matrix file
- analysis_NeoLecPin_feb2017.r:  R script to read the genotype matrix file and compute the ABBA-BABA tests (D-statistics) and compute their significance using block-jackniffe
- RScripts folder: files with definition of functions used by the bash script and R script analysis_NeoLecPin_feb2017.r


Data and scripts to run the admixture analysis for the hybrids, N. pinetum and N.lecontei in Kentucky, including the following files:

- KY_Pin_F1_nohet_7x_0.5miss_0.01maf.recode.vcf is the vcf file containing the SNPS used in this analyses
- KY_pin_F1_sitesincommon.ped and are the input files for this analysis

- is the script to run the admixture analysis


Data and scripts to perform demographic analyses of Neodiprion sawflies using fastsimcoal2

- Define_NeutralFstThreshold folder: outputfiles to obtain the threshold FST based on simulations
- input_data folder: contains vcf files that are conversted to 2DSFS, the 2DSFS files, and the fastsimcoal input files

- Build_2DSFS folder: scripts  to obtain the 2D SFS from VCF files
- Define_NeutralFstThreshold folder: scripts to obtain the threshold FST based on simulations
- ScriptsLaunchFastsimcoal2 folder: scripts to launch fastsimcoal2 analyses


Data for calculating pi and Fst for N. lecontei and N. pinteum

- KY.text and Pintetum.txt contain the sample names for N. lecontei and N. pinetum respectively
- for all files in this folder KY refers to the Kentucky population of N. lecontei
- KY_Pin_7x_50%miss_nothetexc_0.01maf.recode.vcf is the input vcf file for all three analyses
- The .log files contain the command line code that was used to run the three analyses
- .pi and .fst files are the output files containing the raw pi and Fst values


Data and scripts to go from raw fastq reads to vcf files. 

- Deduplication folder contains fastq_files folder with one fastq file per lane per library for removal of PCR duplicates

- Deduplication folder contains Python_scripts folder includes the python script for removal of PCR duplicates. There is a single python script for each lane and library.
- has the command line input to go from fastq files to BAM files. The output is a single filtered BAM filer per sample


National Science Foundation, Award: DEB-CAREER-1750946

National Science Foundation, Award: DEB-1257739

National Institute of Food and Agriculture, Award: 2015-67011-22803

Fundação para a Ciência e a Tecnologia, Award: UIDB/00329/2020

Fundação para a Ciência e a Tecnologia, Award: CEECIND/02391/2017

Fundação para a Ciência e a Tecnologia, Award: CEECINST/00032/2018/CP1523/CT0008

HORIZON EUROPE Marie Sklodowska-Curie Actions, Award: 799729

Fundação para a Ciência e a Tecnologia, Award: CPCA/A0/7303/2020

National Science Foundation of Sri Lanka, Award: DEB‐1257739

National Science Foundation of Sri Lanka, Award: DEB‐CAREER‐1750946

National Institute of Food and Agriculture, Award: 2015‐67011‐22803

Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Award: CPCA/A0/7303/2020

European Commission, Award: 799729