Recurrent selection shapes the genomic landscape of differentiation between a pair of host-specialized haplodiploids that diverged with gene flow

Glover, Ashleigh 1 ; Sousa, Vitor2; Ridenbaugh, Ryan1; Sim, Sheina3; Geib, Scott3; Linnen, Catherine1

Published Aug 22, 2024 on Dryad. https://doi.org/10.5061/dryad.fxpnvx128

Data files

Aug 22, 2024 version files 532.32 MB

MolEcol_scripts.zip

532.31 MB
README.md

5 KB

Abstract

Understanding the genetics of adaptation and speciation is critical for a complete picture of how biodiversity is generated and maintained. Heterogeneous genomic differentiation between diverging taxa is commonly documented, with genomic regions of high differentiation interpreted as resulting from differential gene flow, linked selection, and reduced recombination rates. Disentangling the roles of each of these non-exclusive processes in shaping genome-wide patterns of divergence is challenging but will enhance our knowledge of the repeatability of genomic landscapes across taxa. Here, we combine whole-genome resequencing and genome feature data to investigate the processes shaping the genomic landscape of differentiation for a sister-species pair of haplodiploid pine sawflies, Neodiprion lecontei and Neodiprion pinetum. We find genome-wide correlations between genome features and summary statistics are consistent with pervasive linked selection, with patterns of diversity and divergence more consistently predicted by exon density and recombination rate than the neutral mutation rate (approximated by dS). We also find that both global and local patterns of FST, dXY, and π provide strong support for recurrent selection as the primary selective process shaping variation across pine sawfly genomes, with some contribution from balancing selection and lineage-specific linked selection. Because inheritance patterns for haplodiploid genomes are analogous to those of sex chromosomes, we hypothesize that haplodiploids may be especially prone to recurrent selection, even if gene flow occurred throughout divergence. Overall, our study helps fill an important taxonomic gap in the genomic landscape literature and contributes to our understanding of the processes that shape genome-wide patterns of genetic variation.

https://doi.org/10.5061/dryad.fxpnvx128

Description of the data and file structure

To perform all analyses, run scripts provided in each folder in the order indicated in the folder name. Each folder has its own README file providing details specific to that analysis.

NOTE: for all bash scripts (*.sh), you will have to modify them to run on your cluster. However, the program commands should work as long as you are using the same version of the program (program versions indicated in the Materials and Method section of the manuscript).

Files and variables

File: MolEcol_scripts.zip

Description:

In the first folder ("01.get_windowed_statistics"), each summary statistic and genome feature has a separate folder containing all scripts and input files necessary to generate the statistic. Subfolder descriptions:

"Fst" = scripts (*sh) and input files ("Input_files" folder) to calculate windowed Fst
"nucleotide_diversity" = scripts (*.sh) and input files ("Input_files" folder) to calculate windowed π, Tajima's D, and Fay & Wu's H for each species separately
"dxy" = scripts (*sh, *.R) and input files ("Input_files" folder) to calculate windowed dxy
"recombination_rate" = estimates of recombination rate (cM/Mb) for each 50kbp window from Linnen et al. 2018, Herrig et al. 2024
"exon_density" = script ("get_exon_density.R") and input gff file ("GCF_021901455.1_iyNeoLeco1.1_genomic.gtf") for estimating exon density in each 50kbp window
"mutation_rate" = scripts and input files to calculate windowed synonymous substitution rate (dS; proxy for neutral mutation rate)
"distance_from_centromere" = all scripts required to perform contig assembly and HiC scaffolding (*sh, "mosdepth.slurm") and analyze output/generate plots to estimate the midpoint of the centromere (bp) for each chromosome of the Neodiprion lecontei reference genome ("neodiprion_lecontei_chromosomes.Rmd")

After generating each statistic above, proceed to folder "02.combine_windowed_datasets". This folder contains all input files ("Input_files" folder) and script ("combine_windowed_datasets.R") to combine all individual outputs from folder one into a single data sheet for downstream analyses. Column descriptions of output compiled data set:

chr = chromosome
midPos = mid-position of the window (bp)
window_start = start position of the window (bp)
window_stop = end position of the window (bp)
nSites_fst = number of called sites used to calculate Fst in the window
fst = Fst estimate
nSites_dxy = number of called sites used to calculate dxy in the window
avg.dxy = average dxy estimate
nSites_lecontei = number of called sites used to calculate diversity and selection statistics (π, Tajima's D, Fay & Wu's H) for Neodiprion lecontei
pi.lecontei = π estimate for N. lecontei
Tajima_lecontei = Tajima's D estimate for N. lecontei
fayh_lecontei = Fay and Wu's H estimate for N. lecontei
nSites_pinetum = number of called sites used to calculate diversity and selection statistics (π, Tajima's D, Fay & Wu's H) for N. pinetum
pi.pinetum = π estimate for N. pinetum
Tajima_pinetum = Tajima's D estimate for N. pinetum
fayh_pinetum = Fay and Wu's H estimate for N. pinetum
recomb.rate = recombination rate (cM/Mb)
exon.density = exon density
lecontei_ds = dS estimate (proxy for mutation rate) for N. lecontei
pinetum_ds = dS estimate (proxy for mutation rate) for N. pinetum
mean_ds = average dS estimate of N. lecontei and N. pinetum
mean.pi = average π estimate of N. lecontei and N. pinetum
dist_from_cent = distance from the centromere

All missing values are indicated with "NA".

The third folder ("03.Pearsons_correlations") contains the input file ("combined_statistics_50kbp_filtered.csv") and script ("Pearsons_correlations.R") necessary to complete genome wide correlations (Pearson's correlation coefficients and significance tests) between all pairs of summary statistics and genome feature measures.

The fourth folder ("04.Multiple_regression") contains the input file ("combined_statistics_50kbp_filtered.csv") and script ("multiple_regression.R") necessary to complete all multiple linear regression models.

The last folder ("05.local_patterns") contains the input file ("combined_statistics_50kbp_filtered.csv") and script ("local_patterns.R") necessary to identify windows that exhibit patterns of Fst, dxy, and π matching one of the four evolutionary scenarios (divergence-with-gene-flow, recurrent selection, balancing selection, allopatric selection).

Code/software

Each analysis contains a README file describing how to run all scripts. If any modifications are required to run the scripts, these are described in the README files.

We collected Neodiprion lecontei and N. pinetum mid- to late-instar larval colonies from Lexington, Kentucky and surrounding areas. To confirm sex (and therefore ploidy), we reared the larvae to adults in the lab using standard lab protocols. The adults were either preserved in 100% ethanol (stored at -20 °C) or flash frozen and stored at -80 °C. To avoid sampling close relatives, we selected one individual from each larval colony (each colony typically represents a group of siblings). In total, we sampled 20 N. lecontei females and 18 N. pinetum females. We extracted DNA from head and thorax tissue with a Qiagen DNeasy Blood & Tissue Kit. We followed the standard manufacturer protocol for insects, including an optional RNase A step. DNA concentration was measured with a Quant-iT dsDNA High-Sensitivity fluorescence assay (Invitrogen). A single library was prepared using a Tn5 tagmentation protocol following Bendall (2020). Whole-genome resequencing was performed with 150bp paired-end sequencing technology in an Illumina Hi-Seq X sequencer at Admera Health (Plainfield, NJ). The library was first run in a single lane of the sequencer. A subset of samples that had low read counts were then re-pooled and run on a second lane.

Demultiplexed reads were cleaned using trimmomatic v0.39 with the following criteria: (a) remove adapters, (b) perform sliding window trimming where a sequence is cut when a window (4bp) drops below a quality threshold (15), (c) remove low quality bases (quality score < 3) from the beginning and end of each read. These trimmed reads are published in the NCBI SRA database (BioProject accession number PRJNA1107580). Then, for each lane separately, the cleaned reads were mapped to the high-quality chromosome-level N. lecontei reference genome (iyNeoLeco1.1, GCA_021901455.1) using the BWA-MEM algorithm in bwa v0.7.17. We used samtools v1.13 to mark PCR duplicates (‘markdup’) and remove ambiguously mapped reads/secondary alignments (‘view -F 1284 -f 0x02’). We then used samtools to merge the filtered reads from the two lanes (‘merge’) and index the resulting bam files (‘index’).

All genetic summary statistics were calculated in ANGSD v0.933. Recombination rates (cM/Mb) were obtained from a previously published high density linkage map generated from a cross between divergent N. lecontei populations (Linnen et al. 2018; Herrig et al. 2024). We used the Neodiprion lecontei genome annotation (GCF_021901455.1_iyNeoLeco1.1_genomic.gtf; Herrig et al. 2024) to calculate exon density in 50kbp windows. To estimate mutation rate, we calculated the synonymous substitution rate (dS) in 50kbp windows. To estimate distance from the centromere, we first used Juicebox v1.11.08 to visualize the HiC contacts for each chromosome and estimated the midpoint of each centromere by identifying the local maximum delta in the number of contacts between adjacent loci within each chromosome. We then calculated the distance of each window from the centromere by taking the absolute value of the midpoint of each window subtracted from the midpoint of the centromere.