Recent climate change and historical population structure predict spatial patterns of admixture between two host-specialized pine sawfly species

Glover, Ashleigh 1 ; Linnen, Catherine 1

Published Nov 24, 2025 on Dryad. https://doi.org/10.5061/dryad.kh18932jw

Data files

Nov 24, 2025 version files 42.62 GB

Glover_Linnen_dryad.zip

42.62 GB
README.md

10.57 KB

Abstract

Human disturbance can have profound effects on biodiversity, including increasing hybridization between reproductively isolated species. One approach for understanding how human activity affects hybridization dynamics is to evaluate correlations between disturbance (e.g., urbanization, temperature change) and hybridization. Because variation in hybridization can also arise from historical factors unrelated to recent human disturbance, it is essential to account for population structure to avoid spurious correlations. Here, we combine environmental and high-coverage whole-genome resequencing data to investigate how human disturbance and population structure affect hybridization dynamics between a pair of pine sawflies adapted to different pines, Neodiprion lecontei and Neodiprion pinetum. We find that N. lecontei and N. pinetum exhibit strikingly different patterns of population structure, which we hypothesize stems from differences in host. We also find that recent admixture is both asymmetric and geographically variable. Linear regression analyses reveal that admixture proportion is predicted by indirect human disturbance (i.e., climate change) and not direct human disturbance (e.g., urbanization) in both N. lecontei and N. pinetum. Lastly, in N. pinetum, we find evidence of a spurious association between admixture and direct human disturbance that disappears when regression models account for population structure via inclusion of genetic principal component scores as covariates. Together, our data suggest that indirect human disturbance and population structure both contribute to geographic variation in admixture between N. lecontei and N. pinetum. Our study also highlights the importance of adequately controlling for population structure when attempting to identify environmental predictors (human disturbance-related or not) of hybridization.

Dataset DOI: 10.5061/dryad.kh18932jw

Description of the data and file structure

To perform all analyses, run scripts provided in the order indicated in the folder/script name (i.e., "01*" to "07*").

NOTE: for all bash scripts (*.sh), you will have to modify them to run on your cluster. However, the program commands should work as long as you are using the same version of the program (program versions indicated in the Materials and Method section of the manuscript).

Files and variables

File: Glover_Linnen_dryad.zip

Description:

Input files required to run analyses are provided:

"allsites_LecPineHetr_filtered_ExclIndvs_minQ20_AllImbHom_refilter_bcftools_norm.vcf.gz" - required for ABBA-BABA analysis ("06_ABBA-BABA" folder); vcf file contains variant and invariant sites for all N. lecontei, N. pinetum, and N. hetricki individuals included in analysis
"snps_LecPine_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for ADMIXTURE ("03_ADMIXTURE" folder) and principal component analysis (PCA) for both species combined ("04_run_pcas.sh" script) analyses; vcf file contains linkage-pruned SNPs for all N. lecontei and N. pinetum individuals included in analyses (i.e., contaminated/putative haploid male individuals removed)
"snps_LecOnly_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for PCA for N. lecontei only analysis ("04_run_pcas.sh" script"); vcf file contains linkage-pruned SNPs for all N. lecontei individuals included in analysis (i.e., contaminated/putative haploid male individuals removed)
"snps_PineOnly_noHybrid_filtered_ExclIndvs_MAF05_minQ20_AllImbHom_refilter_pruned_bcftools_norm.vcf.gz" - required for PCA for *N. pinetum *only analysis ("04_run_pcas.sh" script"); vcf file contains linkage-pruned SNPs for all *N. pinetum *individuals included in analysis (i.e., contaminated/putative haploid male individuals and F1 hybrid called N. pinetum at time of collection due to being collected on white pine removed)
"fulldata_LecPine.csv", "fulldata_LecOnly.csv", "fulldata_PineOnly.csv" - full/compiled data for all N. lecontei and N. pinetum individuals together, all N. lecontei individuals, and all N. pinetum individuals included in downstream analyses, respectively. These files are required to run the "05_plot_PCAs_ADMIXTURE.R" and "07_run_models.R" scripts.

Description of columns in the *.csv files (#5) above (each row represents an individual sawfly):

"ind" = unique individual identifier assigned by the sequencing center (Admera Health)
"genPC1_*" = genetic PC1 score from PCA of both species combined ("genPC1_LecPine" in "fulldata_LecPine.csv" file), N. lecontei only ("genPC1_LecOnly" in "fulldata_LecOnly.csv" file), or N. pinetum only ("genPC1_PineOnly" in "fulldata_PineOnly.csv" file)
"genPC2_*" = genetic PC2 score from PCA of both species combined ("genPC2_LecPine" in "fulldata_LecPine.csv" file), N. lecontei only ("genPC2_LecOnly" in "fulldata_LecOnly.csv" file), or N. pinetum only ("genPC2_PineOnly" in "fulldata_PineOnly.csv" file)
"genPC3_*" = genetic PC3 score from PCA of both species combined ("genPC3_LecPine" in "fulldata_LecPine.csv" file), N. lecontei only ("genPC3_LecOnly" in "fulldata_LecOnly.csv" file), or N. pinetum only ("genPC3_PineOnly" in "fulldata_PineOnly.csv" file)
"AG_ID" = unique individual identifier assigned by the Linnen lab
"avg_perc_PP_reads_mapped" = percentage of properly paired reads mapped; averaged between the two sequencing batches
"MERGED_FINAL_read_count" = total read count across two sequencing batches
"species" = species identity
"city" = city individual was collected in
"state" = state individual was collected in
"area" = geographic area ("North" or "Central") that individual was assigned to based on ADMIXTURE ("03_ADMIXTURE" folder) and PCA ("04_run_pcas.sh" script) analyses
"latitude" = latitude individual was collected at
"longitude" = longitude individual was collected at
"host" = pine host individual was collected on
"stage" = life stage of individual at time of DNA extraction, either "larva" or adult female ("adultF")
"lcPC1" = PC1 score from PCA of land cover proportions (from analyses in "01_LandCover_data" folder)
"lcPC2" = PC2 score from PCA of land cover proportions (from analyses in "01_LandCover_data" folder)
"change_tmax" = change in average daily maximum temperature from the 1980's to the 2010's in degrees Celsius (from analyses in "02_temperature_data" folder)
"site_number" = arbitrary number assigned to sampling site; each unique sampling site (i.e., unique latitude/longitude GPS coordinates) was assigned an arbitrary number (1-189)
"grouped_site_number_5km" = grouped site number, where all sampling sites that were within 5km of each other were considered a “grouped site” and were assigned an arbitrary grouped site number
"allele_imbalance" = percentage of heterozygote calls displaying allele balance ratios less than 0.3
"avg_depth_coverage" = average depth coverage across all linkage-pruned SNPs
"missingness" = proportion of sites with a missing genotype
"heterozygosity" = proportion of sites with a heterozygous genotype
"pop" = population label for making ADMIXTURE plots ("05_plot_PCAs_ADMIXTURE.R" script); [species]_[sampling state]
"admix_prop_k2" = admixture proportion for K=2 (i.e., species-level admixture)
"lec_blue_k2" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to N. lecontei ancestry across all 100 ADMIXTURE runs for K=2
"pine_orange_k2" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to N. pinetum ancestry across all 100 ADMIXTURE runs for K=2
"lec_blue_k3" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to N. lecontei ancestry across all 100 ADMIXTURE runs for K=3
"pine_orange_k3" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to N. pinetum ancestry across all 100 ADMIXTURE runs for K=3
"lec_purple_k3" = final assignment probability for "purple" genetic group (color assigned by CLUMPAK) that corresponds to N. lecontei ancestry across all 100 ADMIXTURE runs for K=3
"lec_blue_k4" = final assignment probability for "blue" genetic group (color assigned by CLUMPAK) that corresponds to N. lecontei ancestry across all 100 ADMIXTURE runs for K=4
"pine_orange_k4" = final assignment probability for "orange" genetic group (color assigned by CLUMPAK) that corresponds to N. pinetum ancestry across all 100 ADMIXTURE runs for K=4
"lec_purple_k4" = final assignment probability for "purple" genetic group (color assigned by CLUMPAK) that corresponds to N. lecontei ancestry across all 100 ADMIXTURE runs for K=4
"lec_green_k4" = final assignment probability for "green" genetic group (color assigned by CLUMPAK) that corresponds to N. lecontei ancestry across all 100 ADMIXTURE runs for K=4

Perform analyses:

The first folder ("01_LandCover_data") contains the R script to calculate land cover proportions within a 5km-radius buffer around each site's GPS coordinates ("get_LandCover_props.R"). The required input land cover raster files to run this script are available in the public domain at (https://www.mrlc.gov/viewer/) for the United States and (http://www.cec.org/north-american-environmental-atlas/land-cover-30m-2020/) for Canada. This folder also contains the input files ("for_lcPCA.csv", "latlong.csv") and script ("LandCover_PCA.R") necessary to complete the PCA of land cover proportions for each of the 189 sampling sites. Specifically, the "for_lcPCA.csv" file contains, for each sampling site, the proportions of each land cover class within a 5km radius buffer around the GPS coordinates of the sampling site; the "latlong.csv" file contains the GPS coordinates (latitude and longitude) of each sampling site.

The second folder ("02_temperature_data") contains the input file ("LatLong.csv") and script ("get_tmax.R") necessary to extract the average of daily maximum temperature for each month within each year (1980-1989 and 2010-2019) for the latitude/longitude that each individual sawfly was collected at. The required input temperature raster files to run this script are available in the public domain at (https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2131) for the United States and Canada.

The third folder ("03_ADMIXTURE") contains the script ("run_admixture.sh") to perform 100 ADMIXTURE runs for each K (1-10). The output file from CLUMPAK for K=2 is also provided ("ClumppIndFile.output"). In this text file, each row represents an individual; the proportion of the individual's ancestry originating from each of the two inferred ancestral populations are provided (last two columns), from which admixture proportions can be extracted.

The script to perform the next analysis (PCA) is provided ("04_run_pcas.sh").

Next, the script to plot the results from the ADMIXTURE and PCA analyses above is provided ("05_plot_PCAs_ADMIXTURE.R").

The next folder ("06_ABBA-BABA") contains the scripts (".sh") and input files (".txt" and ".nwk") to perform the ABBA-BABA analysis for the "North" and "Central" geographic regions separately. For each region, the ".txt" file indicates which of the four taxa each individual belongs to (i.e., P₁, P₂, P₃, or Outgroup) - individuals that are in the vcf file but excluded from the analysis are indicated with an "xxx". The ".nwk" file is a text file that represents the phylogenetic tree of the four taxa above in the Newick format (i.e., uses parentheses and commas to show the relationships between taxa).

Finally, the script to perform the multiple linear regression models and plot the model results is provided ("07_run_models.R").

Code/software

All scripts to perform analyses are described above. All software and R package versions are described in the Materials and Methods section of the manuscript.