Genomic evidence for correlated trait combinations and antagonistic selection contributing to counterintuitive genetic patterns of adaptive diapause divergence in Rhagoletis flies
Calvert, McCall et al. (2020), Genomic evidence for correlated trait combinations and antagonistic selection contributing to counterintuitive genetic patterns of adaptive diapause divergence in Rhagoletis flies, Dryad, Dataset, https://doi.org/10.5061/dryad.mkkwh70z0
Adaptation to novel environments often results in unanticipated genomic responses to selection. Here, we illustrate how multifarious, correlational selection helps explain a counterintuitive pattern of genetic divergence between the recently derived apple- and ancestral hawthorn-infesting host races of Rhagoletis pomonella (Diptera: Tephritidae). Flies of the apple host race terminate diapause and emerge as adults earlier in the season than those of the hawthorn host race, coinciding with the earlier fruiting phenology of their apple hosts. However, alleles at many loci associated with later emergence paradoxically occur at higher frequencies in sympatric populations of the apple than of the hawthorn race. We present genomic evidence that historical selection over geographically varying environmental gradients across North America generated genetic correlations between two life history traits, diapause intensity and diapause termination, in the hawthorn host race that are strongly associated with genomic regions in high linkage disequilibrium (LD). These genetic correlations are antagonistic to contemporary selection on local apple host race populations for increased initial diapause depth coupled with earlier, not later, diapause termination. Thus, the paradox of apple flies appears to be due, in part, to pleiotropy or linkage of alleles associated with later adult emergence with increased initial diapause intensity, the latter trait strongly selected for by the earlier phenology of apples. In contrast, loci associated only with diapause termination and not also initial diapause intensity showed the expected pattern (more early-associated alleles in the apple race) in half of sympatric population pairs surveyed. Our results demonstrate how a more complete understanding of multivariate trait combinations and the correlative nature of selective forces acting on them may generally help improve predictions of the genomics of rapid adaptive evolution and explain seemingly counterintuitive patterns of genetic diversity in nature.
This dataset contains a main directory within which there are three subdirectories: "src", "data", and "results". "src" contains all of the analysis scripts, "data" contains the data collected for the associated manuscript and used as input for the scripts in "src", and "results" is an empty directory that the scripts in "src" write their output to. All of the scripts are designed to be run with "main" as the working directory.
Diapause intensity phenotyping
In this study, diapause intensity was measured using stop-flow respirometry and emergence timing. The analysis of the respirometry data can be found in Diapause_traj.R . This script analyses data from several experiments. First, it measures the relative frequency of shallow diapause, non diapause, and diapause in each host race. Initial assignment of individuals to a diapause class was determined by examining their respirometric trajectories over diapause development (datafile: HawAppleAllTrajectories_R.txt). Next, it plots the respirometric trajectories for each host race (datafile: HawAppleAllTrajectories_R.txt). It also plots the respirometric trajectories of the hawthorn flies that were used for the diapause intensity genome-wide association study (GWAS; datafile: AllLiveFlys_R.txt). Finally, it plots a histogram of emergence time for the hawthorn flies that were used for the diapause intensity GWAS (datafile: GlenPreWinterExp2009_R.txt).
The repository contains only the processed vcf files for genomic analysis. The original Fastq files that were used to generate the vcf files can be found on NCBI and the accession numbers will be reported in the associated manuscript. The following scripts can be used to recreate the vcf files in this repo from the Fastq files on NCBI.
MetRat_RAD_processing.sh – This script outlines the workflow to move from raw Fastq files to filtered vcf ready for statistical analysis. It performs alignments using bwa and variant calling using GATK.
sampToSam.pl – This script is referenced in the above workflow. It adds sample information to .sam files; essentially, it adds tags that tell downstream analysis tools (GATK or bcftools) that all reads mapping to one barcode represent a single sample (for sample-specific estimates of SNP genotype, etc.).
ParallelGenotype.pl – This script contains the GATK commands used to call variants and includes functionality for running GATK in parallel.
FilterScatteredGatkVcfs.pl – This script collects the vcf files that the ParallelGenotype.pl script produces and filters them according to a variety of parameters.
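The sample tagging described for sampToSam.pl can be illustrated with a minimal Python sketch. This is not the authors' Perl script: the function name `add_read_group`, the sample name `fly_001`, and the choice to use the same string for the read-group ID and sample name are assumptions made for illustration. The sketch relies only on the SAM convention that a sample is declared in an `@RG` header line and attached to each read with an `RG:Z:` tag.

```python
def add_read_group(sam_lines, sample_id):
    """Return SAM lines with an @RG header and an RG:Z tag on every read.

    sample_id is a hypothetical sample/barcode name, used here as both
    the read-group ID and the sample (SM) name.
    """
    header = [ln for ln in sam_lines if ln.startswith("@")]
    reads = [ln for ln in sam_lines if ln and not ln.startswith("@")]
    # One @RG line declares that these reads belong to a single sample.
    header.append(f"@RG\tID:{sample_id}\tSM:{sample_id}")
    # Each alignment record points back to that read group.
    tagged = [f"{ln}\tRG:Z:{sample_id}" for ln in reads]
    return header + tagged


# Toy input: two header lines and one alignment record (made-up values).
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:contig1\tLN:500",
    "read1\t0\tcontig1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
]
tagged_example = add_read_group(example, "fly_001")
```

With one such pass per barcode, a variant caller can keep genotype estimates separate for each individual.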
All genotyping data was generated using a common RAD-seq approach. The sequencing was performed on Illumina HiSeq machines. Fastq processing, alignment, variant calling, and vcf filtering were all performed according to Egan et al. (2015); see the above scripts. The de novo genome assembly that reads were aligned to was constructed following the approach of Egan et al. (2015). This “radome” is also available in this repository. If you would like to recreate the vcf files here, you must download the Fastq files from NCBI (accessions reported within the accompanying manuscript) and run the genomic processing scripts referenced above.
Each vcf file corresponds to a different experiment that was examined in the associated manuscript.
snps.GATK.all.filtered.vcf – This vcf file contains data that is novel to this study. We genotyped 64 individuals from each of three diapause intensity classes: non diapause (ND), shallow diapause (SD), and diapause (DIA), for a total of 192 individuals. The diapause class information for each individual can be found in the following txt files: complete_diaIds.txt , complete_nonDiaIds.txt , complete_shallowDiaIds.txt.
PJM123.UG.GATK.all.filtered.vcf – Data originally from an extreme-phenotype analysis of genetic associations with diapause termination time in both host races. Individuals were collected from field-collected fruit and reared under common garden winter conditions, and flies were allowed to emerge post-winter. Flies below the 3rd percentile and above the 97th percentile of emergence times were sequenced. The host race and emergence information for each individual can be found in the file appleHawIds.txt .
snps.ScottSelection.05-400-20.vcf – Data originally reported in Egan et al. (2015) and Doellman et al. (2019). Both host races were exposed to a short (7-day) vs. a long (32-day) pre-winter period to simulate pre-winter conditions typically experienced by the hawthorn and apple races in nature, respectively, followed by genotypic comparisons pre- and post-selection. The hawthorn race study included a genetic host race comparison at Grant, MI. The host race and pre-winter treatment information for each individual can be found in the file SelectionIdDescriptors.txt .
clinal.UG.GATK.all.good.filtered.vcf – Data originally reported in Doellman et al. (2019). Random samples from the apple and hawthorn races reared from field-collected fruit at Fennville, MI, Dowagiac, MI, and Urbana, IL were compared against each other (including previous Grant, MI data) and within each site to assess geographic variation relative to host divergence. The host race and geographic location information for each individual can be found in the file clinal.pops.txt .
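The tail-sampling rule described for PJM123.UG.GATK.all.filtered.vcf (flies below the 3rd and above the 97th percentile of emergence times) can be sketched as follows. This is an illustration only, not code from the repository: the function name `extreme_tails`, the fly identifiers, and the emergence days are all hypothetical, and the percentile interpolation used here (Python's "inclusive" method) is an assumption that may differ from how the cutoffs were actually computed.

```python
import statistics


def extreme_tails(emergence_days, lower_pct=3, upper_pct=97):
    """Return (early, late) lists of IDs in the tails of emergence time.

    emergence_days maps a hypothetical fly ID to its emergence day.
    """
    values = list(emergence_days.values())
    # With n=100, quantiles() returns the 99 percentile cut points,
    # so index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    early = sorted(k for k, d in emergence_days.items() if d < low)
    late = sorted(k for k, d in emergence_days.items() if d > high)
    return early, late


# Made-up data: 100 flies emerging on days 41 through 140.
toy_days = {f"fly{i:03d}": 40 + i for i in range(1, 101)}
early_flies, late_flies = extreme_tails(toy_days)
```

Sampling only the two tails concentrates sequencing effort on the individuals most likely to differ at loci affecting diapause termination.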
clines3_functions.R – This contains the custom functions that are used in the analysis scripts contained in this repo.
Vcf_to_genos.R – This script must be run before any of the genomic analysis scripts can be run. It first creates an index of which loci should be polarized to ensure that each dataset’s alleles are polarized to the most common alleles in the hawthorn host race at Grant, MI. It also creates an RData file (Mapped_RAD_loci.Rdata) that contains chromosome and LD group information for loci where mapping data is available. Creation of this RData file requires access to an internal database; therefore, we have made this RData file available in this repository. Finally, it also creates txt files that contain the mean genotypes for each phenotype/host race across all chromosomes and LD groups. These files are then input into many of the analysis scripts in this repository.
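The polarization step can be pictured with a short Python sketch. It is not the R code in Vcf_to_genos.R: the function names, the locus names, the 0/1/2 genotype coding, and the 0.5 threshold are assumptions chosen for illustration. The idea shown is that a locus is flagged for flipping when the counted allele is the minor allele in the reference population (here standing in for the hawthorn race at Grant, MI), and flagged loci are then recoded in every dataset.

```python
def polarization_index(ref_pop_freqs):
    """Return loci whose counted allele is the minor allele in the
    reference population and therefore needs flipping (assumed rule)."""
    return {locus for locus, freq in ref_pop_freqs.items() if freq < 0.5}


def polarize(genotypes, flip_loci):
    """Recode 0/1/2 allele counts so the counted allele is the reference
    population's most common allele. Missing calls (None) stay missing."""
    polarized = {}
    for locus, calls in genotypes.items():
        if locus in flip_loci:
            polarized[locus] = [None if g is None else 2 - g for g in calls]
        else:
            polarized[locus] = list(calls)
    return polarized


# Hypothetical reference-population frequencies of the counted allele.
ref_freqs = {"locusA": 0.8, "locusB": 0.3}
flip = polarization_index(ref_freqs)

# Hypothetical genotypes from another dataset, coded as allele counts.
other_dataset = {"locusA": [0, 1, 2], "locusB": [0, 1, None, 2]}
polarized_dataset = polarize(other_dataset, flip)
```

Applying one shared index to every dataset keeps allele frequency differences comparable in sign across experiments.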
BSLMM_formatting.R – This script produces genotype and phenotype files formatted for the gemma software that is used to run the Bayesian sparse linear mixed model (BSLMM).
bslmm_script.txt – This contains the commands that were passed to gemma to run the BSLMM.
BSLMM_polygenic_pub.R – This script checks the convergence of the BSLMM. Note that convergence is poor for some of the parameters due to the relatively small size of these experiments; see the manuscript and supplement for more details. This script also plots model estimates from the BSLMM and produces polygenic scores from loci that surpassed a posterior inclusion probability threshold.
AlleleFreqDif_est_v2.R – This script performs the allele frequency difference estimation and significance testing through Monte Carlo permutation resampling for all of the genomic datasets contained in the repo.
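For readers unfamiliar with the approach, a Monte Carlo permutation test for an allele frequency difference at a single locus might look like the Python sketch below. It is not a translation of AlleleFreqDif_est_v2.R: the function names, the 0/1/2 genotype coding, the choice to permute individuals between the two groups, the two-sided test, and the add-one p-value correction are all assumptions made for illustration.

```python
import random


def allele_freq(genotypes):
    """Frequency of the counted allele from diploid 0/1/2 genotypes."""
    return sum(genotypes) / (2 * len(genotypes))


def perm_test_freq_diff(group_a, group_b, n_perm=10000, seed=1):
    """Two-sided Monte Carlo permutation test for a frequency difference.

    Individuals are shuffled between the two groups (an assumed null
    model) and the difference is recomputed for each shuffle.
    """
    rng = random.Random(seed)
    observed = allele_freq(group_a) - allele_freq(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = allele_freq(pooled[:n_a]) - allele_freq(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Made-up genotypes for two phenotype classes at one locus.
class_1 = [2, 2, 1, 2, 2, 1, 2, 2]
class_2 = [0, 1, 0, 0, 1, 0, 0, 1]
diff_obs, p_est = perm_test_freq_diff(class_1, class_2, n_perm=2000)
```

Because the null distribution is built from the data themselves, the test makes no assumption about the sampling distribution of the frequency difference.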
AlleleFreqDif_Cor_test_DiaVsEcl.R – This script performs estimation of the correlation between allele frequency differences in the diapause intensity GWAS with allele frequency differences in the diapause termination GWAS. It also performs significance testing through a Monte Carlo permutation approach similar to that in AlleleFreqDif_est_v2.R .
AlleleFreqDif_Cor_test_DiaVsSel.R – This script performs estimation of the correlation between allele frequency differences in the diapause intensity GWAS with allele frequency differences in the pre-winter selection experiment. It also performs significance testing through a Monte Carlo permutation approach similar to that in AlleleFreqDif_est_v2.R .
AlleleFreqDif_Cor_test_DiaVsHost.R – This script performs estimation of the correlation between allele frequency differences in the diapause intensity GWAS with allele frequency differences between host races at Grant, MI. It also performs significance testing through a Monte Carlo permutation approach similar to that in AlleleFreqDif_est_v2.R .
AlleleFreqDif_Cor_test_DiaVsGeo.R – This script performs estimation of the correlation between allele frequency differences in the diapause intensity GWAS with allele frequency differences between populations at the northern (Grant, MI) and southern (Urbana, IL) parts of the range. It also performs significance testing through a Monte Carlo permutation approach similar to that in AlleleFreqDif_est_v2.R .
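The four AlleleFreqDif_Cor_test scripts share one idea: correlate per-locus allele frequency differences from two experiments and judge the correlation against a permutation null. A minimal Python sketch of that idea follows. It is not the authors' R code, and the null model is an assumption: here the per-locus values of one experiment are shuffled across loci, whereas the repository scripts may instead permute individuals among phenotype classes, which also preserves LD among loci. All names and values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def perm_test_correlation(diffs_1, diffs_2, n_perm=10000, seed=1):
    """Two-sided permutation test for a correlation between two sets of
    per-locus allele frequency differences (locus order must match)."""
    rng = random.Random(seed)
    observed = pearson(diffs_1, diffs_2)
    shuffled = list(diffs_2)
    extreme = 0
    for _ in range(n_perm):
        # Assumed null: break the pairing of loci between experiments.
        rng.shuffle(shuffled)
        if abs(pearson(diffs_1, shuffled)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Made-up per-locus differences from two experiments, same locus order.
gwas_diffs = [-0.30, -0.20, -0.10, 0.00, 0.10, 0.20, 0.30, 0.40]
other_diffs = [-0.12, -0.10, -0.02, 0.01, 0.03, 0.09, 0.11, 0.20]
r_obs, p_est_cor = perm_test_correlation(gwas_diffs, other_diffs, n_perm=2000)
```

A positive correlation in such a comparison indicates that alleles shifting in one experiment tend to shift in the same direction in the other.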
percent_perm.R – This script performs estimation and significance testing for the enrichment of significant SNPs in specific chromosome and LD groups associated with diapause intensity.
x_fold.R – This script performs estimation and significance testing for x-fold enrichment of loci significantly associated with both diapause intensity and diapause termination on specific chromosome and LD groups.
Dia_ecl_clines.R – This script estimates allele frequencies across the geographic range of Rhagoletis pomonella for loci significantly associated with diapause intensity and diapause termination.