Ancient hybridization leads to the repeated evolution of red flowers across a monkeyflower radiation
Data files
May 11, 2023 version files 6.11 GB
-
clipped.Fd.pi_dxy.w_50000.m_5000.raw_unified_genotyper_postBQSR_all_sites.geno.csv
-
Fd.pi_dxy.w_50000.m_5000.raw_unified_genotyper_postBQSR_all_sites.geno.csv
-
LG4.OC_R.vcf
-
LG4.puniceus_sd_R.vcf
-
LG4.puniceus_sd_Y.vcf
-
LG4.rutilus.vcf
-
lon_aridus.w_10000.s_100_dxy.txt
-
lon_rut.w_10000.s_100_dxy.txt
-
MYB2.LG4.fasta
-
nomissing.biallelic.phased_chrom_filtered_vars_uniGT.LDpruned.min4.nexus
-
nomissing.biallelic.phased_chrom_filtered_vars_uniGT.LDpruned.vcf.gz
-
nomissing.biallelic.phased_chrom_filtered_vars_uniGT.min4.phy
-
nomissing.biallelic.phased_chrom_filtered_vars_uniGT.min4.phy.contree
-
nomissing.biallelic.phased_chrom_filtered_vars_uniGT.vcf
-
OCR_aridus.w_10000.s_100_dxy.txt
-
OCR_OCY.w_10000.s_100_dxy.txt
-
OCR_rut.w_10000.s_100_dxy.txt
-
OCY_aridus.w_10000.s_100_dxy.txt
-
README.md
-
rut_aridus.w_10000.s_100_dxy.txt
-
SDR_aridus.w_10000.s_100_dxy.txt
-
SDR_OCR.w_10000.s_100_dxy.txt
-
SDR_rut.w_10000.s_100_dxy.txt
-
SDR_SDY.w_10000.s_100_dxy.txt
-
SDY_aridus.w_10000.s_100_dxy.txt
-
w_10000.s_100.aurantiacus.puniceus_oc_R.aridus.nomissing_biallelic_clade_D_aridus_parviflorus.geno.fd.csv
-
w_10000.s_100.aurantiacus.puniceus_sd_R.aridus.nomissing_biallelic_clade_D_aridus_parviflorus.geno.fd.csv
-
w_10000.s_100.aurantiacus.rutilus.aridus.nomissing_biallelic_clade_D_aridus_parviflorus.geno.fd.csv
-
whole_genome_sample_info.txt
-
wout_aur_clade_D_aridus_parviflorus_no_LD_filter.eigenval
-
wout_aur_clade_D_aridus_parviflorus_no_LD_filter.eigenvec
-
wout_aur_labels.csv
Abstract
The re-use of old genetic variation can promote rapid diversification in evolutionary radiations, but in most cases, the historical events underlying this divergence are not known. For example, ancient hybridization can generate new combinations of alleles that sort into descendant lineages, potentially providing the raw material to initiate divergence. In the Mimulus aurantiacus species complex, there is evidence for widespread gene flow among members of this radiation. In addition, allelic variation in the MaMyb2 gene is responsible for differences in flower color between the closely related ecotypes of subspecies puniceus, contributing to reproductive isolation by pollinators. Previous work suggested that MaMyb2 was introgressed into the red-flowered ecotype of puniceus. However, additional taxa within the radiation have independently evolved red flowers from their yellow-flowered ancestors, raising the possibility that this introgression had a more ancient origin. In this study, we used repeated tests of admixture from whole-genome sequence data across this diverse radiation to demonstrate that there has been both ancient and recurrent hybridization in this group. However, most of the signal of this ancient introgression has been removed due to selection, suggesting that widespread barriers to gene flow are in place between taxa. Yet, a roughly 30 kb region that contains the MaMyb2 gene is currently shared only among the red-flowered taxa. Patterns of admixture, sequence divergence, and extended haplotype homozygosity across this region confirm a history of ancient hybridization, where functional variants have been preserved due to positive selection in red-flowered taxa but lost in their yellow-flowered counterparts. The results of this study reveal that selection against gene flow can reduce genomic signatures of ancient hybridization, but that historical introgression can provide essential genetic variation that facilitates the repeated evolution of phenotypic traits between lineages.
Methods
Genome resequencing and variant calling
Leaf tissue was collected from the wild (Table S1), and DNA from the 10 new samples was extracted, prepared, and sequenced as described in Stankowski et al. (2019). Variant calling, filtering, and haplotype phasing were performed as described in Stankowski et al. (2019) after mapping reads from all 47 samples to the M. aurantiacus reference genome. We then further filtered the VCF file for biallelic SNPs and removed sites with missing genotype data. The final data set contained 12,730,568 SNPs across all 47 samples.
PCA, network, and phylogenetic analyses
Due to evidence of previous gene flow among taxa in this radiation (Stankowski et al 2019), we generated a neighbor-joining splits network using SplitsTree4 v.4.17.1 (Huson, Kloepper, & Bryant 2008) from all 47 samples. A splits network provides a way to visualize historical events, such as hybridization, in a tree-like framework without requiring a fully bifurcating tree. We filtered SNPs for linkage disequilibrium in Plink version 1.90 (Chang et al. 2015) to remove variant sites with an r2 greater than 0.2 in 50 kb windows, sliding by 10 kb, resulting in a data set containing 1,640,850 SNPs. For comparison, we also used IQ-TREE v1.6.12 (Nguyen et al. 2015) to generate a maximum likelihood consensus tree from a concatenated dataset of all 12,730,568 biallelic SNPs across the 47 samples. Support for each node was generated from 1000 bootstrap replicates, visualized using FigTree v1.4.4 (Rambaut, 2009), and rooted on the branch leading to the M. clevelandii samples. To assess clustering patterns among more closely related samples, we ran a principal components analysis (PCA) in Plink using only the 27 samples from clade D.
Tests for genome-wide admixture
We calculated Patterson’s D for all possible groups of three in-group taxa using the Dtrios command in the Dsuite software package (Malinsky, Matschiner, & Svardal 2021) based on the relationships inferred from genome-wide data in Stankowski et al. (2019), with M. clevelandii used as the outgroup for all tests. The four samples from subspecies grandiflorus were removed prior to this analysis, as the calculation of D using these samples does not provide information about the history of introgression between clade D and aridus. We corrected for multiple tests using a Bonferroni-corrected alpha of 0.0009.
fd was calculated in 50 kb non-overlapping genomic windows using the ABBABABAwindows.py Python script (https://github.com/simonhmartin/genomics_general) and smoothed across the genome using the loess function from the scales package (Wickham and Seidel 2022) in R with a span of 0.02. To determine the average level of admixed genetic variation between aridus and the red or yellow ecotypes using P1 taxa at different levels of taxonomic divergence, we estimated the mean fd from each test across 50 kb windows. To estimate the average level of sequence divergence between the red or yellow ecotypes and the taxa from clades C and D, we calculated average da in 50 kb windows for each pair of taxa (Nei & Li 1979). da accounts for genetic divergence that predated species divergence by subtracting the average intraspecific pairwise differences (π in both species) from the observed interspecific value (dxy). π and dxy were calculated using PIXY version 1.2.5 (Korunes & Samuk 2021) with both variant and invariant sites included. To test for significant differences among the mean fd values, we fit a linear model with fd as the dependent variable and the P1 taxon as the independent variable using the lm function in R and then used the emmeans package (Lenth, 2019) to perform pairwise comparisons of the estimated marginal mean fd values for the different P1s from the linear model.
We estimated the Spearman’s rank correlation coefficient between fd, FST, and π using the cor.test function in R. FST was calculated in each 50 kb window between the red or yellow ecotype and one of the five remaining taxa from clades C and D using the popgenwindows.py Python script (https://github.com/simonhmartin/genomics_general), with only variant sites included. Finally, we sorted the fd values calculated in 50 kb windows into quantile bins based on recombination rates estimated in 500 kb windows in Stankowski et al. (2019). We then calculated the mean and 95% confidence intervals of the fd values within each recombination rate quantile bin. To test for differences in fd among recombination bins, we fit a linear model with fd as the dependent variable and recombination bin as the independent variable and then perform pairwise comparisons of the estimated marginal mean fd values for the different recombination bins from the linear model.
Admixture and genetic divergence across the MaMyb2 region
To determine the history of the red allele at MaMyb2 in the different lineages of clade D, we calculated fd and dxy as described above, but in overlapping 10 kb windows with 100 bp steps. fd was calculated three times, each time with aurantiacus as P1 and aridus as P3, but varying the different red samples (red ecotype, red-OC, and red longiflorus) as P2. dxy was calculated between each of the red taxa and aridus, each of their closest yellow-flowered partners and aridus, between each of the red-flowered taxa, and between each of the red taxa and their most closely related yellow partners.
Haplotype Network
We used VCFx version: 2.0.6b to generate a FASTA file of the entire MaMyb2 gene (positions 12,317,113 to 12,318,500 on chromosome 4) from the VCF file from all 47 samples. This FASTA file was then used to construct a haplotype network from all recovered haplotypes based on an infinite site model and uncorrected distances using Pegas version 1.1 (Paradis 2010) in R version 3.6.3.
Extended Haplotype Homozygosity
We calculated the site-specific extended haplotype homozygosity statistic for a SNP in the middle of the MaMyb2 gene (position 12,317,808) using rehh version 3.2.2 (Gautier & Vitalis 2012). Separate VCFs were created that contained samples from the red ecotype, the yellow ecotype, red OC, or red longiflorus. Haplotype and marker information was extracted from each VCF file using the data2haplohh function of the rehh package. EHH was calculated from the marker and haplotype information for each of the taxa using the calc_ehh function in rehh.