Data from: Fluctuating reproductive isolation and stable ancestry structure in a fine-scaled mosaic of hybridizing Mimulus monkeyflowers
Data files
Mar 11, 2025 version files 10.50 GB
- Allsamples_data.txt (488.88 KB)
- ancestrycalls.2019_2021.txt (1.01 GB)
- ancestrycalls.2022.txt (454.75 MB)
- ancestrycalls.referencepanel.txt (19.16 MB)
- Markedfruits_offspring.txt (132.59 KB)
- Markedfruits.txt (24.31 KB)
- plot_phenology_census.txt (34.56 KB)
- readcounts_bamcoverage.txt (243.39 KB)
- README.md (14.12 KB)
- referencepanel_fullvariantlist.vcf.gz (9 GB)
- selfing_estimation_offspring.txt (135.23 KB)
- sibship_estimation.txt (435.17 KB)
- SNPlist_ancestryinformative.txt (3 MB)
- SNPlist_for_borice.txt (252.80 KB)
- SNPlist_panel100_restricted.txt (361.52 KB)
- testpanel_samples.txt (1.46 KB)
Abstract
Hybridization among taxa impacts a variety of evolutionary processes from adaptation to extinction. We seek to understand both patterns of hybridization across taxa and the evolutionary and ecological forces driving those patterns. To this end, we use whole-genome low-coverage sequencing of 458 wild-grown and 1565 offspring individuals to characterize the structure, stability, and mating dynamics of admixed populations of Mimulus guttatus and Mimulus nasutus across a decade of sampling. In three streams, admixed genomes are common and a M. nasutus organellar haplotype is fixed in M. guttatus, but new hybridization events are rare. Admixture is strongly unidirectional, but each stream has a unique distribution of ancestry proportions. In one stream, three distinct cohorts of admixed ancestry are spatially structured at ~20-50m resolution and stable across years. Mating system provides almost complete isolation of M. nasutus from both M. guttatus and admixed cohorts, and is a partial barrier between admixed and M. guttatus cohorts. Isolation due to phenology is near-complete between M. guttatus and M. nasutus. Phenological isolation is a strong barrier in some years between admixed and M. guttatus cohorts, but a much weaker barrier in other years, providing a potential bridge for gene flow. These fluctuations are associated with differences in water availability across years, supporting a role for climate in mediating the strength of reproductive isolation. Together, mating system and phenology accurately predict fluctuations in assortative mating across years, which we estimate directly using paired maternal and offspring genotypes. Climate-driven fluctuations in reproductive isolation may promote the longer-term stability of a complex mosaic of hybrid ancestry, preventing either complete isolation or complete collapse of species barriers.
This dataset provides sample-level and plot-level metadata and processed data related to the analysis of admixture structure and reproductive isolation across space and time within a hybridizing population of Mimulus guttatus and Mimulus nasutus monkeyflowers.
Plot-level observations and sample collections took place during the 2019, 2021, and 2022 flowering seasons (April to June) at two sites near the Columbia River Gorge, Washington, USA, named Catherine Creek (CAC) and Little Maui (LM). Tissue was collected from wild samples for sequencing. For some wild samples, offspring seeds were also collected and germinated in the UGA Botany greenhouses for tissue collection and sequencing. Field data collection and sampling were carried out by Andrea Sweigart and Keith Karoly. Offspring processing, tissue preparation, sequencing, and analysis were done primarily by Matthew Farnitano.
Ancestry and genetic structure information, some of which is summarized here, was obtained using raw Illumina sequencing data (separately archived at the Sequence Read Archive). Sequencing was performed at the Duke University Center for Genomic and Computational Biology, and data were processed at the University of Georgia using the Georgia Advanced Computing Resource Center cluster. Data processing details are provided in the associated manuscript, and the relevant code is provided at https://github.com/mfarnitano/CAC_popgen.
Description of the data and file structure
Files included in this repository:
- Allsamples_metadata.txt
- ancestrycalls.referencepanel.txt
- ancestrycalls.2019_2021.txt
- ancestrycalls.2022.txt
- Markedfruits_offspring.txt
- Markedfruits.txt
- plot_phenology_census.txt
- PRISM_temp_precip_2010-2022_raw.txt
- readcounts_bamcoverage.txt
- referencepanel_fullvariantlist.vcf.gz
- selfing_estimation_offspring.txt
- sibship_estimation.txt
- SNPlist_ancestryinformative.txt
- SNPlist_for_borice.txt
- SNPlist_panel100_restricted.txt
- testpanel_samples.txt
File descriptions:
'Allsamples_metadata.txt' includes metadata for every sequenced sample used in the study, including samples that were filtered out for low coverage or quality. Included in this file are summary statistics from genomic ancestry inference and genetic population structure analyses. All samples are either wild, maternal plants (isMom==TRUE) or offspring grown from seed collected from the maternal plant (isMom==FALSE).
Variables included in this file:
- sampleID: a unique sample identifier
- stream: from which of three streams this sample (or its maternal plant) was collected, CAC_stream1 (CAC_S1), CAC_stream2 (CAC_S2), or LM
- Year: the collection year of the sample (or its maternal parent for offspring samples)
- plot: the original name of the 0.5x0.5m plot where this sample was collected. Each collection year has distinct plot names
- plot_renamed: plots from within ~10m sampled in the same year or different years were combined into a single plot ID for simplicity and to facilitate comparisons across years.
- momfullID: unique identifier of each maternal family, shared by the maternal plant and its sequenced offspring
- fruitID: for offspring only, identifier of the marked fruit from which the offspring was collected. Multiple fruits were collected from some maternal plants. NA for maternal plants.
- offspringID: for offspring only, identifier of each individual offspring within a fruit. NA for maternal plants.
- isMom: boolean value, TRUE for wild-growing maternal plants and FALSE for their greenhouse-grown offspring. Note that wild-growing plants with no sequenced offspring still have isMom==TRUE.
- nas_prop: proportion of the genome with M. nasutus ancestry, aka hybrid index, obtained as a genome-wide average from local ancestry inference conducted using Ancestry_HMM. NA if sample has zero markers with called ancestry.
- heterozygosity: the proportion of the genome with heterozygous local ancestry (M. nasutus and M. guttatus), obtained as a genome-wide average from local ancestry inference conducted using Ancestry_HMM. NA if sample had zero markers with called ancestry.
- total: the total number of called alleles (2*called markers) out of 208,560 total markers. Used as a filter for sample quality.
- cohort: the ancestry group, defined by hybrid index from Ancestry_HMM, that the sample belongs to. <0.15 = guttatus, >0.85 = nasutus, 0.15-0.85 = admixed, except samples identified as sookensis. NA if sample has zero markers with called ancestry.
- passed_filter: TRUE if total>=50,000, equivalent to at least 25,000 called ancestry markers.
- passed_strictfamilyfilter: strict filter to identify samples for use in BORICE selfing estimation. To pass, must have sequenced both maternal and offspring samples, passed the marker filter above, and have <5% estimated joint ancestry error for a maternal-offspring pair (see manuscript methods).
- K1_prop: proportion ancestry assignment to group 1 in a NGSadmix admixture analysis with two groups. Group 1 corresponds to M. nasutus ancestry. Only maternal samples passing filter were used in NGSadmix; other samples (including all offspring) have NA.
- PC1, PC2, PC3, PC4, and PC5: coordinates in a PCAngsd genomic PCA analysis for the first five PC axes. Only maternal samples passing filter were used in the PCA; other samples (including all offspring) have NA.
- PCA_cluster: assignment to one of six clusters based on PCA coordinates: NAS, LM, CAC-A, CAC-B, CAC-C, or SOOK. Only maternal samples passing filter were used in the PCA; other samples (including all offspring) have NA.
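The cohort and passed_filter definitions above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the code used to produce the file: it reads the 0.85 threshold as "greater than 0.85", treats values of exactly 0.15 or 0.85 as admixed, and does not attempt to identify sookensis samples, which were assigned separately. The function names are hypothetical.

```python
import math

GUTTATUS_MAX = 0.15   # hybrid index below this value -> guttatus
NASUTUS_MIN = 0.85    # hybrid index above this value -> nasutus (assumed ">")
MIN_TOTAL = 50_000    # called alleles needed to pass the sample filter


def assign_cohort(nas_prop):
    """Return the ancestry cohort for a hybrid index, or None if it is missing.

    Samples identified as sookensis are NOT handled here.
    """
    if nas_prop is None or math.isnan(nas_prop):
        return None
    if nas_prop < GUTTATUS_MAX:
        return "guttatus"
    if nas_prop > NASUTUS_MIN:
        return "nasutus"
    return "admixed"


def passed_filter(total):
    """True if the sample has at least 50,000 called alleles (25,000 markers)."""
    return total >= MIN_TOTAL
```

For example, assign_cohort(0.5) returns "admixed", and passed_filter(49999) returns False.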
‘ancestrycalls.referencepanel.txt’, ‘ancestrycalls.2019_2021.txt’, and ‘ancestrycalls.2022.txt’ together contain the ancestry outputs at each of 208,560 ancestry-informative markers for all samples. Each row is a marker, and each column is a sample. Values are 0, 1, 2, or NA, corresponding to homozygous guttatus, heterozygous, homozygous nasutus, or missing. Calls with a posterior probability below 0.9 were set to missing (NA).
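As a rough guide to how the 0/1/2/NA coding relates to the per-sample summaries in the metadata file, the sketch below computes an unweighted hybrid index, ancestry heterozygosity, and called-allele count from one sample's column of calls. It is an approximation under stated assumptions: the published nas_prop and heterozygosity values come from Ancestry_HMM and may not match this simple average exactly, and the function names are hypothetical.

```python
def parse_call(field):
    """Convert one text field ('0', '1', '2', or 'NA') to an int or None."""
    return None if field == "NA" else int(field)


def summarize_calls(calls):
    """Summarize one sample's ancestry calls (0, 1, 2, or None for missing).

    Each call counts nasutus alleles at a marker, so the hybrid index is
    the sum of calls divided by the number of called alleles.
    """
    called = [c for c in calls if c is not None]
    n_markers = len(called)
    if n_markers == 0:
        return {"nas_prop": None, "heterozygosity": None, "total": 0}
    return {
        "nas_prop": sum(called) / (2 * n_markers),
        "heterozygosity": sum(1 for c in called if c == 1) / n_markers,
        "total": 2 * n_markers,
    }


# Five markers, one of them missing.
example = [parse_call(f) for f in ["0", "1", "2", "NA", "1"]]
summary = summarize_calls(example)
```

In this example, four markers are called, giving total = 8, nas_prop = 0.5, and heterozygosity = 0.5.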
‘Markedfruits_offspring.txt’ has a line of information for each sequenced offspring from a fruit that was marked in the field (i.e., the flowering date is known), which includes samples from 2019 and 2022 only. The date the marked flower was open, counting from April 1 = 1, is listed as an integer under Days_start_Apr1. Offspring from fruits without a known flowering date have 'NA' for Days_start_Apr1. Other variables match those in ‘Allsamples_metadata.txt.’
‘Markedfruits.txt’ has a line of information for each fruit that was marked in the field (i.e., the flowering date is known), which includes samples from 2019 and 2022 only. Variables match those in ‘Allsamples_metadata.txt’ for the corresponding maternal plant. Additional variables are listed below:
- mom_sampleID: unique identifier of the maternal plant
- Days_start_Apr1: the date the flower was marked as open, as an integer counting from April 1 = 1
- mom_nas_prop: the hybrid index (nas_prop) for the maternal plant from Ancestry_HMM
- mom_het: the ancestry heterozygosity (heterozygosity) for the maternal plant from Ancestry_HMM
- mom_total: number of alleles with called ancestry (total) for the maternal plant from Ancestry_HMM
- fruitmean_nas_prop: mean hybrid index (nas_prop) of offspring (only those passing the genotype filter) within this fruit
- fruitmean_het: mean ancestry heterozygosity (heterozygosity) of offspring (only those passing the genotype filter) within this fruit
- momfruit_sway: the difference in hybrid index between the mom and the mean of offspring, equal to fruitmean_nas_prop - mom_nas_prop
- momfruit_hetsway: the difference in ancestry heterozygosity between the mom and the mean of offspring, equal to fruitmean_het - mom_het
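The two "sway" variables are simple differences between a fruit's offspring mean and the maternal value. A minimal sketch of that arithmetic follows; the function name is hypothetical, and it assumes that offspring failing the genotype filter have already been removed or are passed as None.

```python
def mom_offspring_sway(mom_value, offspring_values):
    """Return (fruit mean, fruit mean - maternal value) for one fruit.

    Works for hybrid index (momfruit_sway) or ancestry heterozygosity
    (momfruit_hetsway). Returns (None, None) if no offspring have values.
    """
    values = [v for v in offspring_values if v is not None]
    if not values:
        return None, None
    fruit_mean = sum(values) / len(values)
    return fruit_mean, fruit_mean - mom_value


# A mother with hybrid index 0.25 and two offspring from one fruit.
fruitmean_nas_prop, momfruit_sway = mom_offspring_sway(0.25, [0.5, 0.75])
```

A positive sway means the offspring from that fruit carry, on average, more M. nasutus ancestry (or more ancestry heterozygosity) than their mother.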
‘plot_phenology_census.txt’ contains counts of open flowers for each 0.5x0.5m sampling plot for every census date (typically once every few days to once per week) during the flowering seasons of 2012, 2019, and 2022. Variables included:
- Date: the date of the census in format M/DD/YY
- Year: the year of the census in format YYYY
- Days_start_Apr1: integer date starting at April 1 = 1
- plot: identifier of the 0.5x0.5m plot, each year has unique plot names
- open_flowers: the number of open flowers counted on that census date in that plot. If the plot was not checked, a value of NA is given. If no flowers are open, a value of 0 or -1 is given. A value of 0 is given for the last census date before open flowers were counted, and for the first census date after open flowers were counted; all other dates where no open flowers were counted are given a value of -1.
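Because open_flowers uses both 0 and -1 for dates with no open flowers, the column usually needs recoding before counts are summed or plotted. The sketch below shows one way to do that, along with the Days_start_Apr1 convention (April 1 = 1). The function names are hypothetical, and the fields are assumed to arrive as text exactly as described above.

```python
from datetime import date, datetime


def parse_census_date(text):
    """Parse a census date written as M/DD/YY, e.g. '4/15/19'."""
    return datetime.strptime(text, "%m/%d/%y").date()


def days_from_april1(d):
    """Convert a calendar date to the Days_start_Apr1 scale (April 1 = 1)."""
    return (d - date(d.year, 4, 1)).days + 1


def flower_count(field):
    """Convert an open_flowers field to a count.

    'NA' (plot not checked) becomes None; both 0 and -1 mean no open flowers.
    """
    if field == "NA":
        return None
    value = int(field)
    return 0 if value == -1 else value
```

If the distinction between 0 and -1 matters for an analysis (for example, to find the census dates bracketing the flowering period), keep the original column alongside the recoded one.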
‘readcounts_bamcoverage.txt’ contains a summary of raw and aligned read coverage for each sequenced sample. Variables included:
- sampleID: unique identifier for the sample
- n_reads_mapped: the number of records in the aligned .bam file, calculated with qualimap
- mean_coverage: the mean coverage across all bases of the reference genome, calculated with qualimap
- std_coverage: the standard deviation of coverage across all bases, calculated with qualimap
- coverage1X, coverage2X, coverage3X, coverage5X, coverage10X: the proportion of reference genome bases covered by at least 1, 2, 3, 5, or 10 aligned reads, respectively.
- Nreadpairs: the raw number of read pairs in the original unaligned fastq.gz file
‘referencepanel_fullvariantlist.vcf.gz’ is a variant call file in gzipped VCF format, containing genotype calls for 38 high-coverage reference individuals of M. guttatus and M. nasutus. Samples were aligned to the Mimulus guttatus var. IM62 v3 reference genome (https://phytozome-next.jgi.doe.gov). Two of these lines were excluded for poor quality, and the remaining 36 were used to create the relevant SNP lists for further analysis. This file can be viewed using the bcftools utilities (https://samtools.github.io/bcftools/) or other open-source tools for reading VCF files.
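For a quick look at which reference individuals are in the file without installing extra tools, the sample names can be read from the VCF header using only the Python standard library. This is a minimal sketch that relies on the standard VCF layout (nine fixed columns on the #CHROM line, followed by one column per sample); the function names are hypothetical. Only the header is read, so it returns quickly even though the file is large.

```python
import gzip


def sample_names_from_lines(lines):
    """Return sample names from the #CHROM header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed VCF fields; samples start at column 10.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached the variant records without finding a header
    return []


def vcf_sample_names(path):
    """Open a gzipped VCF and return its sample names."""
    with gzip.open(path, "rt") as handle:
        return sample_names_from_lines(handle)
```

Calling vcf_sample_names("referencepanel_fullvariantlist.vcf.gz") should return one name per reference individual in the file.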
‘selfing_estimation_offspring.txt’ contains outputs from BORICE estimation of selfing rates for each CAC offspring that passed the strict family filter. Variables match those in ‘Allsamples_metadata.txt’ above, with these additional variables:
- out: the number of MCMC runs supporting this offspring as outcrossed
- self: the number of MCMC runs supporting this offspring as selfed
- prob.self: posterior probability that this offspring is selfed, equal to (self)/(out+self)
- selfcall: assignment of the offspring as selfed or outcrossed, using a 0.9/0.1 posterior probability cutoff. Samples with intermediate posterior probabilities are set to NA.
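The relationship between out, self, prob.self, and selfcall can be written out as a short sketch. The thresholds come from the description above, but two details are assumptions: the cutoffs are treated as inclusive (>= 0.9 and <= 0.1), and the labels "selfed" and "outcrossed" are placeholders that may differ from the strings used in the file.

```python
def prob_self(out, self_count):
    """Posterior probability of selfing: self / (out + self)."""
    total = out + self_count
    if total == 0:
        return None
    return self_count / total


def self_call(p, upper=0.9, lower=0.1):
    """Assign a call from the posterior probability, or None if intermediate."""
    if p is None:
        return None
    if p >= upper:
        return "selfed"       # placeholder label
    if p <= lower:
        return "outcrossed"   # placeholder label
    return None


# 95 of 100 MCMC runs support this offspring as selfed.
call = self_call(prob_self(out=5, self_count=95))
```

Offspring with intermediate probabilities, such as an even 50/50 split of runs, receive no call under this scheme.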
‘sibship_estimation.txt’ contains outputs from BORICE estimation of sibship (full vs. half siblings for offspring of the same maternal plant). Each line corresponds to a pair of outcrossed offspring from the same maternal parent. Variables included:
- family: integer ID given to each maternal family, unique within each Year
- off1: integer ID given to the first offspring in the pair, unique within each maternal family
- off2: integer ID given to the second offspring in the pair, unique within each maternal family
- full: the number of MCMC runs supporting this pair as full-siblings
- half: the number of MCMC runs supporting this pair as half-siblings
- sibcall: "full" or "half" or NA, assignment of the pair as full-siblings or half-siblings or undetermined
- Year: collection year of the maternal family
- momfullID: unique identifier of the maternal family
- offA_fruitID: identifier of the maternal fruit from which the first offspring in the pair was taken
- offA_sampleID: sampleID of the first offspring in the pair
- offB_fruitID: identifier of the maternal fruit from which the second offspring in the pair was taken
- offB_sampleID: sampleID of the second offspring in the pair
- samefruit: TRUE if the pair was taken from the same fruit, FALSE otherwise
‘SNPlist_ancestryinformative.txt’ contains a list of markers used as ancestry informative sites for Ancestry_HMM, defined to have >=80% frequency difference between allopatric M. nasutus and M. guttatus. See manuscript methods for more details on creation of this panel. Each line is the chromosome and position in the Mimulus guttatus var. IM62 v3 reference genome (https://phytozome-next.jgi.doe.gov).
‘SNPlist_for_borice.txt’ contains a reduced list of markers used for BORICE estimation of selfing rates. See manuscript methods for more details on creation of this panel. Each line is the chromosome and position in the Mimulus guttatus var. IM62 v3 reference genome.
‘SNPlist_panel100_restricted.txt’ contains a reduced list of markers used for population structure analysis in ANGSD. This panel was created by testing a trial set of 100 CAC and LM low-coverage samples for >=60% sample coverage and >=20% minor allele frequency. Each line is the chromosome and position in the Mimulus guttatus var. IM62 v3 reference genome.
‘testpanel_samples.txt’ contains a list of sample IDs for the 100 samples used in the trial set to create ‘SNPlist_panel100_restricted.txt’.
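The marker-selection thresholds described for the ancestry-informative and panel100 SNP lists can be expressed as two small predicates. This sketch only illustrates the stated cutoffs; the actual panels were built with the pipeline described in the manuscript methods, which applies additional steps not shown here. All function and parameter names are hypothetical.

```python
def is_ancestry_informative(freq_nasutus, freq_guttatus, min_diff=0.8):
    """True if allele frequencies in the allopatric reference groups differ by >= 80%."""
    return abs(freq_nasutus - freq_guttatus) >= min_diff


def passes_panel100(n_covered, n_samples, minor_allele_freq,
                    min_coverage=0.6, min_maf=0.2):
    """True if a site meets the trial-set thresholds.

    n_covered / n_samples is the fraction of trial samples with coverage
    at the site (>= 60%); minor_allele_freq must be >= 20%.
    """
    return (n_covered / n_samples >= min_coverage
            and minor_allele_freq >= min_maf)
```

For example, a site with allele frequencies of 0.95 in M. nasutus and 0.05 in M. guttatus qualifies as ancestry informative, while a site covered in only 59 of the 100 trial samples fails the panel100 coverage threshold.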
Sharing/Access information
These data are associated with the following manuscript, accepted for publication at PLOS Genetics:
Matthew C. Farnitano, Keith Karoly, and Andrea L. Sweigart. 2025. Fluctuating reproductive isolation and stable ancestry structure in a fine-scaled mosaic of hybridizing Mimulus monkeyflowers. PLOS Genetics, Accepted manuscript. https://doi.org/10.1371/journal.pgen.1011624
An earlier version of this manuscript is available as a preprint on bioRxiv at https://doi.org/10.1101/2024.09.18.613726
Corresponding Author:
Matthew C. Farnitano
mattfarnitano@gmail.com
For full collection and processing details, see the associated manuscript and the analysis code available at https://github.com/mfarnitano/CAC_popgen.
Along three streams in the Columbia River Gorge area, Washington, USA, 0.5x0.5m square plots were laid down where Mimulus species were growing, during the flowering seasons (April through June) of 2012, 2019, 2021, and 2022. Not all streams were sampled every year; for details, see the associated manuscript.
At each plot in 2012, 2019, and 2022, phenological data were recorded: the total number of open flowers within each plot was counted every few days for the duration of the flowering season.
In addition, leaf or bud tissue was collected for sequencing from wild samples of Mimulus guttatus, Mimulus nasutus, and admixed individuals, as well as the allopolyploid species Mimulus sookensis, growing in these streams during the years 2019, 2021, and 2022. For a subset of these wild samples in 2019 and 2022, one or two flowers were marked on the date they opened and the flowering date was recorded; later, the fruits resulting from these marked flowers were collected. Seeds from these fruits were germinated in greenhouse conditions and tissue was collected for sequencing. Note that 2012 samples were collected and presented as part of a previous study (see manuscript for details).
DNA was extracted from both wild (maternal) and greenhouse (offspring) tissue samples using a CTAB extraction protocol. Samples were sequenced using a low-coverage tagmentation library prep on an Illumina sequencer, in three sequencing batches.
Illumina sequence reads were aligned to the Mimulus guttatus IM62_v3 reference genome and processed according to the details in the associated manuscript. Ancestry proportions (hybrid indices) were obtained using a local ancestry hidden Markov model (Ancestry_HMM). A PCA was conducted using genotype likelihoods from ANGSD and the PCAngsd program. Selfing and sibship estimation were conducted using BORICE.
