Data from: Population genomic signatures of founding events in autonomously self-fertilizing plants: A test with Impatiens capensis
Data files
Jul 21, 2025 version files 1.45 GB
-
dadi_DRYAD.zip
33.60 KB
-
F_DRYAD.zip
2.12 MB
-
Fastq_DRYAD.zip
1.43 GB
-
LD_DRYAD.zip
3.36 MB
-
README.docx
30.69 KB
-
README.md
8.25 KB
-
RZooROH_DRYAD.zip
1.48 MB
-
Spatial_DRYAD.zip
8.77 MB
Dec 22, 2025 version files 2.43 GB
-
Create_Bam_and_vcf_files_DRYAD.zip
1.06 GB
-
dadi_DRYAD.zip
51.26 KB
-
F_Dryad.zip
640.66 KB
-
Figure1_files.zip
1.34 GB
-
LD_DRYAD.zip
18.71 MB
-
Pi_and_Fst.zip
2.89 MB
-
README.docx
29.78 KB
-
README.md
9.66 KB
-
RZooROH_DRYAD.zip
2.75 MB
-
Spatial_Dryad.zip
27.62 KB
Abstract
Autonomously self-fertilising plants possess disproportionate abilities to found populations. Viewed from the metapopulation perspective, founding events should be frequent in such plants, but the intensity and timing of bottlenecks and recovery should vary among populations. We tested the hypothesis that variation in these demographic characteristics in one such species helps to explain variation in levels of genetic diversity and population genomic signatures of inbreeding, relative recombination, and microscale spatial genetic structure. We used reduced-representation sequence data from eleven populations of the dimorphic cleistogamous species Impatiens capensis, a species that has figured prominently in evolutionary studies. The populations occur in a landscape where suitable habitat is fragmented. Population genomic analyses revealed significant among-population variation in demographic history, genetic diversity, inbreeding, relative recombination rate, tracts of homozygosity by descent, and spatial autocorrelation of genotypes at micro-geographic scales. Our findings support the hypothesis that variation in the intensity of bottlenecks and length of the post-bottleneck recovery phase in autonomously self-fertilising plants lead to variation in genetic diversity and a suite of associated population genomic signatures of inbreeding. We suggest that these findings have consequences for understanding evolutionary processes and guiding conservation strategies in fragmented habitats for dimorphic
README
A. Directory: Create_Bam_and_vcf_files_DRYAD.zip. This directory contains files used to convert the Impatiens capensis fastq files in NCBI Bioproject Accession number PRJNA524160 (SRP186790) to Bam and vcf files.
File: Snakefile. BAM_files were created with this code starting from the fastq files using the software Snakemake
Subdirectory: Impatiens_reference_genome. This subdirectory contains the fasta file for the Impatiens capensis genome (hirise.fa) along with associated index files created with BWA -index. The associated files are (hirise.fa.*)
Subdirectory: STACKS. This subdirectory contains the files associated with the use of STACKS for producing vcf files:
File: gstacks.log. Contains information on the run conditions for gstacks
File: popmap270.txt. Maps each BAM file to one of the 11 populations studied.
File: pops_Extract_11_vcftools is the script used run VCFtools v. 0.1.13 to separate the data in the large (multipopulation) vcf files into single population VCF files.
File: populations _one_snp_per_locus.log. Contains information on the run conditions for STACK-populations for the run with one SNP per locus.
File: populations.log. Contains information on the run conditions for STACK-populations for the run with multiple SNPs per locus
File: populations.samples.fa contains locus consensus sequences used for calculating pi
File: populations.snps.vcf is the vcf file produced by STACK-populations for the run with one SNP per locus
File: SRARunTable.csv. Sample metadata for sequenced individuals downloaded from NCBI Sequence Read Archive for BioProject PRJNA524160 (SRP186790).
File: gstacks.log. Contains information on the run conditions for gstacks
B. Directory: dadi_DRYAD.zip. This directory contains the code used to create site frequency spectrum (SFS) files, the SFS files themselves, and the code for running dadi to examine support for different demographic models, and to test and graph the relationship between estimates of demographic parameters and population genetic metrics.
Subdirectory: Code_for_dadi
File: dadi_MASK.py was used to examine models of demographic history and run with
File: easySFS.py was used to create folded (and masked) site frequency spectra from
File: run_dadi_MASK.txt was used to run dadi_MASK.py
File: run_easySFS.sh was used to run easySFS.py
Subdirectory: Correlations_with_demography_(FigS4). This subdirectory contains a csv file (final_metrics_for_plots.csv) and R code (make_suppmat_fig_S4.R) to create Fig .S4 in the text.
Subdirectory: SFS. This subdirectory contains the site frequency spectra files read by dadi_MASK.py. There are 11 of these files coded as xxxx-48.sfs where xxxx is the population ID
C. Directory: Pi_and_Fst.zip. This directory contains code for estimating FST and pi.
This folder contains the derived data files generated by scripts
Files: calc_FST_and_fixed_SNPs.R and calc_pi.R are used for calculating FST and pi. Note: the script calc_pi.R requires some custom functions that are stored inside parsing.R and stats.R.’
Subdirectory: pi_output. The derived files are:
File: bpstats.0.75.STACKS2.52_p11_NO_write-single-snps_BPstats.Robj. distribution of number of base pairs co-genotyped between each pair of sequenced individuals (bpstats.0.75.STACKS2.52_p11_NO_write-single-snps_BPstats.Robj)
File: FST_Nei_STACKS2.52_p11_NO_write-single-snps.csv. A matrix of mean pairwise FST between each pair of populations
File: gt.0.75.STACKS2.52_p11_NO_write-single-snps_gt.Robj. a derived matrix of individuals x genotypes\
File: pistats.STACKS2.52_p11_NO_write-single-snps_pistats.Robj matrix of pairwise pi between each pair of sequenced individuals
File: pi_per_pop-STACKS2.52_p11_NO_write-single-snps.csv mean pi for each population
D. Directory: LD_DRYAD.zip. This directory contains code for producing heatmaps of linkake disequilibrium (LD), and for examining the relationship between LD (r2 measure) and distance between SNPs, and estimating recombination from this relationship.
Subdirectory: LD_heatmaps. This subdirectory contains the R program LDheatmap.R which was used to create LD heatmaps for Scaffold_1 from vcf files
Subdirectory: Plink_LD. This subdirectory contains vcf files for each population (popxx.recode.vcf), where popxx is a label for each population.
This subdirectory also contains the script PlinkLD.sh used to create the files popxx.ld.ld (where popxx is a wild card for each of the 11 populations) and the associated Plink log files popxx.ld.log. The popxx.ld.ld files contain the information required to plot linkage disequilibrium (as r^2) against distance between SNPs in bp, which in turn is used to estimate recombination (see the R program LDdecayplots.R).
E. Directory: RZooROH_DRYAD.zip. This directory contains code for examining runs for homozygosity by descent (HBD).
Subdirectory: Code_for_HBDs. RZooROH.R is the program that uses the R package RZooROH to estimate the distribution of HBD lengths for all populations (e.g., reads in all files such as genoKR.txt, etc.) and produced the HBD length distribution histograms). It also produces a table of BIC values. PlotROH.py is a Python program used to create histograms of HBDs. HBD_metrics.py is a Python program used to calculate summary stats for HBD segment lengths. HBD_bootstrapping.py is a Python program used to bootstrap HBD segment data and create 95% CIs for summary stats.
Subdirectory: Files_for_RZooROH. This subdirectory contains the program convert_vcf_to_rzooroh_clean_sorted.py, which converts vcf files to a format readable by RZooROH. This program produces separate files for each population. The script format_for_ROH.sh automates the running of the script, files are produced that are ready to be read by RZooROH, and are labelled genopop.txt, where pop is a wildcard for each population label.
F. Directory: F_Dryad.zip. This directory contains code for estimating the inbreeding coefficient in popualations and also contains information on chasmogamous and cleistogamous flower production.
Subdirectory: Plink_-het_code_and_output. This subdirectory contains the plink_het.sh script used to create observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates. Files ending in .het are the output (to create observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates) of running plink_het.sh. Files ending in .log give run information for each population from executing plink_het.sh (e.g., number of individuals, number of variants, genotyping rate).
Subdirectory: F(IS)_by_population. In this subdirectory are filles ending in .xlsx contain the same information as the .het files (in the subdirectory: Plink_-het_code_and_output), but in Excel format. These were created by the python program convert_Plink_F_output_to_Excel.py. The program Process_F_values_in _Excel_files_for_histogram_Fig2.R was used to create histograms and summaries of the method-of-moments F coefficient estimates for each population. The program bootstrap_F.py was used to calculate 95% CI’s of summary stats for F.
File: logCH/CL.csv. This file contains information on flower ratios from Toczydlowski and Waller (2022).
G. Directory: Spatial_Dryad.zip. This directory contains code and data for examining spatial autocorrelation of genotypes
File: AutoCorr.R This file is the R script for calculating the spatial autocorrelations (Smouse and Peakall’s r) from the genetic distance matrices and spatial coordinate information.
Subdirectory: Plink_genetic_distances. This subdirectory contains the output of running the script Plink_script_for_mdist.sh to automate the use of Plink for creating pairwise genetic distance matrices from the vcf files used along with the coordinate information when analyzing spatial autocorrelation of genotypes. The output files of pairwise genetic distances have the suffix are plinkpopxx.mdist, where popxx is the population label
Subdirectory: Spatial_Coordinates. This subdirectory contains the files with spatial coordinates: coordpop.csv where pop is a wildcard for the populations surveyed. The first column in each file is the plant number, while the remaining two columns are the x and y coordinates in cm. The files pop_file.txt, where pop is a wildcard for the population contain the order of the labels associated the genomic data of individual numbered in the coordpop.csv files.
H. Directory: Figure1_files.zip. This directory contains files needed for the production of Fig. 1 (distribution map).
File: build_sample_location_figure.R is the R code used to create the figure.
Subdirectory: wiscland contains metadata Wiscland 2-Level 1 .docx, Wiscland 2-Level 1.xml through to Wiscland 2-Level 41 .docx, Wiscland 2-Level 4.xml and plotfiles Wiscland2_L2_42x48.pdf and Wiscland2_L3_42x48.pdf, userguide, wiscland2_dataset consisting of level1 and level4 subdirectories with map information files, and readme_for_wiscland2.pdf.
File: README.docx. A word version of this README.
Reference genome
We assembled the Impatiens capensis reference genome (696 Mb) from DNA extracted from a single, inbred plant located in Glen Sutton, Quebec, Canada. The reference genome is principally made up of ten chromosome-sized scaffolds that encompass over 95 % of the total estimated genome, with a complete BUSCO of 99.2 %. Schoen & Speed (2024) provide additional details of the reference genome.
Population sampling and genotyping of plants
The eleven populations of Impatiens capensis studied occur in remnant floodplain forests and marshes in the US state of Wisconsin as described by Toczydlowski & Waller (2019; Fig. 1). This portion of the native range of I. capensis was covered by the Laurentide ice sheet, which receded from these sites 10,000 to 7,500 years ago (Clayton et al., 2001). Impatiens capensis, like other lowland colonizers, inhabits dynamic environments characterized by frequent flooding disturbance, which may lead to frequent natural population turnover (Menges & Waller, 1983; Johnson et al., 2014). Land conversion, the height of which occurred between 1850 and 1930, and reductions in flooding due to damming have since reduced suitable habitat and led to notable declines in lowland plant species in this region (Waller & Rooney, 2008; Johnson et al., 2016). The sites studied here vary in patch size and degree of habitat fragmentation, but all are relatively small and embedded within a matrix of industrial-scale agriculture and urban development (Toczydlowski & Waller, 2019). These sites span a range of environmental conditions from marshes in full sun to closed canopy mixed conifer swamps. Additional details on site and morphologic characteristics for the populations are available in Toczydlowski & Waller (2019; 2022).
Depending upon the population, 24 or 25 individual plants were sampled at random from within 1.5 x 10 m plots in each population for sequencing. The individual spatial coordinates of the sampled plants within each plot were recorded (to 5 cm resolution) in six of the populations. We used these to examine the degree of intrapopulation spatial autocorrelation of genotypes (see below). Sequence data were obtained from leaf tissue using a genotype-by-sequencing (GBS) protocol and 1 x 100 bp sequencing on an Illumina HiSeq 2000. Toczydlowski & Waller (2019) provide full details of plant sampling, tissue storage, DNA extraction, and sequencing.
We used the FASTQ files described in Toczydlowski & Waller (2019) and archived under NCBI BioProject PRJNA524160 (SRP186790) as the starting point for our analyses. We ran Trimmomatic v. 0.38 (Bolger et al., 2014) to remove adapters, trim low-quality bases, and discard reads below 36 bp. We mapped the remaining reads to the Schoen & Speed (2024) reference genome using BWA v. 0.7.17 (Li & Durbin, 2009), sorted them with samtools, and removed PCR duplicates using GATK MarkDuplicates (GATK v. 4.2.3; Van der Auwera & O’Connor, 2020). Starting with the BAM files, we automated the pipeline using Snakemake v. 9.4.0 (Mölder et al., 2021).
We used gstacks (STACKS v. 2.52; Cachen et al., 2013; Rochette et al., 2019) to filter out low-quality alignments (reads with insufficient mapping quality or extensive soft clipping), build loci, and call genotypes. We retained loci present in at least 75 % of the individuals within each population and across all 11 populations (filtering performed using STACKS-populations). We removed loci showing excessive heterozygosity (>0.5) to avoid issues arising when gene duplicates cannot be distinguished. For all analyses we used the full SNP dataset, except in the case of the spatial autocorrelation analyses, where we thinned the dataset to one SNP per locus (see below). We used a custom Python script to run VCFtools v. 0.1.13 (Danecek et al., 2011) to separate the genotype data into single population VCF files, yielding the final variant set for downstream analyses. At the end of this pipeline, we retained all 270 individuals (none dropped due to low coverage) and 32,239 SNPs.
Demographic histories of populations
We used easySFS v. 0.0.1 to obtain the folded site frequency spectra (SFS) for each population. As input for the demographic modeling software dadi v. 2.4.0 (Gutenkunst et al., 2009), we used the folded SFS, projected to 24 site frequency classes to keep all SFS the same dimensions. We tested three contrasting demographic models. These include: (1) “Constant population size”; (2) “Population size change”; (3) “Bottleneck + population size change” (Table 1). The “Constant size” model assumes a single panmictic population of constant effective size throughout its history. The single estimated parameter is the relative population size (scaled to a reference size ). The “Population size change” model assumes that the population size changed exponentially from size in the past to in the present. Two parameters are estimated in this case: , the duration (in coalescent time units, i.e., generations) over which exponential growth or decline occurred, and , the final population size relative to . The “Bottleneck + size change” model includes a period of constant-size contraction (bottleneck) followed by exponential growth or decline. The population is reduced to size for duration , then changes in size exponentially for duration to a final size .
Unlike some other simulation-based packages for the analysis of demographic history, dadi allowed us to incorporate the observed inbreeding coefficients directly into the demographic models used for parameter estimation. We used median FIS for each population as estimated by Plink (see below). By incorporating inbreeding coefficients, we have effectively informed dadi to replace the normal Hardy–Weinberg sampling step (i.e., conversion of a population allele-frequency density into an expected sample SFS by binomial sampling) with an inbred sampling rule in which the two alleles within each diploid individual are positively correlated with correlation FIS. This allows dadi to attribute heterozygote deficits (and the resulting SFS distortions) to inbreeding rather than forcing the demographic parameters to explain them. Blischak et al. (2020) showed that such inclusion of FIS in demographic reconstruction of partially inbreeding populations improves model parameter estimation accuracy. Under this framework, allele-frequency change is assumed to follow a neutral diffusion process, and departures from Hardy–Weinberg expectations attributable to inbreeding are accounted for through FIS (Gutenkunst et al., 2009). The analysis further assumes that individuals were sampled from a single population with no additional unmodeled substructure or family-level relatedness, and that the folded SFS provides an accurate summary of allele-frequency variation across independent sites (Gutenkunst et al., 2009).
We ran individual runs of dadi with optimization settings of “--reps 50 --timeout 600 --maxiter 1500 --pts 80,100,100". To help ensure that runs of dadi gave maximum likelihood estimates that did not reflect poor optimizations (e.g., local maxima), we ran the program 500 times for each population (with a different starting point for likelihood surface exploration each time) and recorded the results (log likelihoods, AIC values, and the vector of parameter estimates). We retained as the final vectors of parameter estimates, those with the highest likelihoods.
Previously published work explicitly focused on understanding population differentiation, divergence, and gene flow among these populations found them to be strikingly distinct from one another with limited to no gene flow among them (Toczydlowski & Waller, 2019). Mean pairwise FST between populations ranges from 0.12 to 0.38 using the SNP dataset for the present analyses (Fig. S1). We thus restricted our demographic inferences to models that considered single populations, one at a time, rather than multi-population models based on joint SFS models. Multi-population models allow inference of divergence and migration among populations but the number of parameters that must be simultaneously estimated to do so increases quickly (Beichman et al., 2018). Moreover, in attempting to analyse our data using such multi-population models using the relatively modest number of SNPs available with RAD-sequencing, we encountered large standard errors around all parameter estimates as well as unstable likelihoods. This is expected in likelihood estimation when model complexity (numbers of parameters) is increased beyond the scale of the data, and so we did not pursue these models further.
To account for possible non-independence among loci due to linkage disequilibrium, we estimated the standard errors of demographic parameters using a delete-m group jackknife applied to the SFS of each population. The SFS was first partitioned into non-overlapping B blocks, each representing a spectrum of ca. 50 SNPs generated with a custom script. Each block was constructed from consecutive SNPs in the population VCF file to ensure consistent sample size and projection across loci. For each population, parameter estimation was conducted using the “Bottleneck + size change” model in dadi with the inbreeding coefficient (FIS) fixed per population as described above. We then implemented a delete-m jackknife procedure in which groups of m blocks were omitted in turn and the model was then re-optimized on the remaining (B − m) blocks. This was repeated over all possible leave-out groups, where m = 25 (≈ 4 % of the data per replicate), yielding sets of parameter estimates across replicates. The method reduces downward bias in variance estimates that would result if linked SNPs were treated as independent (Loh et al., 2016).
Polymorphism levels and inbreeding coefficients
We calculated absolute π, mean pairwise heterozygosity (or the probability that a pair of alleles sampled at random from two individuals in a population will be different) using custom R scripts following Hancock et al. (2024) (i.e., one minus Equation 3). We included all variant and invariant sites. We accounted for the missingness inherent to reduced-representation sequence data by defining the denominator for π as the total number of base pairs co-genotyped for each pair of individuals. We then calculated mean absolute π for each population and globally across all populations.
We also calculated the percentage of SNPs that were fixed (invariant) within each population using a custom R script. Lastly, we calculated inbreeding coefficients using Plink v. 1.9 (Purcell et al., 2007) with the “–het” command. Plink computes observed and expected autosomal homozygous genotype counts for each individual and provides method-of-moments inbreeding coefficient estimates per individual (FIS), which measures the reduction in heterozygosity of an individual relative to its subpopulation. Because the distributions of per-individual FIS values are skewed, we used the median to characterize levels of inbreeding in each population, referred to hereafter as FIS.
Linkage disequilibrium and recombination
We calculated pairwise linkage disequilibrium (LD) for individual SNPs using the measure r² (Hill & Robertson, 1968) with Plink, evaluating SNPs separated by one million bp or less. We analyzed LD decay with distance between SNPs by fitting r² values to inter-SNP distances in base pairs. We estimated the population size-scaled recombination parameter (ρ = 4Ner, where r is the per-generation recombination rate per bp) from the theoretical LD decay curve (Weir & Hill, 1986) using nonlinear least squares with the nls function in the R Stats package (v. 4.3.2; R Core Team, 2023).
Runs of homozygosity by descent
We analyzed runs of homozygosity by descent (HBD) for each population using the R package RZooROH v. 0.3.2.1 (Druet & Gautier, 2017). RZooROH uses a hidden Markov model to identify HBD regions and outperforms Plink when marker density is low, as in the case of reduced-representation sequencing (Druet & Gautier, 2017). Lavanchy & Goudet (2023) showed via simulation that RZooROH can reliably estimate HBD parameters at densities as low as two SNPs per Mb. In our study, marker densities ranged from 9 to 14 SNPs per Mb.
To estimate the distribution of HBD segment sizes, we followed guidelines described by Druet & Gautier (2017) and tested models differing in “krates” (i.e., the transition rates between hidden states in the HMM). We adjusted the models by varying two features: the number of krates (k), representing the number of age-classes of relatedness modelled by RZooROH; and the expected segment length per class, reflecting the time since shared ancestry (longer segments = more recent inbreeding). Models with k = 3 krates gave the lowest BIC values, but within the k = 3 krates model class, the BIC values were insensitive to changes in the magnitude of the krates. For each population, we recorded: (1) the proportion of HBD class 1 segments (the longest and most recently co-inherited segments); and (2) the 95th percentile of HBD segment length (which reflects recent inbreeding). To avoid potential artifacts from low coverage, we restricted summaries to HBD segments containing at least five SNPs.
Spatial autocorrelation of genotypes within populations
For the six populations where physical coordinates for each individual genotyped plant within the 1.5 x 10 m sampling plots were known, we calculated spatial pairwise distance matrices. We used Plink to compute genetic distances (1 - identity by allelic state) between plants and recorded them as a matrix. We calculated genetic distances using a dataset that we thinned to one SNP per locus using STACKS-populations –write-single-snp to limit pseudo replication (N = 19,993 SNPs). We performed a spatial autocorrelation analysis (using the R package dartR v. 2.9.9.5 (Gruber et al., 2018), which implements the function gl.autocor as described in Smouse & Peakall (1999). We binned pairs of plants into defined distance separation classes (1.75 m, 3.5 m, 5.25 m, 7.0 m, and 8.75 m). At each of these distance classes, we calculated Smouse & Peakall’s r, a pairwise genetic autocorrelation coefficient for codominant, multilocus data. We bootstrapped the data 999 times to obtain 95 % confidence intervals for Smouse & Peakall’s r at each distance class. The graphical results of this analysis (i.e., the autocorrelation coefficient against distance class) were plotted as correlograms for each of the six populations. Where such correlograms cross from positive to negative values of the genetic autocorrelation coefficient is the approximate radius of genetic patches that reflects pollen and seed dispersal distances (Turner et al., 1982; Sokal & Wartenberg, 1983). Such patches can require several generations to become sufficiently distinct to detect (Turner et al., 1982; Sokal & Wartenberg, 1983).
Changes after Jul 21, 2025: The files have been updated to reflect changes made since the initial submission of the manuscript. They now correspond to the methods and results in the published version in New Phytologist. The title, abstract, and methods were updated as well. The README was completely revised.
