Data from: Population genomic signatures of founding events in autonomously self-fertilizing plants: An examination with an iconic cleistogamous species
Data files
Jul 21, 2025 version files 1.45 GB
-
dadi_DRYAD.zip
33.60 KB
-
F_DRYAD.zip
2.12 MB
-
Fastq_DRYAD.zip
1.43 GB
-
LD_DRYAD.zip
3.36 MB
-
README.docx
30.69 KB
-
README.md
8.25 KB
-
RZooROH_DRYAD.zip
1.48 MB
-
Spatial_DRYAD.zip
8.77 MB
Abstract
Autonomously self-fertilising organisms possess disproportionate abilities to found populations. Viewed from the metapopulation perspective, it is expected that founding events will be frequent in such organisms but that the intensity and timing of bottlenecks associated with founding will vary among populations. We tested the hypothesis that variation in these demographic characteristics will help explain variation in levels of polymorphism and population genomic signatures of inbreeding. We collected reduced-representation sequence data from eleven populations of the cleistogamous species Impatiens capensis, a species that has figured prominently in evolutionary studies. The populations occur in a landscape in which suitable habitat is highly fragmented. Population genomic analyses revealed significant among-population variation in polymorphism, levels of inbreeding, recombination rate, tracts of linkage disequilibrium and homozygosity by descent, and spatial autocorrelation of genotypes at small geographic scales. Our findings support the hypothesis that variation in the intensity and timing of bottlenecks in autonomously selfed plants will lead to variation in polymorphism and a suite of associated population genomic signatures of inbreeding. This finding has implications for our understanding of evolutionary processes in this and other self-fertilising species and has practical implications for understanding evolutionary processes and for guiding conservation strategies in fragmented habitats.
Dataset DOI: 10.5061/dryad.pg4f4qs2f
Description of the data and file structure
A. Directory: “Fastq_DRYAD.zip”
File: “Snakefile”. BAM_files were created with this code starting from the fastq files using the software “Snakemake”
Subdirectory: “STACKS_gstacks_and_populations_files”. The files in this subdirectory contain the catalog of loci created with gstacks, along with associated files created by gstacks. This subdirectory also contains the files created with STACKS-populations, including the vcf file “populations.snps.vcf” for all 11 populations combined and “populations.sumstats_summary.tsv” and “populations.sumstats.tsv”, which contain information on population diversity statistics used in Table 1 of the paper. The popmap270.txt file contains the file names that correspond to data from each of the 11 populations.
Subdirectory: “vcf_files_for_individual_populations”. This subdirectory contains the individual population vcf files created from populations.snps.vcf using vcftools. The code in file “Extract11vcftools.sh” was used to create vcf files for individual populations from the file “populations.snps.vcf”
Subdirectory: “Impatiens_reference_genome”. This subdirectory contains the fasta file for the Impatiens capensis genome (“hirise.fa”) along with associated index files created with BWA -index.
B. Directory: “LD_DRYAD.zip”
Subdirectory: “Plink_LD_decay”. This subdirectory contains the script “PlinkLD.sh” used to create the files “popxx.ld.ld” (where “popxx” is a wild card for each of the 11 populations). The popxx.ld.ld files contain the information required to plot linkage disequilibrium (as r^2) against distance between SNPs in bp, which in turn is used to estimate recombination (see the R program “LDdecayplots.R” and “LD_plot_w_CI.R”.
Subdirectory: “Plink_LDblocksize”. This subdirectory contains the script “Plink_LDblocksize.sh” estimates haplotype block sizes from vcf files for each population labeled with. The files ending in “...LDblocksize.blocks” contain the output of this script for each population.
This subdirectory also contains the program “LD_blocks_boot.py”, which was used to create 95% confidence intervals for the LD block sizes, output that was saved as “pop_LD_bootstrap.csv” where the wild card “pop” refers to each population.
File: “LD_heatmaps.R”, an R program, creates heatmaps of LD from vcf files.
C. Directory: “RZooROH_DRYAD”
Subdirectory: “Files_for_RZooROH”. This subdirectory contains the program “convert_vcf_to_rzooroh_clean_sorted.py”, which converts vcf files to a format readable by RZooROH. This program produces separate files for each population. The script “format_for_ROH.sh” automates the running of “convert_vcf_to_rzooroh_clean_sorted.py” for the separate populations. After running the script, files are produced that are ready to be read by RZooROH, and are labelled “genopop.txt”, where “pop” is a wildcard for each population label.
Subdirectory: “ROH_segments_K3”. This subdirectory contains files with information on the genomic coordinates of HBDs, their length, and their HBD class. The files are labelled as “ROH_segments_pop_K3_noLayers_letter_bestOnly.csv”, where “pop” is a wildcard for the population, and letter refers to the krates model referred to in the program “RZooRoH_ROHExport_ByPopulation_nolayers.R” (see Code subdirectory).
Subdirectory: “Code_for_HBDs”. “RZooROH.R” is the program that uses package RZooROH to estimate the distribution of HBD lengths for all populations (e.g., reads in all files such as genoKR.txt, etc.) and produced the HBD length distribution histograms). It also produces a table of BIC values. “PlotROH.py” is a Python program used to create histograms of HBDs. “HBD_metrics.py” is a Python program used to calculate summary stats for HBD segment lengths. “HBD_bootstrapping.py” is a Python program used to bootstrap HBD segment data and create 95% CIs for summary stats.
D. Directory: “F_DRYAD.zip”:
Subdirectory: “Plink_-het_code_and_output”. This subdirectory contains the “plink_het.sh” script used to create observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates. Files ending in “.het” are the output (to create observed and expected autosomal homozygous genotype counts for each sample, and reports method-of-moments F coefficient estimates) of running plink_het.sh. Files ending in “.log” give run information for each population from executing plink_het.sh (e.g., number of individuals, number of variants, genotyping rate).
Subdirectory: “F(IS)by_populaton”. In this subdirectory are files ending in “.xlsx” contain the same information as the .het files (in the subdirectory: "Plink-het_code_and_output”), but in Excel format. These were created by the python program “convert_Plink_F_output_to_Excel.py”. The program “Process_F_values_in _Excel_files_for_histogram_Fig2.R” was used to create histograms and summaries of the method-of-moments F coefficient estimates for each population. The program “bootstrap_F.py” was used to calculate 95% CI’s of summary stats for F.
File: “logCH/CL.csv”. This file contains information on flower ratios from Toczylowski and Waller (2022).
E. Directory: “Spatial_DRYAD.zip”
Subdirectory: “Spatial_Coordinates”. This subdirectory contains the files with spatial coordinates: “coordpop.csv” where “pop” is a wildcard for the populations surveyed. The first column in each file is the plant number, while the remaining two columns are the x and y coordinates in cm. The files “pop_file.txt”, where “pop” is a wildcard for the population contain the order of the labels associated the genomic data of individual numbered in the coordpop.csv files.
Subdirectory: “vcf files”. This subdirectory contains the files “poppop.recode_filtered.recode.vcf” where “pop” is a wildcard for the populations surveyed. The first column in each file. The order of individuals in the vcf files is the same as the order of individuals in the coordpop.csv files.
Subdirectory: “Plink_genetic_distances”. This subdirectory contains the output of running the script “Plink_script_for_mdist.sh” to automate the use of Plink for creating pairwise genetic distance matrices from the vcf files used along with the coordinate information when analyzing spatial autocorrelation of genotypes. The output files of pairwise genetic distances have the suffix “mdist”.
File: “AutoCorr.R” This files is the R script for calculating the spatial autocorrelations (Smouse and Peakall’s r) from the genetic distance matrices and spatial coordinate information.
F. Directory: “dadi_DRYAD.zip”
Subdirectory: “dadi_parameter_summaries”. This subdirectory contains .csv files labelled as “Model_AIC_params.csv”, where “Model” is a wildcard for the demographic model used in running dadi. Each file contains parameter estimates for the demographic models explored with dadi. These were produced by running the program “dadi_SAVED.py”.
Subdirectory: “SFS_for_dadi”. This subdirectory contains files labelled as “SFS_pop_vector.txt”, where “pop” is a wildcard for the population name. Several SFS files originally had 25 bins while some had 24 bins. So that all SFS input to dadi be based on the same number of bins, and to avoid bins with few entries, we projected the SFS vectors to 21 bins with the program “project_SFS.py”. This files labelled as “SFS_pop_n21.txt” (where “pop” is a wildcard) contain these projected SFS vectors.
Subdirectory: “Code_for_dadi”. Folded site frequency spectra were created from STACKS output using the Python program file “SFS_from_STACKS.py . The Python program “dadi_SAVED.py”, was used to run dadi with different demographic models”.
Reference genome
We assembled the Impatiens capensis reference genome (696 Mb) from DNA extracted from a single, inbred plant located in Glen Sutton, Quebec, Canada. The reference genome includes ten chromosome-sized scaffolds encompassing over 95% of the total estimated genome size, with a complete BUSCO of 99.2%. Schoen & Speed (2024) provide additional details.
Population sampling and genotyping of plants
The eleven populations of Impatiens capensis studied occur in floodplain forest remnants and marshes in the US state of Wisconsin as described by Toczydlowski & Waller (2019; Fig. 1). Prior to human settlement in the 19th and early 20th centuries, the vegetation in this region was largely continuous forest in the north, and a mixture of prairie and forest in the south. Much of it is now industrial-scale agricultural land (Rhemtulla et al., 2009). The populations span a range of environmental conditions from full sun marshes to closed canopy mixed conifer swamps and vary in patch size and degree of habitat fragmentation (Toczydlowski & Waller 2019). Additional details on site and morphologic characteristics for the populations are available in Toczydlowski & Waller (2022).
We sampled leaf tissue at random from 24 or 25 plants within 1.5 x 10 m plots in each population. We recorded the individual sampling coordinates of the sampled plants within each plot in six of the populations and later used these to examine the degree of intrapopulation spatial autocorrelation of genotypes (see below). Toczydlowski & Waller (2019) provide additional details of plant sampling, tissue storage, DNA extraction, and genotyping-by-sequencing (GBS).
We used the fastq files described in Toczydlowski & Waller (2019) and archived under NCBI BioProject PRJNA524160 (SRP186790) as the starting point for our analyses. We ran Trimmomatic v. 0.38 (Bolger et al., 2014) to remove adapters, trim low-quality bases, and discard reads below 36 bp. We then mapped the resulting high-quality reads to the Schoen & Speed (2024) reference genome with BWA v. 0.7.17 (Li & Durbin, 2009), sorted them with samtools, and removed PCR duplicates using GATK MarkDuplicates (GATK v. 4.2.3; Van der Auwera & O’Connor, 2020). We automated the production of BAM files using Snakemake v. 9.4.0 (Mölder et al., 2021) to coordinate these pipeline steps.
We used gstacks (STACKS v. 2.66; Cachen et al., 2013) to filter out low-quality alignments (reads with insufficient mapping quality or extensive soft clipping), build loci, and call genotypes. We retained only loci present in at least 75% of the individuals within each population and across all 11 populations (filtering performed using STACKS-populations). We removed loci showing excessive heterozygosity (>0.5) to avoid issues arising when gene duplicates cannot be distinguished. We retained one SNP per locus in the vcf file created using STACKS-populations. Finally, we used a custom Python script (Extract11vcftools.sh) to run VCFtools v. 0.1.13 (Danecek et al., 2011) and separate the data into per-population vcf files, yielding the final variant set for downstream analyses. After filtering, no individuals were dropped (due to low coverage), and over 24,000 loci were surveyed from each population. The numbers of polymorphic loci ranged from 6,288 (population 4050) to 10,393 (population KR),
Demographic histories of populations
We obtained the folded site frequency spectrum (SFS) files for each population from the allele frequency data in the STACKS-populations output file populations.sumstats.tsv. As input in the Python package dadi v. 2.4.0 (Gutenkunst et al., 2009), we used the folded SFS (projected to 21 site frequency classes for ease of comparison and to reduce sparsely populated classes. We tested five contrasting demographic models. These include: (1) “constant population size”; (2) “exponential population growth”; (3) “exponential population decline”; (4) “population bottleneck”; and (5) “population bottleneck followed by exponential growth” (see Results for details). Unlike several other simulation-based packages, dadi allowed us to incorporate observed inbreeding coefficients directly into the demographic model. For this we used median FIS, as estimated by Plink (see below). Blischak et al. (2020) showed that including FIS in demographic reconstruction of partially inbreeding populations improves model parameter estimation accuracy.
Polymorphism levels and inbreeding coefficients
We obtained nucleotide diversity (π) and numbers of polymorphic loci for each population from the STACKS-populations output file populations.sumstats.tsv. We calculated inbreeding coefficients using Plink v. 1.9 (Purcell et al., 2007) with the “–het” command. Plink computes observed and expected autosomal homozygous genotype counts for each sample and provides method-of-moments inbreeding coefficient estimates per individual (FIS), which measures the reduction in heterozygosity of an individual relative to its subpopulation. Because the distributions of per-individual FIS values were skewed, we used the median to characterize levels of inbreeding in each population, referred to hereafter as FIS.
Linkage disequilibrium and recombination
We calculated pairwise linkage disequilibrium (LD) for individual SNPs as r² (Hill & Robertson, 1968) with Plink and evaluated SNPs separated by one million bp or less. We analyzed LD decay by fitting r² values to inter-SNP distances in base pairs. We estimated the recombination parameter (ρ) from the theoretical LD decay curve (Weir & Hill, 1986) using nonlinear least squares with the nls function in the R Stats package (v. 4.3.2; R Core Team, 2023).
Runs of homozygosity by descent
We analyzed runs of homozygosity by descent (HBD) for each population using the R package RZooROH v. 0.3.2.1 (Druet & Gautier, 2017). RZooROH uses a hidden Markov model to identify HBD regions and outperforms Plink when marker density is low, as in the case of reduced representation sequencing (Druet & Gautier, 2017). Lavanchy & Goudet (2023) showed via simulation that RZooROH can reliably estimate HBD parameters at densities as low as two SNPs per Mb. In our study, marker densities ranged from 9 to 14 SNPs per Mb.
To estimate the distribution of HBD segment sizes, we followed guidelines described by Druet & Gautier (2017) and tested models differing in “krates” (i.e., transition rates between hidden states in the HMM). We adjusted the models by varying two features: the number of krates (k), representing the number of age-classes of relatedness modeled by RZooROH; and the expected segment length per class, reflecting the time since shared ancestry (longer segments = more recent inbreeding). Models with k = 3 krates gave the lowest BIC values, but within the k = 3 krates model class, the BIC values were insensitive to changes in the magnitude of the krates. For each population, we recoded: (1) the proportion of HBD class 1 segments (the longest and most recently co-inherited segments); and (2) the 95th percentile of HBD segment length (which reflects recent inbreeding). To avoid potential artifacts from low RADseq coverage, we restricted summaries to HBD segments containing at least five SNPs.
Spatial autocorrelation of genotypes within populations
For the six populations where we recorded physical coordinates for plants within the 1.5 x 10 m sampling plots, we calculated spatial pairwise distance matrices. We used Plink to compute genetic distances (1 - identity by allelic state) between plants and recorded them as a matrix. We calculated spatial autocorrelation and performed permutation testing (1000 replicates) using the R package dartR v. 2.9.9.5 (Gruber et al., 2018), which implements the function gl.autocor as described in Smouse & Peakall (1999). We binned plant pairs by distance classes (1.75 m, 3.5 m, 5.25 m, 7.5 m, and 9.25 m) and used a randomization test (999 permutations) to create null distributions of Smouse & Peakall’s r for each class, against which we tested the observed values. Smouse & Peakall’s r is a pairwise genetic autocorrelation coefficient for codominant, multilocus data. The graphical results of this analysis (i.e., the autocorrelation coefficient versus against distance class) were plotted as correlograms for each of the six populations. Where such correlograms cross from positive to negative values of the genetic autocorrelation coefficient is the approximate radius of genetic patches that reflect of pollen and seed dispersal distances (Turner et al., 1982; Sokal & Wartenberg, 1983). Such patches can require many generations to become sufficiently distinct to detect (Turner et al., 1982; Sokal & Wartenberg, 1983).