Pronounced genetic separation among varieties of the Primula cusickiana species complex, a Great Basin endemic
Koontz, Austin; Pearse, William; Wolf, Paul (2021), Pronounced genetic separation among varieties of the Primula cusickiana species complex, a Great Basin endemic , Dryad, Dataset, https://doi.org/10.5061/dryad.12jm63xzv
Distinguishing between populations with strong genetic structure and unique species is a common challenge in systematics, especially for taxa occurring in fragmented habitats where allopatric speciation may be widespread and distinct groups may be morphologically similar. Such is often the case with species complexes across sky island environments. In these scenarios, biogeography may help to explain the taxonomic relations between species complex members, and restriction site-associated DNA (RAD) sequencing methods are commonly used to compare closely related species across thousands of loci. Here we use RADseq to clarify the boundaries separating the geographically distinct but morphologically similar varieties of the Primula cusickiana species complex, and to contextualize past findings of strong genetic structure among populations within varieties. Our genetic analyses demonstrate pronounced separation between isolated populations of this Great Basin endemic, indicating that the current varietal classification of complex members is inaccurate, and emphasizing their conservation importance. We discuss how these results correspond to recent biogeographical models used to describe the distribution of other sky island taxa in western North America. Our findings also fit into a wider trend observed for alpine Primula species complexes, and we consider how edaphic specialization and heterostylous breeding systems may be contributing to frequent diversification via allopatric speciation in this genus.
For each sample in a wild population (located using herbaria data), two leaves were removed and placed in labeled paper envelopes, which were stored on silica crystals to keep samples dry. Additionally, 2 herbaria specimens of Primula capillaris were included with the other 87 field collected samples. After collection completed, all samples were placed into Qiagen Collection Microtubes (catalog number 19560) and sent to the University of Wisconsin-Madison Biotechnology Center, for DNA extraction, library prep, and DNA sequencing.
DNA was extracted using the QIagen Dneasy mericon 96 QIAcube HT kit, and quantified using the Quant-iT PicoGreenR dsDNA kit (Life Technologies).
Libraries were prepped following Elshire et al. 20210. ApekI restriction enzyme was used to digest 100 ng of DNA. Following digestion, Illumina adaptor barcodes were ligated onto DN fragments using T4 ligase. Size selection for 300--450bp fragments was run on a PippinHT, and samples were purified using a SPRI bead cleanup after size selection. Adapter-ligated samples were then pooled and amplified, and washed using a post-amplification bead cleanup. Libraries were then sequenced on an Illumina NovaSeq 6000 2x150.
Raw FASTQ data files were demultiplexed and processed using steps 1—7 of the ipyrad software, version 0.9.31 (Eaton and Overcast 2020). Single nucleotide polymorphisms (SNPs) recognized by ipyrad were used as the basis for variation between individuals for downstream analyses, and assemblies were generated de novo. All ipyrad and STRUCTURE parameter files, as well as R scripts used for analysis and data visualization, can be found in the Supplementary Materials available on Dryad (Koontz et al. 2022), as well as on GitHub (github.com/akoontz11/Primula/). Raw, demultiplexed sequencing data can be accessed on the NCBI Sequence Read Archive (SRA; accession number PRJNA705310).
Complex-wide Genetic Survey — For our complex-wide genetic survey, we ran ipyrad twice: we used the results from our initial run to confirm sequencing consistency for replicate samples, and to identify samples with low coverage. For both runs, demultiplexed sequences were paired and merged, and low quality bases, adapters, and primers were filtered prior to SNP calling. Default values were used for the ipyrad parameters in these steps, as well as for the clustering threshold (clust_threshold; 0.85) and minimum sequencing depth (mindepth_statistical; 6) parameters.
For our initial run, we specified a minimum number of samples per locus (min_samples_locus) parameter of 10, in order to obtain loci shared between two to three sample locations for any variety. Using the results from this run, we used the Python script vcf2Jaccard.py to compare samples with replicates by calculating the mean Jaccard similarity coefficients between all samples. We found that all replicates matched highly with their corresponding samples (Fig. S1).
After merging replicates and removing low coverage (generally, less than 30 loci in the final assembly) samples from the dataset, 82 of our 87 original samples remained for our complex-wide analysis. We reran ipyrad (steps 1-7) using these 82 samples to select for loci specific to this subset. We used a min_samples_locus parameter of 32 for this second run, to match the ratio of minimum samples per locus used in our initial run; ipyrad default values were used otherwise. Because very low numbers of loci were retrieved for both herbarium specimens of P. capillaris (possibly due to the age of these specimens), we were unable to include P. capillaris in downstream clustering analyses.
variety Specific Clustering — In addition to our complex-wide survey, we were interested in exploring population structure within variety maguirei which could not be resolved using genetic loci shared across all species complex members. To do so, we ran ipyrad on just the 18 maguirei samples used in our complex-wide survey. Because five samples from each of the upper Logan canyon sampling sites were included in our ipyrad assembly, we specified a min_samples_locus parameter of 5; ipyrad default parameter values were used otherwise.
STRUCTURE — To visualize relations between complex members across their geographic range, and to determine the number of identifiable genetic clusters within the complex, we used the program STRUCTURE version 2.3 (Pritchard et al. 2000). STRUCTURE uses Bayesian clustering analysis to probabilistically assign individuals to one or more of K source populations, where the loci within each population are assumed to be in Hardy-Weinberg proportions and linkage equilibrium. For all STRUCTURE runs, we used a burnin length of 50,000, and 100,000 MCMC reps after burnin. For our complex-wide survey, we ran STRUCTURE for K values of 2—16, with 50 replicates per K value. For our maguirei-only analyses, we ran STRUCTURE for K values of 2—6, with 50 replicates per K value. We used the CLUMPAK server (Kopelman et al. 2015) to summarize results across replicates for each K value, and to build STRUCTURE plots.
For all of our STRUCTURE analyses, we used two different methods to determine the “optimal” K value: the method described in the STRUCTURE manual (Pritchard et al. 2000), which identifies the K value with the greatest likelihood (the natural logarithm of the posterior probability of the genetic data at a specific K value); and the method described in Evanno et al. (2005), which measures the second order rate of change in the posterior probability of the genetic data between successive K values (ΔK). While the ΔK method described by Evanno was designed to better accommodate scenarios in which unequal migration is occurring between putative source populations, it is worth noting that both techniques are ad hoc and require interpretation (as described by their authors). Furthermore, there is an inherent difficulty in inferring an unambiguous number of genetic clusters from any given set of populations (Novembre 2016). Therefore, following recommendations in the original STRUCTURE manual (Pritchard et al. 2000), we generated STRUCTURE outputs within a range of K values, and examined how relationships changed or remained constant across values. Because robust species boundaries consider genetic data in concert with other types of data (Carstens et al. 2013), we took the geographic distribution of each population into consideration when determining a biologically reasonable value of source populations. We were particularly interested in visualizing the extent of genetic differences between geographically distinct populations within varieties across K values, and how different values of K illustrated the extent of admixture (as indicated by individuals having portions of their genomes assigned to different source populations) between adjacent populations. Finally, given evidence for strong genetic structure between the populations of variety maguirei (Wolf and Sinclair 1997; Bjerregaard 2004), we were interested to see if either population in Logan Canyon was more closely related to another species complex population than it was to the neighboring population within Logan Canyon.
SplitsTree — To better visualize the relationships among the taxa sampled in this survey, we used the NeighborNet split network algorithm (Bryant and Moulton 2002). NeighborNet works by iteratively grouping pairs of taxa together based on similarities, in the same way that a neighbor joining tree is built. However, rather than a tree, the end result is a split network, in which splits in the taxa are represented by parallel lines, and conflicting signals in relations of taxa are represented by boxes. The ability to represent different phylogenetic hypotheses simultaneously allows this method to accommodate processes which undermine traditional phylogenetic analyses, such as scenarios involving recombination and hybridization. This makes the NeighborNet algorithm useful for summarizing RADseq data, where patterns of loci across individuals may imply different genealogies. For our complex-wide survey, we passed the output from our ipyrad de novo assembly to the software SplitsTree4 v4.17.1 (Huson and Bryant 2006), which we used to construct the NeighborNet split network.
Discriminant Analysis of Principal Components — In addition to STRUCTURE, we analyzed the results of our complex-wide survey using Discriminant Analysis of Principal Components (DAPC; Jombart et al. 2010) in the package adegenet in R version 3.6.3 (R Core Team, 2020). DAPC is a statistical technique designed to accommodate the size of genomic data sets and capable of differentiating within-group variation from between-group variation. SNP data is first transformed using a principal components analysis (PCA), and then k-means clustering is run to generate models and likelihoods corresponding to each number of population clusters. The best-fitting model, and so the best-supported number of populations, is assessed using the models’ Bayesian Information Criterion (BIC) scores. We chose to utilize DAPC in addition to STRUCTURE to visualize population clusters in a PCA format, and to determine whether the supported number of populations was congruent between methods, indicating a more robust determination of the number of species contained within the complex (Carstens et al. 2013).
FST Estimates—Because we wanted to measure the extent of genetic variance within the groups analyzed, we used the VCFtools software (Danecek et al. 2011) to generate weighted FST estimates (Weir and Cockerham 1984). We generated an FST estimate for our complex-wide analysis (across all populations of all P. cusickiana varieties) as well as for the samples included in our variety maguirei-only analysis.
In addition to the raw data (demultiplex, paired-end FASTQ files) used in this analysis, the following two directories are included in this repository:
- "Code" contains the scripts used in the analysis of this dataset, and matches what can be found on the GitHub website. The README within the repository explains the structure of this folder.
- "Figures" contains the 5 supplementary figures referenced in the manuscript (as .tiff files), as well as the Table of sample populations (as a .xlsx file).