Data from: Evidence for ephemeral ring species formation during the diversification history of Western Fence Lizards (Sceloporus occidentalis)
Data files
Feb 08, 2021 version files 4.25 MB
-
Sceloporus_Dryad_submission.zip
-
SOCC-README.txt
Abstract
Divergence is often ephemeral, and populations that diverge in response to regional topographic and climatic factors may not remain reproductively isolated when they come into secondary contact. We investigated the geographic structure and evolutionary history of population divergence within Sceloporus occidentalis (Western Fence Lizards), a habitat generalist with a broad distribution that spans the major biogeographic regions of Western North America. We used double digest RAD sequencing to infer population structure, phylogeny, and demography. Population genetic structure is hierarchical and geographically structured with evidence for gene flow between biogeographic regions. Consistent with the isolation-expansion model of divergence during Quaternary glacial-interglacial cycles, gene flow and secondary contact are supported as important processes explaining the demographic histories of populations. Although populations may have diverged as they spread northward in a ring-like manner around the Sierra Nevada and southern Cascade Ranges, there is strong evidence for gene flow among populations at the northern terminus of the ring. We propose the concept of an “ephemeral ring species” and contrast S. occidentalis with the classic North American ring species, Ensatina eschscholtzii. Contrary to expectations of lower genetic diversity at northern latitudes following post-Quaternary-glaciation expansion, the ephemeral nature of divergence in S. occidentalis has produced centers of high genetic diversity for different reasons in the south (long-term stability) versus the north (secondary contact).
Methods
Sampling and RAD sequencing
A total of 108 Sceloporus occidentalis were sampled from 87 sites throughout their range in western North America (Fig. 1; Table S1). Double digest RAD sequencing data (ddRADseq) were collected using standard protocols (Peterson et al. 2012). Genomic DNA (500 ng per sample) was double-digested with 20 units each of a rare cutter SbfI (restriction site 5'-CCTGCAGG-3') and a common cutter MspI (restriction site 5'-CCGG-3') in a single reaction with the manufacturer recommended buffer (New England Biolabs) for 8 hours at 37°C. Fragments were purified with SeraPure SpeedBeads before ligation of barcoded Illumina adaptors onto the fragments. The libraries were size-selected (between 415 and 515 bp after accounting for adapter length) on a Blue Pippin Prep size fractionator (Sage Science). The final library amplification used proofreading Taq and Illumina's indexed primers. The fragment size distribution and concentration of each pool was determined on an Agilent 2200 TapeStation, and qPCR was performed to determine library concentrations before multiplexing equimolar amounts of each pool for sequencing on a single Illumina HiSeq 2500 lane (50-bp, single-end reads) at the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley.
Bioinformatics
The raw Illumina reads were filtered and demultiplexed using PyRAD v.3.0.3 (Eaton 2014) and assembled using Stacks v.2.54 (Catchen et al. 2013; Rochette et al. 2019) and STACKS_PIPELINE v.2.4 (Portik et al. 2017). Reads potentially arising from PCR duplicates during sequencing were not explicitly accounted for because they occur at low frequency and presumably do not significantly impact most population genetic parameter estimates (Schweyen et al. 2014). During filtering, sites with < 99% base call accuracy (Phred score = 20) were converted to missing data and reads with ≥ 10% missing sites were discarded. No barcode mismatches were allowed during demultiplexing. Reads were aligned into stacks with a minimum depth of coverage of 5 and a maximum of 2 nucleotide differences between stacks. The minor allele count was set to 2 to eliminate singletons, which reduces errors in model-based clustering methods (Linck & Battey 2019). Loci that were invariant, non-biallelic, or absent from > 20% of samples were removed. Samples with > 70% missing data were also removed. One random variable site per locus was sampled to minimize the chance of retaining physically linked SNPs. The resulting unlinked SNP dataset was used to generate input files for downstream analyses.
Population structure
Two methods were used to estimate population structure. The maximum likelihood method ADMIXTURE v1.3.0 (Alexander et al. 2009) was used to estimate (1) the number of populations (K) and (2) admixture proportions of samples to identify putative "hybrids" of mixed population ancestry. Samples with admixture proportions < 0.95 were considered admixed. The cross-validation errors for analyses from K = 1 to K = 10 were compared to determine which K minimized group assignment error; e.g., the K with the lowest cross-validation (CV) error is the model best supported by the data. However, parametric methods for estimating population structure are sensitive to violations of model assumptions prevalent in natural populations (Lawson et al. 2018). Therefore, Discriminant Analysis of Principle Components (DAPC), a non-parametric method, was also used to estimate population structure as implemented in the R package adegenet (Jombart et al. 2010; Jombart & Ahmed 2011). First, principal components of genetic variation were estimated without any assumptions about grouping information or the underlying population genetic model. Second, the find.clusters function was used to evaluate successive values of K from 1 to 10 with the Bayesian information criterion (BIC) to determine the most likely number of populations. The K value with the lowest BIC score was considered optimal, although it is important to consider that other K values can also provide biologically realistic models.
When multiple SNPs are present across loci, the random selection of one SNP per locus represents one of many possible dataset combinations. To assess the stability and sensitivity of K to this SNP selection procedure, we performed random selection of one SNP per locus 100 times to create 100 replicates of unlinked SNP datasets. These dataset permutations were created using STACKS_PIPELINE. The optimal K values for all 100 replicates were then evaluated with ADMIXTURE and DAPC, and summarized to determine variation in support for K values.
Network and population tree analysis
Three methods were used to visualize intraspecific genetic relationships and estimate relationships among populations. Network methods can depict relationships that are not necessarily bifurcating and can identify admixed samples (Posada and Crandall 2001). A genetic network was constructed from the concatenated SNP data (uncorrected “p” distances) using the NeighborNet algorithm (Bryant and Moulton 2004) in SplitsTree v4.6 (Huson and Bryant 2006). Phylogenetic analyses were conducted using SNAPP v1.5.0 (Bryant et al. 2012) and TreeMix v1.13 (Pickrell & Pritchard 2012). One key difference between these methods is that SNAPP is migration-free and produces strictly bifurcating trees, while TreeMix models migration events as non-bifurcating, internal branches among populations. The population assignments required for the phylogenetic methods were obtained from the ADMIXTURE model for K=5 (see Results). Admixed samples (admixture proportion < 0.95) were excluded prior to using SNAPP to avoid biasing topologies and parameter estimates (Leaché et al. 2014).
SNAPP was implemented in BEAST v2.6.3 via CIPRES (Bouckaert et al., 2019; Miller et al. 2010). The mutation rate priors (u and v) were set to 1, and the Yule birth rate prior (λ) was set to 10. A diffuse prior on the expected divergence between populations (θ) was assigned with a gamma distribution (1, 250) corresponding to a mean of 0.004. Three independent analyses were run for 1 million generations each, sampling every 1000 generations. The posterior distribution of trees was summarized using DensiTree (implemented in BEAST) to visualize uncertainty in the topology and branch lengths. A maximum clade credibility (MCC) tree was summarized using TreeAnnotator (BEAST) after discarding the first 20% of samples as burn-in. Branch lengths were converted to time using a human mutation rate of 1.0 × 10-8 substitutions per site per generation (Lynch, 2010; Scally and Durbin, 2012) and a generation time 2 years for S. occidentalis (Jameson & Allison 1976).
TreeMix was used to estimate a maximum likelihood population tree and infer gene flow between diverged populations. The analysis included all samples that passed missing data thresholds (see Results), including admixed samples. SNPs with missing allele counts in ≥1 population were removed from the input dataset prior to analysis. Block resampling across groups of 100 unlinked SNPs was implemented to model uncertainty in the sample covariance matrix and to effectively explore parameter space. A root position was selected for the population tree according to the topology of the SNAPP tree (see Results). The analysis added 1 – 6 migration events (m) to the tree, and each analysis included 500 bootstrap replicates. Replicates were summarized using the R package OptM (https://CRAN.R-project.org/package=OptM). The value of m that explained most of the variation in the data was determined by comparing the average increase in explained variance with each added migration event and the rate of change in the likelihood with the Evanno method (Evanno et al. 2005).
We further tested for admixed ancestry with the threepop and fourpop tests in TreeMix (Keinan et al., 2007; Reich et al., 2009). The threepop analysis tests for admixture within all possible triplets by evaluating the null hypothesis that the relationship within the triplet can be explained by a simple bifurcating tree with a common ancestor. A significantly negative value of the f3 statistic suggests the presence of admixture within the triplet, and that relationship cannot be captured by a simple bifurcating tree. The fourpop analysis tests for treeness across all possible quartets by evaluating the null hypothesis that two pairs of populations can only be connected by one internal branch. An f4 statistic that significantly deviates from zero suggests the presence of admixture within the quartet, meaning more than one internal branch can connect two pairs of populations.
Demographic models
Demographic models were compared to examine potential gene flow and secondary contact between populations. Divergence scenarios were modeled using joint site frequency spectra (JSFS) in dadi v2.1.0 (Gutenkunst et al. 2009), and model optimization routines were performed using DADI_PIPELINE V3.1.5 (Portik et al. 2017). Parameters of interest included divergence times (T), population sizes (nu), and the amounts and directions of gene flow (m). Primary analyses were conducted on geographically contiguous populations that reflect phylogenetic relationships among populations (see Results). Population abbreviations used were: Southern California (SCA), West Sierra Nevada (WSN), East Sierra Nevada (ESN), Pacific Northwest (PNW), Great Basin (GB). Population sets included WSN/PNW, ESN/GB; and ESN/GB/SCA. Secondary analyses were conducted on geographically contiguous populations that do not reflect direct phylogenetic relationships – these were used to validate whether the presence of admixed individuals at population boundaries can be explained by contemporary gene flow. These population comparisons involved WSN/ESN and PNW/ESN. There are three higher-level divergence scenarios represented by the models including divergence with isolation, divergence with isolation followed by secondary contact, and divergence with continuous gene flow. Eighteen 2D population models were used, which represent variations of the three main model scenarios. These also allow for symmetric or asymmetric gene flow and the possibility of size changes, and are described in greater detail elsewhere (Portik et al. 2017). Four new 3D models were created to test divergence scenarios in ESN/GB/SCA. The results of the 2D ESN/GB modeling were used to fix this aspect of the 3D model. The four models therefore vary in aspects of migration and isolation between SCA and the ancestor of ESN/GB, and between SCA and ESN.
The SNP datasets for each population comparison were obtained by re-running Stacks on the most inclusive subset of samples (e.g., only individuals belonging to populations in the model) to maximize the number of SNPs for modeling. Input files for dadi were created using the Convert_tsv_to_dadi.py module in STACKS_PIPELINE and sample sizes were projected down to account for missing data and maximize the number of segregating sites (Table S3). The optimization routine in DADI_PIPELINE was used to run four consecutive rounds using the following settings: replicates = 10, 10, 15, 25; maxiter = 3, 5, 10, 15; fold = 3, 2, 2, 1, Nelder-Mead optimizer. Replicates are considered independent within an optimization routine, and the top-ranked replicate should represent the global optimum. Although unlikely, it is possible for the analysis to become trapped on a local optima. This can be checked by running the optimization routine more than once. If the log-likelihood values of the top-ranked replicates are highly similar across analyses, this provides evidence for convergence on the global optima. For each model, we performed two independent runs of the optimization routine and ensured that the top-ranked replicates had similar log-likelihood scores.
By using unlinked SNPS, log-likelihood values returned by dadi represent true likelihood values, rather than composite likelihood values. Models were therefore compared using the Akaike information criterion (AIC) including AIC scores, ΔAIC scores and Akaike weights (ωi) (Burnham & Anderson, 2002). The top-ranked model for each population comparison was then subjected to a goodness-of-fit test to verify it represented a reasonable explanation of the JSFS. Goodness-of-fit tests were performed as described in Barratt et al. (2018), using 100 parametric bootstraps for the top-ranked model. The test was considered passed if the empirical log-likelihood was contained within distribution of log-likelihood values obtained from the bootstraps.
Effective migration and landscape connectivity
We used EEMS (Estimating Effective Migration Surfaces; Petkova et al. 2016) to identify corridors of spatial connectivity and geographic barriers to migration and to estimate relative genetic diversity across the species range. EEMS compares empirical dissimilarities in genetic and geographic structure to those expected under an isolation-by-distance (IBD) model to estimate an effective migration surface. Two key parameters are estimated by the model: effective diversity rates (q) quantify the genetic similarity of two individuals from the same location, and effective migration rates (m) quantify how genetic similarity decays across space. Georeferenced samples are assigned to one of a specified number of demes, and dissimilarities are calculated between adjacent demes, with the assumption that an infinite number of demes would approach a continuous estimate of gene flow across the landscape. Three replicate EEMS runs (1 million iterations, 50,000 burn-in, sampling every 1,000 iterations) assuming 700 demes were compared to check for convergence before combining the analyses to summarize the final results. To visualize the spatial structure of genetic diversity (q) and migration (m), contour maps were produced using the R package rEEMSplots (https://github.com/dipetkov).