Biogeographic inferences across spatial and evolutionary scales

Cite this dataset

Wishingrad, Van; Thomson, Robert (2023). Biogeographic inferences across spatial and evolutionary scales [Dataset]. Dryad. https://doi.org/10.5061/dryad.2ngf1vhsw

Abstract

The field of biogeography unites landscape genetics and phylogeography under a common conceptual framework. Landscape genetics traditionally focuses on recent-time, population-based, small geographic scale, spatial genetics processes, while phylogeography typically investigates deep past, lineage- and species-based processes at large geographic scales. Here, we evaluate the link between landscape genetics and phylogeographic methods using the Western Fence lizard (Sceloporus occidentalis) as a model species. First, we conducted replicated landscape genetics studies across several geographic scales to investigate how population genetics inferences change depending on the spatial extent of the study area. Then, we carried out a phylogeographic study of population structure at two evolutionary scales informed by inferences derived from landscape genetics results to identify concordance and conflict between these sets of methods. We found significant concordance in landscape genetics processes at all but the largest geographic scale. Phylogeographic results indicate major clades are restricted to distinct river drainages or distinct hydrologic regions. At a more recent timescale, we find minor clades are restricted to single river canyons in the majority of cases, while the remainder of river canyons include samples from at most two clades. Overall, the broad scale pattern implicating stream and river valleys as key features linking populations in the landscape genetics results, and high degree of clade specificity within major topographic subdivisions in the phylogeographic results, is consistent. As landscape genetics and phylogeography share many of the same objectives, synthesizing theory, models, and methods between these fields will help bring about a better understanding of ecological and evolutionary processes structuring genetic variation across space and time.

Methods

ddRAD LIBRARY PREPARATION AND GENOMIC SEQUENCING

We followed Peterson et al.'s (2012) method for genomic library preparation, with some modifications. For each individual, we extracted high-molecular-weight genomic DNA using a standard phenol-chloroform extraction protocol (Tsai et al. 2019). We measured DNA concentrations using a Qubit fluorometer, and for each sample, we digested 0.5 μg of DNA for 3 hours with the SbfI and NlaIII restriction enzymes. We then purified these fragments with Agencourt AMPure beads before ligating barcoded Illumina adaptors onto the fragment ends. All barcodes differed by at least two base pairs to reduce demultiplexing error rates. We then pooled equimolar amounts of each sample before conducting size selection using a Pippin Prep to select fragments between 400 and 550 bp in length. We used proofreading Taq and Illumina’s index primers for final library amplification, limiting amplification to 8-10 cycles to reduce PCR bias. We quantified the final library concentration using a Qubit fluorometer at high sensitivity. Samples were packed on dry ice and sent to the Vincent J. Coates Genomics Sequencing Lab at UC Berkeley for sequencing. Quantitative PCR was used to determine the concentration of adapter-associated fragments, and a BioAnalyzer run confirmed fragment sizes as a quality control measure prior to sequencing. Final libraries were sequenced (100-bp or 150-bp, single-end runs) on Illumina HiSeq 4000 and NovaSeq SP platforms across a total of four sequencing lanes.
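The requirement that all barcodes differ by at least two base pairs can be checked with a pairwise Hamming-distance calculation. The following is a minimal sketch only: the function names and the 5-bp barcodes are hypothetical and are not the barcodes used in this study.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_barcode_distance(barcodes):
    """Smallest pairwise Hamming distance found in a set of barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical barcodes for illustration only (not the study's barcode set).
example_barcodes = ["ACGTA", "ACCAA", "TGGTA", "TTCAC"]

# A set satisfies the design rule if no two barcodes are within one mismatch.
meets_rule = min_barcode_distance(example_barcodes) >= 2
```

With a minimum distance of two, a single sequencing error in a barcode cannot turn one valid barcode into another, which is why such a set tolerates one-mismatch barcode rescue during demultiplexing.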

BIOINFORMATICS PROTOCOLS FOR SNP DATA

We inspected raw Illumina reads for sample quality using FastQC (Andrews 2010). We used STACKS v1.48 (Catchen et al. 2013) for initial sequence data processing: demultiplexing samples, rescuing barcodes with at most one mismatch, cleaning the data by removing any read with an uncalled base, and truncating all reads to 95 bases (because read lengths varied between the 100-bp and 150-bp sequencing runs). Prior to analysis, we concatenated sequencing data for the same individual sequenced on different lanes. Then, we used a reference-based approach in ipyrad v.0.9.42 (Eaton & Overcast 2020) to conduct individual-based analyses. Compared to de novo assembly approaches, reference-based approaches have much lower error rates, higher accuracy, and less bias (Rochette & Catchen 2017). We used the annotated S. occidentalis genome published by Harris et al. (2015) as the reference genome for the analysis.
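As an illustration of two of the read-cleaning steps described above (discarding reads with an uncalled base and truncating reads to a common length), here is a minimal sketch. It is not the STACKS implementation; the function name, the tuple layout of a read, and the choices to check the full read for uncalled bases and to drop reads shorter than the target length are assumptions made for this example.

```python
def clean_reads(reads, length=95):
    """Yield (name, sequence, quality) tuples after simple cleaning.

    reads: iterable of (name, sequence, quality) tuples (hypothetical layout).
    Reads containing an uncalled base ('N') are dropped, and the remaining
    reads are truncated to `length` bases.
    """
    for name, seq, qual in reads:
        if "N" in seq.upper():
            continue  # uncalled base anywhere in the read (assumption)
        if len(seq) < length:
            continue  # assumption: reads shorter than the target are dropped
        yield name, seq[:length], qual[:length]


# Made-up reads mimicking a mix of 100-bp and 150-bp runs.
example_reads = [
    ("read1", "A" * 100, "I" * 100),
    ("read2", "A" * 50 + "N" + "A" * 49, "I" * 100),
    ("read3", "C" * 150, "I" * 150),
]
cleaned = list(clean_reads(example_reads))
```

Truncating to a single length puts reads from both run types on equal footing before assembly.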

IPYRAD ANALYSIS PROTOCOL

We followed the analysis recommendations and default settings specified by the program authors, with some modifications specific to our datasets. We filtered and edited demultiplexed reads by removing reads with 5 or more low-quality base calls (Q<20) and by trimming bases from the 3’ end of reads if their quality scores fell below 20, a threshold that corresponds to a 99% probability of a correct base call. Reads were then mapped to the reference genome using BWA (version 0.7.17-r1188), and clusters were aligned using MUSCLE (version 3.8.31), requiring a minimum depth of 6, which is the minimum depth at which a heterozygous base call can be distinguished from sequencing error. We jointly estimated heterozygosity and error rate, specifying a maximum of two alleles per site in each consensus sequence, and removed alignments with a high proportion (a 5% threshold) of heterozygous base calls, as poor alignments tend to have an excess of heterozygous sites. To remove poor alignments in repetitive regions from the final data set, we allowed a maximum of 20% SNPs per locus and removed alignments with more than 8 indels per locus. We set the maximum proportion of shared polymorphic sites in a locus to 0.5, as heterozygous sites shared across samples likely result from clustering of paralogs with fixed, rather than heterozygous, sites. We also excluded any samples with fewer than 100,000 reads. We retained all loci shared by at least 75% of samples in the dataset, as this provided a good balance between missingness and the number of recovered SNPs. Altogether, for the landscape genetics analysis, we retained a sample of 49 individuals with 11,215 loci in the assembly and 25.9% missing sites in the final SNP matrix. We output a SNP file containing one randomly selected variable site per locus and used it to calculate individual-based pairwise genetic distances, expressed as the proportion of shared alleles, among individuals collected across the extent of our sampling range.
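To make the pairwise distance concrete, the sketch below computes one common form of a proportion-of-shared-alleles distance for two diploid individuals at biallelic SNPs. The genotype coding (0, 1, or 2 copies of the alternate allele, with None for missing data), the function name, and the handling of missing data are assumptions for illustration; the software and exact formula used in the study are not specified here.

```python
def shared_allele_distance(g1, g2):
    """Return 1 minus the proportion of shared alleles between two individuals.

    g1, g2: sequences of genotypes coded as alternate-allele counts (0, 1, 2),
    with None marking missing data. Loci missing in either individual are
    skipped (an assumption of this sketch).
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        # For a biallelic diploid locus, the shared-allele count is 2 - |a - b|.
        shared += 2 - abs(a - b)
        compared += 1
    if compared == 0:
        return float("nan")
    return 1.0 - shared / (2 * compared)


# Made-up genotypes at four SNPs for two individuals.
ind_a = [0, 1, 2, None]
ind_b = [0, 2, 0, 1]
distance = shared_allele_distance(ind_a, ind_b)
```

Because only one SNP per locus is retained, each compared site comes from a different locus, which limits the effect of linkage within loci on the distance.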

For the phylogenetic dataset, we followed the same protocol as above for a sample of 200 S. occidentalis individuals that broadly encompass the Sierra Nevada range and one S. graciosus as an outgroup. We removed invariant sites from the alignment and retained all positions with data for at least 95% of the individuals (n=191) in the dataset. The resulting dataset for the phylogenetic analysis contained 27,872 variable sites in the final sequence matrix, with 17.1% missing sites.
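The site filter described above (variable positions with data for at least 95% of individuals) can be sketched as follows. The 95% threshold applied to 201 individuals (200 ingroup plus one outgroup) rounds up to the 191 individuals stated in the text. The function names, the set of missing-data symbols, and the treatment of every other character as a distinct called state are assumptions of this example, not the authors' pipeline.

```python
import math

MISSING = {"N", "-", "?"}  # assumed missing-data symbols


def min_samples_required(n_samples, min_fraction):
    """Smallest number of samples that satisfies the coverage fraction."""
    return math.ceil(n_samples * min_fraction)


def filter_sites(alignment, min_fraction=0.95):
    """Return indices of variable columns with enough non-missing data.

    alignment: dict mapping sample name to an aligned sequence (equal lengths).
    """
    seqs = list(alignment.values())
    needed = min_samples_required(len(seqs), min_fraction)
    kept = []
    for i in range(len(seqs[0])):
        called = [s[i].upper() for s in seqs if s[i].upper() not in MISSING]
        # Keep a column only if it is well covered and not invariant.
        if len(called) >= needed and len(set(called)) > 1:
            kept.append(i)
    return kept


# Toy alignment with made-up sequences; a 75% threshold suits four samples.
toy = {"s1": "ACGT", "s2": "ACTN", "s3": "ACTN", "s4": "AGT-"}
kept_columns = filter_sites(toy, min_fraction=0.75)
```

In the toy alignment, the first column is invariant and the last has too little data, so only the two middle columns are retained.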

Funding

National Science Foundation, Award: DEB 1754350