Population genetics of the Four Corners potato (Solanum jamesii)
Data files
Feb 14, 2024 version files 6.82 MB
-
Appendix_S1.xlsx
-
Appendix_S3.xlsx
-
Appendix_S4.xlsx
-
Appendix_S5.xlsx
-
Appendix_S7.xlsx
-
README.md
Apr 30, 2024 version files 6.82 MB
-
Fst_Values.xlsx
-
Geographic_Patterns.xlsx
-
Population_Metadata.xlsx
-
README.md
-
SNP_Individual_Data.xlsx
-
SNP_Population_Data.xlsx
Abstract
Premise: The domestication of wild plant species can begin with gathering and transport of propagules by Indigenous peoples. The effect on genomic composition, especially in clonal, self-incompatible perennials would be instantaneous and drastic with respect to new, anthropogenic populations subsequently established. Reductions in genetic diversity and mating capability would be symptomatic and the presence of unique alleles and genetic sequences would reveal the origins and ancestry of populations associated with archaeological sites. The current distribution of the Four Corners potato, Solanum jamesii Torr. in the Southwestern USA, may thus reflect the early stages of a domestication process that began with tuber transport.
Methods: Herein genetic sequencing (GBS) data are used to further examine the hypothesis of domestication in this culturally significant species by sampling 25 archaeological and non-archaeological populations.
Key Results: Archaeological populations from Utah, Colorado, and northern Arizona have lower levels of polymorphic loci, unique alleles, and heterozygosity than non-archaeological populations from the Mogollon Region of central Arizona and New Mexico. Principle components analysis, Fst values, and structure analysis revealed that genetic relationships among archaeological populations did not correspond to geographic proximity. Populations in Escalante, Utah were related to those on the Mogollon Rim (400 km south) and had multiple origins and significant disjunctions with those in Bears Ears, Chaco Canyon, and Mesa Verde.
Conclusions: Movement of tubers from the Mogollon Region may have occurred many times and in multiple directions during the past, resulting in the complex genetic patterns seen in populations from across the Four Corners Region.
README: Population Genetics of the Four Corners Potato (Solanum jamesii)
https://doi.org/10.5061/dryad.cz8w9gj9w
Gene sequencing used to analyze 682 plants from 25 populations of the Four Corners Potato (Solanum jamesii).
Description of the data and file structure
Population Metadata - Metadata for 25 populations of Solanum jamesii sampled for genetic analysis. A = archaeological population, NA = nonarchaeological population. Population size refers to estimates of the typical number of stems observed at sample time.
Allelic Variation - Relationship between polymorphism and unique alleles in 25 populations of Solanum jamesii. Red points are populations associated with archaeology, blue points are non-archaeological populations. Arrows indicate overlapping points. Dashed lines represent 95% confidence intervals around the generalized linear model (GLM). Correlation coefficient (r) and probability (P) of the GLM are indicated. See 'Population Metadata' for population data. R code is published at Dryad (below).
Geographic Patterns - Latitudinal patterns of genetic diversity in archaeological (A, red) and non-arachaeological (NA, blue) populations of Solanum jamesii. Points representing differences between an A population (ELMA) and an N population (ELMC) that are separated by 1.5 km indicate a localized founder effect. See 'Population Metadata' for population data.
Fst Values - Genetic Differentiation between 25 populations of Solanum jamesii using Wright's Fst statistic (Wright 1965, Evolution 19:395–420. doi: 10.1111/j.1558-5646.1965.tb01731x). See spreadsheet for 'Population Metadata'.
SNP Population Data - Genomic SNP data for structure analysis of 25 populations of Solanum jamesii. Data for 680 individual plants (left) and mean values for populations (right), shown with figures (below).
*Clonal Variation *- Relationships between probability of sampling clones and measures of genetic diversity for archaeological (red) and non-archaeological (blue) populations of Solanum jamesii. Arrows indiate overlapping points. Dashed lines represent 95% confidence intervals around the generalized linear model (GLM). Correlation coefficient (r) and probability (P) of the GLM are indicated. See 'Population Metadata' for population data. R code is published at Dryad (below).
SNP Individual Data - Raw SNP data for 682 individuals from 25 populations of Solanum jamesii.
Sharing/Access information
N/A
Code/Software
R code used to generate plots in Appendices S2 and S6 using R Programming Language (R Core Team 2022). Raw data (% polymorphic loci, # unique alleles, P-clone %) for the 25 populations found in Appendix S1.
par(mar=c(5,6,4,2))
attach(dataset)
with(dataset, plot(y, x, ylab="Unique Alleles (#)", xlab="Polymorphic Loci (%)", xlim = c(20,80), pch=21, col="blue", bg="blue", cex=1.5, cex.lab=1.5, cex.axis=1.5))
GLM<-glm(y ~ x, family=poisson(link="log"), data=dataset)
summary(GLM)
curve(predict(GLM, data.frame(%polymorphicloci=x),type="resp"),add=TRUE, lwd=2)
Pred <- predict(GLM, data.frame("%polymorphicloci"=seq(20,80)), se = T, type="response")
with(Pred, lines(20:80, fit+1.96*se.fit, lty=2, lwd=1.5, col="black"))
with(Pred, lines(20:80, fit-1.96*se.fit, lty=2, lwd=1.5, col="black"))
Methods
Selection of sample sites – The S. jamesii accession database contains a total of 169 extant occurrences from across the entire range of the species (Bamberg et al., 2003, with updates), of which 36 are associated with ancient sites (e.g. pueblos, granaries, habitation structures, kivas). These were divided into three latitudinal strata, each containing roughly a third of the total occurrences. From each stratum, a total of 12 primary and secondary candidate sites were randomly chosen. Candidate sites were surveyed in 2019, 2020, and 2021, and depending on monsoon rainfall, sampled for DNA analysis (in dry years a secondary site might be substituted for a primary site if plants were not present in the latter). In all we were able to obtain leaf samples and site metadata from 14 archaeological (A) and 11 non-archaeological (NA) populations (Fig. 1 and Data available from Dryad at https://doi.org/10.5061/dryad.cz8w9gj9w).
Field sampling – Leaf samples of +30 individuals (optimized by Bamberg and del Rio, 2004) were obtained from multiple (6 to 8) widely separated patches in each of the 25 study populations of S. jamesii. There were placed in individual paper tea bags and immediately submerged in silicone gel and/or powder to desiccate. Each sampled individual was geo-referenced with sub-meter GPS (Eos Arrow 100 GNSS, Anatum GeoMobile Solutions, Portland, Oregon, USA), and notes taken on habitat and archeological features (if present within a 300m radius).
Sequencing
DNA extraction and GBS library construction – Plant DNA was extracted from tissue for each of the samples collected in the collecting trips using a QIACube HT (Qiagen, Germantown, Maryland, USA), an automated extraction robot, which makes high throughput extractions in 96-well plates. These DNA samples were then utilized for Genotyping by Sequencing (GBS), using 48- and 96-plex plates. The first step was to create a reduced representation of the genomes through restriction endonucleases. This cleaved genomic DNA into manageable fragments for PCR and sequencing. Previous work in Solanum jamesii determined that the enzyme combination of PstI and BfaI was the most effective for GBS (Bamberg et al., 2021). After the enzymatic cleavage the sticky ends of DNA were used as anchors for the ligation of barcoded adaptors that served (a) to identify each sample individually from the others during pooled DNA sequencing, (b) as primer binding sites for PCR amplification of the DNA before sequencing, and (c) to adhere other sequencing instrument specific adaptors. After barcode ligation the DNA was amplified via PCR, and then purified and quantified before sequencing. The sequencing was done using a NovaSeq 6000 Sequencer (Illumina, San Diego, California, USA), at the University of Wisconsin-Madison Biotechnology Center (UWBC, https://www. biotech.wisc.edu/).
GBS sequence analysis and SNP discovery – To perform the sequencing, reads having low-quality, unidentified bases, and parts of the adapter sequences were eliminated using computational pipelines developed at the Bioinformatics Resource Center (BRC), which is part of the Advanced Genome Analysis Resource unit of the UWBC. The software Skewer (Jiang et al., 2014) was applied to pre-process the raw fastq data files, removing remaining sequences of adapters and/or primers from reads used for sequencing. The reads found to be trimmed too short with poor quality, and those that did not meet minimum sequence length of 150 bp were discarded from downstream analysis. This process recovers true targeted DNA sequences. Reads were then checked for quality using FastQC; each sequence was recognized by its unique barcode adapter (https://sourceforge.net/projects/gbsbarcode/). Raw sequences were processed using Tassel 3.0 and Universal Network Enabled Analysis Kit (UNEAK) pipelines (Lu et al., 2013) for SNP discovery.
The bioinformatics computational steps were done on the Unix platform Zcluster at the UWBC. The raw reads were analyzed for quality and sequences were trimmed from the 3’ end of the reads until reaching a Phred quality of 20 (Ewing et al. 1998). These were de-multiplexed using GBSSeqToTagDBPlugin, a step in Tassel GBS Pipeline, generating high-quality reads for each genotype. Genebank accession number GCA_000226075.1 was used as the Reference Genome (Massa et al. 2011) and reads were mapped using Bowtie 2 alignment software (Langmead and Salzberg 2012). SAMtools version 1.3.1 (Li et al., 2009) was used for analyzing and manipulating alignments. A SNP master matrix for each plant collected in the 25 S. jamesii populations was created after the pipeline for SNP discovery and genotype calling using TASSEL reference-based GBS analysis pipeline for SNP calling (Glaubitz et al. 2014). The pipeline for SNP discovery followed standard conditions of depth coverage (minDP≥2), maximum mismatch for alignment (n=3), sufficient data for imputation (i.e. SNP sites with data for >=80% of samples), and minimum Minor Allele Frequency (MAF≥0.05).
Data Analysis
To determine levels of genetic diversity in the populations, SNP markers were used to calculate genetic relationships among populations and standard estimates of population genetic structure using GenAlex software version 6.5 (Peakall and Smouse, 2006). These include levels of polymorphism, heterozygosity, and number of unique alleles. Principal Component (PC) analyses were also generated with GenAlex to present visual representations of the genetic diversity and differentiation among populations. The purpose of PC was to organize the populations in a space of reduced dimensionality while keeping their genetic identities and divergence, especially in this study where a very large number of individual plants were genotyped.
An analysis to determine patterns in the distribution of genetic diversity among the 25 populations was done with the software STRUCTURE 2.3.4 (Pritchard et al., 2000). Hence, a Bayesian approach was used to estimate posterior probabilities to support whether individuals belong to a given K source population by placing individuals into clusters sharing similar patterns of variation. This was done for all the individual plants collected from each population at the sampling sites. An admixture model was assumed for the analyses. The burn-in generation and the MCMC were set to 250,000, with 20 iterations. Delta K, the optimal number of clusters (K) for the sample set, was estimated using the Evanno method (Evanno et al., 2005; Earl and Vonholdt, 2012) through STRUCTURE Harvester.
The probability of sampling from the same ramet (“P-clone”) in a population was roughly estimated by tallying the number of individual stems sampled (plants) that had a coefficient of genetic similarity > 0.975 estimated as the complement of Nei’s Genetic Distance coefficient (Nei 1972) and dividing by the total number of stems sampled. Having maximized the distance between samples during collection, this estimate roughly indicates the degree to which cloning (tubering) demographically structures each population.