Skip to main content

Savanna monkey (Chlorocebus spp.) population genetics/genomics pipeline

Cite this dataset

Schmitt, Christopher et al. (2022). Savanna monkey (Chlorocebus spp.) population genetics/genomics pipeline [Dataset]. Dryad.


In the last 300 thousand years, the genus Chlorocebus expanded from equatorial Africa into the southernmost latitudes of the continent, where colder climate was a likely driver of natural selection. We investigated population-level genetic variation in the mitochondrial uncoupling protein 1 (UCP1) gene region—implicated in non-shivering thermogenesis (NST)— in 73 wild savanna monkeys from three taxa representing this southern expansion (Chlorocebus pygerythrus hilgerti, Chlorocebus cynosuros and Chlorocebus pygerythrus pygerythrus) ranging from Kenya to South Africa. We found 17 SNPs with extended haplotype homozygosity consistent with positive selective sweeps, 10 of which show no significant linkage disequilibrium with each other. Phylogenetic generalized least-squares modelling with ecological covariates suggests that most derived allele frequencies are significantly associated with solar irradiance and winter precipitation, rather than overall low temperatures. This selection and association with irradiance are demonstrated by a relatively isolated population in the southern coastal belt of South Africa. We suggest that sunbathing behaviours common to savanna monkeys, in combination with the strength of solar irradiance, may mediate adaptations to thermal stress via NST among savanna monkeys. The variants we discovered all lie in non-coding regions, some with previously documented regulatory functions, calling for further validation and research.


Study Populations

Data for this project were generated by the International Vervet Research Consortium from a sample of 163 savanna monkeys (Chlorocebus spp.) from 6 closely related taxa, captured as part of a larger sampling effort in 11 countries across Africa and the Caribbean, with particularly extensive sampling in South Africa (Jasinska et al., 2013; Svardal et al., 2017; Turner et al., 2019). We focused on 73 individuals from 3 vervet taxa: Chlorocebus pygerythrus hilgerti, Chlorocebus cynosuros, and Chlorocebus pygerythrus pygerythrus. In this paper, we adhere roughly to taxonomic distinctions made by Groves (2001), although recent nuclear and mitochondrial genomic evidence both suggest a relatively deep division suggesting a species-level distinction between C. p. hilgerti and C. p. pygerythrus (Svardal et al., 2017; Dolotovskaya et al., 2017), while evidence of gene flow between these two taxa and C. cynosuros suggest that the latter is nested within the larger C. pygerythrus clade (as first proposed by Dandelot, 1959). A taxonomic revision may be in order for the genus (Svardal et al., 2017), but is not within the scope of this paper. Each taxon was further subdivided into distinct populations, when possible, based on inferences from the whole-genome phylogeny (see below). 

Sequence Data

Whole genome sequences were previously generated by collaborators at the McDonnell Genome Institute at Washington University in St. Louis (Warren et al., 2015; Svardal et al., 2017), and analyzed in this study using a publicly available variant call format (VCF) file generated at the Gregor Mendel Institute (Svardal et al., 2017). We used tabix (Li, 2011) in HTSlib v1.10.2 and vcftools (Danecek et al., 2011) to isolate a ~28 kb gene region around UCP1 (ChlSab1.1 positions 7:87,492,195-87,502,665), including 10 kb upstream and downstream of the coding region itself to capture potential cis-regulatory regions. 

Whole-Genome Phylogeny

We used SNPRelate v. 1.24.0 (Zheng et al., 2012) to generate FASTA alignments from the VCF file, and phangorn (v2.5.5; Schliep, 2011), phytools (v0.6-99; Revell, 2012), and geiger (v2.0.6.4; Pennell et al., 2014) to construct a neighbor-joining tree with a Jukes-Cantor mutation model representing whole-genome phylogenetic relationships for all 163 individuals in our original sample. We pruned this phylogeny to drop tips not represented in the study population sample.

Population Structure

We used discriminant analyses of principal components (DAPCs) in adegenet v.2.1.2 (Jombart, 2008; Jombart & Ahmed, 2011) and calculated the Fixation Index (FST) using hierfstat v.0.04-22 (Goudet, 2005) to model population structure. To statistically validate the patterns noted, we ran analyses of molecular variance (AMOVA) in poppr v. 2.8.6 (Kanvar et al., 2014). In both DAPC and FST analyses we used sample population as a grouping variable, while in AMOVA we used population nested within taxon. In the FST analysis, we used the Nei87 setting, based on Nei’s (1987) method, to assess genetic distance between populations. We also ran principal component and admixture analysis using LEA v. 3.0.0 (Frichot & François, 2015), using the snmf function with standard settings to visualize entropy criterion values to define K levels of population differentiation. To model isolation by distance, we assessed the correlation between genetic and geographic distance matrices in our sample using Mantel tests implemented in vegan v. 2.5-6, using the Pearson method (Oksanen et al., 2019). We constructed our genetic distance matrix in poppr, also using Nei’s method (Nei, 1978), and our geographic distance matrix from GPS points associated with trapping locations using raster v. 3.4-5 (Hijmans & van Etten, 2012).

Assessing Selection

We calculated Hardy Weinberg Equilibrium (HWE) values with pegas (v. 0.12; Paradis, 2010) to identify candidate loci potentially experiencing selective forces, both within the whole southern expansion and in local populations. We assessed linkage disequilibrium (LD) with gpart v.1.6.0 (Kim and Yoo, 2020), using standard BigLD setting and the r2 method (to avoid the “ceiling effect” common to using D’ in small samples; Marroni et al., 2011). SNPs considered to be in LD with each other (r2 > 0.85) were subsumed into a single representative locus for downstream analyses. We calculated site-frequency spectrum statistics for the whole gene region including Tajima’s D, and Fu and Li’s D* and F* (PopGenome v.2.7.5; Pfeifer et al., 2014, pegas v.0.12; Paradis, 2010), as well as a sliding window Tajima’s D in vcftools with a window size of 500 bp. We compared UCP1 regional values of Tajima’s D and Fu and Li’s D* and F* to those from a sample of 1000 random, non-overlapping regions of equivalent size from vervet chromosome 7 to assess relative significance.

To calculate integrated haplotype scores (iHS) and assess extended haplotype homozygosity (EHH) both across the whole gene region and for selected UCP1 loci, respectively, we inferred the ancestral allele sequence for our sample population using the program Est-sfs v.2.03 using Kimura’s mutation model (Keightley & Jackson, 2018). We chose the rhesus macaque reference (Macaca mulatta; BCM Mmul_8.0.1/rheMac8) as our outgroup, which we downloaded using biomaRt v. 2.44.4 (Durnick et al., 2009; Durnick et al., 2005). We aligned it to the vervet reference genome (Chlorocebus sabaeus; Chlorocebus_sabeus 1.1/chlSab2) using rMSA v. 0.99.0 (Hahsler & Manguy, 2020) and MAFFT v. 7.467 (Katoh & Standley, 2013) using standard settings and trimmed the macaque reference to the vervet extent visually in JalView v. (Waterhouse et al., 2009). Ancestral alleles were assigned to the major vervet allele when the probability assigned by Est-sfs was above 0.70, and to the minor allele if below 0.30. When the probability was between these benchmarks, we assigned the ancestral allele to the allele shared by the two outgroups if it matched the population allele; when there was no concordance between the two outgroups (n = 5), we chose the Chlorocebus reference allele. We used rehh v.3.0.1 (Gautier et al., 2017) to estimate iHS and EHH using standard settings, but with a frequency bin of 0.15 to calculate iHS. We used a significance threshold of 2 for absolute iHS values (Voight et al., 2006) with a window size of 3000 bp and an overlap of 300 bp. We used a significance threshold value of 1.3 for considering individual SNPs for inclusion in further analyses.

Ecological Covariates

We used GPS coordinates recorded at each trapping location to download altitude, annual mean temperature, mean temperature of the coldest month, winter precipitation levels, and mean temperature of the wet season, among other ecological covariates, for each population for the 10-year period from 2005-2010 from the WorldClim2 online database (Fick & Hijmans, 2017). We also collected data on mean annual solar irradiance (measured in MJ/m2/day) for these points from the 10-year period from 2005-2010, originally generated by the NASA Langley Research Center (LaRC) POWER Project funded through the NASA Earth Science/Applied Science Program, using nasapower 3.0.1 (Sparks, 2018). We standardized all covariates using z-scores, and then used the package PerfomanceAnalytics v. 2.0.4 (Peterson et al., 2020) to reduce strongly correlated covariates.

Modeling Allele Frequency by Ecological Covariates

We used phylogenetic generalized least squares (PGLS) regression, implemented in the package nlme v. 3.1-150 (Pinheiro et al., 2021), to model variation in derived allele frequency of each target locus by geoclimatic variables including latitude, elevation, insolation/irradiation, annual mean temperature, mean temperature of the coldest month, and mean winter temperature. We incorporated our phylogenomic tree into our models using a Brownian correlation structure to account for average genetic distance across populations. We then used an information theoretic approach (Burnham & Anderson, 2002) to assess ecological covariate inclusion for each locus putatively experiencing selection, using the lowest Akaike Information Criterion modified for small sample sizes (AICc) to select the appropriate model.

Usage notes

All programs and software necessary for analysis of these datasets are free and open source, primarily used in R or UNIX/LINUX environments. Instructions for download and use are in the associated R Markdown code, and may also be followed using the associated HTML file knit from that Markdown code.


Boston University

National Institutes of Health, Award: R01RR0163009

Office of the Director, Award: R01OD010980

National Center for Advancing Translational Sciences, Award: R01OD010980/OD/NIH

National Center for Advancing Translational Sciences, Award: R01RR016300/RR/NCRR