Skip to main content

Relationship between genome-wide and MHC class I and II genetic diversity and complementarity in a nonhuman primate

Cite this dataset

Petersen, Rachel; Bergey, Christina; Roos, Christian; Higham, James (2022). Relationship between genome-wide and MHC class I and II genetic diversity and complementarity in a nonhuman primate [Dataset]. Dryad.


Although mate choice is expected to favor partners with advantageous genetic properties, the relative importance of genome-wide characteristics, such as overall heterozygosity or kinship, versus specific loci, is unknown. To disentangle genome-wide and locus-specific targets of mate choice, we must first understand congruence in global and local variation within the same individual. This study compares genetic diversity, both absolute and relative to other individuals (e.g., complementarity), assessed across the genome to that found at the major histocompatibility complex (MHC), a hyper-variable gene family integral to immune system function and implicated in mate choice across species. Using DNA from 22 captive olive baboons (Papio anubis), we conducted double digest restriction-site associated DNA sequencing to estimate genome-wide heterozygosity and kinship and sequenced two class I and two class II MHC loci. We found that genome-wide diversity was not associated with MHC diversity, and that diversity at class I MHC loci was not correlated with diversity at class II loci. Additionally, kinship was a significant predictor of the number of MHC alleles shared between dyads at class II loci. Our results provide further evidence of the strong selective pressures maintaining genetic diversity at the MHC in comparison to other randomly selected sites throughout the genome. Furthermore, our results indicate that class II MHC disassortative mate choice may mediate inbreeding avoidance in this population. Our study suggests that mate choice favoring genome-wide genetic diversity is not always synonymous with mate choice favoring MHC diversity, and highlights the importance of controlling for kinship when investigating MHC-associated mate choice.


Dataset generation

Sampling and DNA extraction

We obtained whole blood samples from 22 olive baboons (7 males, 15 females) living at Le Centre National de la Recherche Scientifique Station de Primatologie (CNRS SdP) in Rousset-sur-Arc, France. We extracted DNA using either the Qiagen QIAamp DNA mini kit (N=7; 4 females and 3 males) or GEN-IAL First-DNA All tissue kit (N=15; 11 females and 4 males) following the manufacturer’s instructions. 

Genome-wide SNP genotyping

We prepared ddRAD sequencing libraries following Peterson et al. (2012). We digested 1µg of DNA using 10 units of two restriction enzymes (SphI and MluCI), and excluded fragments outside of a 185 ± 19 bp target window using the automated Blue Pippin System and 2% Agarose Gel Cassettes. We ligated Illumina platform adapters customized for enzyme cut sites and cleaned products using AMPure XP beads. We individually indexed samples using NEBNext Multiplex Oligos for Illumina sequencing, conducted a second round of size selection, and sequenced the libraries on one lane of the Illumina HiSeq 2500 with 150 bp paired-end reads (comprehensive methods available in the supplementary materials).

Bioinformatic analyses

Genome-wide SNP genotyping

We cleaned and filtered ddRAD sequences using the program STACKS v.2 (Rochette, Rivera‐Colón, & Catchen, 2019), excluding reads with low-quality scores or without both enzyme cut sites. We mapped the filtered reads to the Papio anubis reference genome (NCBI Panu v. 3.0; accession no GCA_000264685.2) using the Burrows-Wheeler Aligner “mem” algorithm with default parameters (Li, 2013). We performed shared SNP calling using the STACKS reference mapping pipeline (Rochette et al., 2019), requiring loci to be present in at least 80% of individuals to be included in the SNP catalog. To generate a more accurate estimate of genome-wide heterozygosity for each individual, we excluded SNPs in strong linkage disequilibrium (LD) by scanning sequences in a sliding window and removing at random SNPs with a probability of LD (r2) greater than or equal to 0.5 (--indep-pairwise 50 5 0.5 in PLINK; Purcell et al., 2007). These sequence reads have been submitted to NCBI’s Sequence Read Archive (SRA). To assess whether our study population displays the genetic signatures of a bottleneck or inbreeding, both of which may influence genetic diversity (Groombridge et al. 2000, Wisely et al. 2002), we compared the genetic diversity present in our study population to that observed in wild populations by merging our SNP dataset with two previously published Papio datasets (Bergey, 2015; Rogers et al., 2019). Once again, we pruned sites in linkage disequilibrium and estimated individual heterozygosity in PLINK (Purcell et al., 2007).

MHC sequencing

We filtered MHC sequences for quality using Trimmomatic (Settings: Leading:28, Trailing:28, Window:4:25, MinLen: 150) (Bolger, Lohse, & Usadel, 2014) and merged forward and reverse reads using PandaSeq v.2.11 (Masella, Bartram, Truszkowski, Brown, & Neufeld, 2012).  We mapped reads to known olive baboon MHC-A, B, DQA, and DRB loci taken from the IPD-MHC database (, split by contig, and filtered for target length of each amplicon. To identify PCR and sequencing artifacts, we performed rare K-mer filtering, using a scanning window of 15bp, and excluded sequences containing 12 K-mers in a row falling beneath 15% of the median read coverage. We then collapsed identical sequences and kept only sequences comprising, at minimum, 5% of the total number of filtered reads for that locus (Barbian et al., 2018).  Additionally, we kept sequences that had a copy number >1000 that were also present at >5% of total copy number in another individual. We chose this threshold because most loci displayed a natural 10 to 100-fold drop in sequence copy number when comparing the last sequence present at >1000 copies versus the first sequence present at <1000 copies (Supplementary figure S1). Within MHC-DRB, the most thoroughly characterized loci of the four studied here, 26 out of the 27 retained sequences had previously been identified in other studies, while those falling below the 5% copy number threshold had not been previously identified, suggesting that reads falling below our filtering criteria were likely generated from sequencing error. After filtering, we aligned the retained reads to all known Papio anubis MHC sequences using BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990) and identified previously known sequences as those matching published sequences with a 100% identity and 0 gaps. Sequences possessing <100% identity or >0 gaps were distinguished as new MHC alleles.  These sequence data have been submitted to GenBank.

MHC supertype analyses

We defined MHC supertypes using the physiochemical properties of amino acids located at positively selected sites (PSS). We identified PSS by comparing rates of synonymous (dS) to non-synonymous (dN) nucleotide substitutions in protein-coding regions using methods described by Goodswen et al. (2018). The dN/dS method was originally proposed to identify selective pressures in distantly diverged sequences of independent lineages and is less sensitive to identifying selective pressures in sequences derived from individuals within the same population (Kimura, 1979; Kryazhimskiy & Plotkin, 2008). However, within the MHC gene family, amino acid polymorphisms are most often found at locations involved in antigen binding, and thus this measure of positive selection provides insight into the location of functionally important amino acid sites (Bjorkman et al., 1987; Brown et al., 1988).

We determined open reading frames by performing a BLAST alignment to the IPD-MHC database. We translated sequences in R using the package ‘seqinr’ v4.2.8 (Charif & Lobry, 2007) and performed multiple protein sequence alignment in MAFFT v7 (Katoh & Standley, 2013).  We converted protein alignments into codon alignments using PAL2NAL v14 (Suyama, Torrents, & Bork, 2006) and constructed a maximum likelihood tree of the alignments using RAxML v2 (Stamatakis, 2014) and a generalized time reversible (GTR) GAMMA substitution model, with the best-scoring tree selected using 100 bootstrap iterations. We then computed substitution rate ratios (dN/dS) by inputting the PAL2NAL codon alignment and RAxML tree into the CODEML program of the PAML v4.9 package (Yang, 2007). This software identifies statistically significant PSS using the Bayes Empirical Bayes (BEB) analysis computed under NSsite model 8 (Yang, Wong, & Nielsen, 2005).

Next, we aligned the amino acids associated with each PSS and described the physiochemical properties of each site in the form of five z-descriptors: z1 (hydrophobicity), z2 (steric bulk), z3 (polarity), z4, and z5 (electronic effects) (Sandberg, Eriksson, Jonsson, Sjöström, & Wold, 1998).  We compiled a mathematical matrix containing the five z-scores of each PSS of each allele and performed an agglomerative hierarchical clustering analysis using Euclidian distance and the average linkage method with the R function ‘hclust’ in the ‘stats’ package v4.1.2 (R Core Team, 2021). We used the R package ‘dynamicTreeCut’ v1.63.1 (Langfelder, Zhang, & Horvath, 2014) to identify significant clusters, while specifying a minimum cluster size of 2 (Greenbaum et al., 2011). These methods for determining MHC supertypes have been shown to identify biologically relevant variation in MHC allele functionality in both human and non-human primate studies (Lund et al., 2004; Schwensow, Fietz, Dausmann, & Sommer, 2007).

Usage notes



National Science Foundation, Award: 1826804

Sigma Xi

NYU Intramural Funds