Data From: what mandrills leave behind: using fecal samples to characterize the major histocompatibility complex in a threatened primate
Data files
Jan 23, 2024 version files 461.47 KB
- 2018_seqfreqs.csv (26.58 KB)
- 2021_seqfreqs.csv (26.20 KB)
- all_identified_alleles.fas (10.83 KB)
- consensus_alleles.csv (30.67 KB)
- consensus_fasta_files.zip (71.68 KB)
- DOC_assignments_per_run.zip (286.93 KB)
- README.md (8.57 KB)
Abstract
The major histocompatibility complex (MHC) can be useful in guiding conservation planning because of its influence on immunity, fitness, and reproductive ecology in vertebrates. The mandrill (Mandrillus sphinx) is a threatened primate endemic to central Africa. Considerable research in this species has shown that the MHC is important for disease resistance, mate choice, and reproductive success. However, all previous MHC research in mandrills has focused on an inbred semi-captive population, so their genetic diversity may have been underestimated. Here we expand our current knowledge of mandrill MHC variation by performing next-generation sequencing of non-invasively collected fecal samples from a large wild horde in central Gabon. We observe MHC lineages and alleles shared with other primates, and we uncover 45 putative new class II MHC DRB alleles, including representatives of the DRB9 pseudogene, which has not previously been identified in mandrills. We also document methodological challenges associated with fecal samples in NGS-based MHC research. Even with high read depth, the replicability of alleles from fecal samples was lower than that of tissue samples, and allele assignments are inconsistent between sample types. Further, the common assumption that variants with very high read depth should represent true alleles does not appear to be reliable for fecal samples. Nevertheless, the use of degraded DNA in the present study still enabled significant progress in quantifying immunogenetic diversity and its evolution in wild primates.
README: Data From: What mandrills leave behind: using fecal samples to characterize the major histocompatibility complex in a threatened primate
Jan. 12, 2024
One aim of this study was to quantify the replicability of Illumina sequencing results from fecal samples and from higher-quality samples of blood and plucked hair. Samples of both types were sequenced in at least two independent Illumina runs, and an allele assignment was determined for each using the degree of change (DOC) method (Lighten et al. 2014). For each sample, we then calculated the proportion of sequence variants that replicated between runs as a replicability score (RA), and compared average RA scores for feces vs. blood/hair. The data used for these tests are contained in the folder 'DOC_assignments_per_run'. We also tested whether replicability is related to a sequence's depth or its rank within the amplicon. These data are also available in this package.
A second aim of this study was to characterize functional genetic diversity in the wild mandrill population and to assess the presence of trans-species polymorphism. To that end, we generated consensus MHC-DRB allele assignments for 181 mandrills from Lopé National Park (172 fecal samples and 9 samples of blood and plucked hair). Each sample was sequenced in two independent Illumina runs, and consensus assignments were determined by extracting variants that appeared in both runs and then screening those variants for sequencing error. The allele assignments and other supporting data are contained here. These can be used to replicate the analyses of genetic diversity from the manuscript: creating a phylogeny, identifying supertypes and sites under positive selection, and comparisons with other primate species.
Description of the Data and file structure
FOLDER 'consensus_fasta_files'
This folder contains fasta files of each mandrill's putative consensus MHC allele assignments determined in the study (generated using the script 'generate_consensus_MHC_assignments.py', followed by manual error screening). Fecal samples (n=172) all have the prefix "LP" followed by a 3-digit code (e.g., LP596.fas). Blood/hair samples (n=9) have a 2-character code representing the individual animal, then a hyphen and three letters (hmb, for example) showing the sample type (h = hair, b = blood), sex (m = male, f = female), and age class (b = reproductive adult), e.g., NG-hmb.fas.
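The naming convention above can be decoded programmatically. The following is a minimal sketch, not part of the deposited scripts: the function name `parse_sample_filename` and the returned dictionary keys are hypothetical, and only the letter codes listed in this README are mapped. Any other letter is returned unchanged.

```python
import re

# Letter codes taken from the README; unlisted letters are passed through as-is.
SAMPLE_TYPES = {"h": "hair", "b": "blood"}
SEXES = {"m": "male", "f": "female"}
AGE_CLASSES = {"b": "reproductive adult"}

# Fecal samples: "LP" followed by a 3-digit code, e.g. LP596.fas
FECAL_RE = re.compile(r"^LP(\d{3})\.fas$")
# Blood/hair samples: 2-character animal code, hyphen, three letters, e.g. NG-hmb.fas
TISSUE_RE = re.compile(r"^([A-Za-z0-9]{2})-([a-z])([a-z])([a-z])\.fas$")


def parse_sample_filename(name):
    """Return a dict describing the sample encoded in a consensus fasta file name."""
    match = FECAL_RE.match(name)
    if match:
        return {"sample_type": "feces", "id": "LP" + match.group(1)}
    match = TISSUE_RE.match(name)
    if match:
        animal, sample_type, sex, age = match.groups()
        return {
            "sample_type": SAMPLE_TYPES.get(sample_type, sample_type),
            "id": animal,
            "sex": SEXES.get(sex, sex),
            "age_class": AGE_CLASSES.get(age, age),
        }
    raise ValueError(f"unrecognised sample file name: {name!r}")
```

For example, `parse_sample_filename("NG-hmb.fas")` describes a hair sample from a male reproductive adult with the animal code NG.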
FOLDER 'DOC_assignments_per_run'
This folder contains:
1. Subfolders '2018MiSeq', '2019MiSeqNano', and '2021MiSeq'. These hold the results of the degree of change (DOC) method of allele assignment, in fasta format, for each sample in each Illumina run. These data were used to generate replicability scores (RA) by passing them to a custom Python script ('MHC_replicability_test.py'), which produced the .csv files described in point 2 below.
2. Three csv files ending in 'rep-test-out.csv' (generated using the script 'MHC_replicability_test.py'). These contain the results of the replicability tests described above, for each pair of Illumina runs. Data are organized in rows, where each row is a sample. Columns are as follows:
sampleID sample used for replicability test
#sharedalleles number of DOC-identified alleles appearing in both runs for this sample
#2018alleles number of DOC-identified alleles appearing in the 2018 run for this sample
#2021alleles number of DOC-identified alleles appearing in the 2021 run for this sample
replicability.score proportion of alleles that replicated (#sharedalleles divided by (#2018alleles+#2021alleles-#sharedalleles))
2018shared.freq average frequency of alleles in the 2018 run that also appeared in the 2021 run ([variant depth in 2018/amplicon depth in 2018]*100). NA if no replicated variants
2018shared.sings the number of alleles in the 2018 run that appear in only one animal, but are replicated for this sample in the 2021 run
2021shared.freq average frequency of alleles in the 2021 run that also appeared in the 2018 run ([variant depth in 2021/amplicon depth in 2021]*100). NA if no replicated variants
2021shared.sings the number of alleles in the 2021 run that appear in only one animal, but are replicated for this sample in the 2018 run
2018only.freq average frequency of alleles identified for this sample that only appeared in the 2018 run ([variant depth in 2018/amplicon depth in 2018]*100)
2018only.sings the number of alleles identified for this sample in the 2018 run that are unique to this sample (i.e., only detected in one animal, not replicated between sequencing runs)
2021only.freq average frequency of alleles identified for this sample that only appeared in the 2021 run ([variant depth in 2021/amplicon depth in 2021]*100)
2021only.sings the number of alleles identified for this sample in the 2021 run that are unique to this sample (i.e., only detected in one animal, not replicated between sequencing runs)
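The replicability.score column is the number of shared alleles divided by the number of distinct alleles across both runs. The sketch below illustrates that arithmetic only and is not the code in 'MHC_replicability_test.py'. The function name `replicability_score` is hypothetical, alleles are assumed to be compared as sequence strings, and returning `None` when neither run has any alleles is an assumption.

```python
def replicability_score(run_a_alleles, run_b_alleles):
    """Proportion of a sample's alleles that replicated between two runs.

    Implements shared / (n_a + n_b - shared), the formula given for the
    replicability.score column.
    """
    a, b = set(run_a_alleles), set(run_b_alleles)
    shared = len(a & b)
    total = len(a) + len(b) - shared  # distinct alleles seen in either run
    if total == 0:
        return None  # assumption: score is undefined when no alleles were called
    return shared / total


# Toy example: 3 alleles per run, 2 of them shared, so 2 / (3 + 3 - 2).
score = replicability_score(["AAA", "CCC", "GGG"], ["AAA", "CCC", "TTT"])
```

A score of 1 means both runs returned an identical allele set for the sample, and 0 means no allele was seen in both runs.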
FILES '2021_seqfreqs.csv' and '2018_seqfreqs.csv' (generated using the script 'allelefrequency_repstatus_per_run.py').
These files can be used to recreate the logistic regression analyses in the manuscript that tested whether a variant's replicability is related to its depth or its rank within the amplicon. The two files correspond to the 2021 and 2018 MiSeq sequencing runs, respectively, and each lists all DOC-identified alleles present in each sample in the relevant run. Column headings are as follows:
sampleID the sample containing the variant. Each sample will have multiple rows, one for each of its DOC-identified alleles
allele name of a variant in the sample (may not match variant names from fasta files--recommend to disregard this column)
frequency frequency of the variant within the amplicon ([variant's depth/amplicon depth]*100)
replication a binary value indicating whether or not the allele replicated in the other sequencing run
rank rank of the variant, when ranked by depth (ranked from high to low, 1....n, where n=number of alleles possessed by the sample)
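To make the frequency and rank columns concrete, here is a minimal sketch of how they follow from read depths. It is not the code in 'allelefrequency_repstatus_per_run.py'. The function name `frequency_and_rank` and its inputs are hypothetical, and the handling of ties (input order is kept) is an assumption.

```python
def frequency_and_rank(variant_depths, amplicon_depth):
    """Compute within-amplicon frequency (%) and depth rank for one sample.

    variant_depths: dict mapping variant name -> read depth in this sample
    amplicon_depth: total read depth of the sample's amplicon
    """
    # Rank 1 is the deepest variant; sorted() is stable, so ties keep input order.
    ordered = sorted(variant_depths.items(), key=lambda item: item[1], reverse=True)
    rows = []
    for rank, (name, depth) in enumerate(ordered, start=1):
        rows.append({
            "allele": name,
            "frequency": depth * 100 / amplicon_depth,  # (depth / amplicon depth) * 100
            "rank": rank,
        })
    return rows


# Toy depths for a single sample with an amplicon depth of 1000 reads.
rows = frequency_and_rank({"v2": 250, "v1": 500, "v3": 125}, 1000)
```

In this toy sample, the variant with 500 reads has a frequency of 50 and rank 1.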
FILE 'all_identified_alleles.fas' is a fasta file of all alleles identified in the focal population. These can be used to recreate the analyses of genetic diversity from the manuscript: creating a phylogeny, identifying supertypes and sites under positive selection, and comparisons with other primate species.
FILE 'consensus_alleles.csv' is a master file containing all alleles and all samples with allele assignments. Generally, samples are in columns and alleles are in rows, with some exceptions. The first three rows contain data about each sample (amplicon depth, depth of all alleles, and number of alleles identified).
Depths listed are from the 2021 Illumina run. The first six columns have the following headings:
SEQUENCE allele sequence
LENGTH allele length
SAMPLES number of samples possessing the allele
Clade clade 1 (putative loci DRB1-6) or clade 2 (putative DRB9 locus)
ST supertype, designated 1-6, representing predicted functional group based on physicochemical properties of the sequence's putative antigen-binding amino acid sites
Allele Name designated allele name
All subsequent columns represent individual samples.
SCRIPT 'allelefrequency_repstatus_per_run.py' Python code used to compare fasta files of the same samples in two Illumina runs. For each run, and for all variants in all samples, the script extracts 1) the within-amplicon frequency of the variant, 2) the variant's rank within the amplicon, and 3) whether the variant replicated. It produced the files '2018_seqfreqs.csv' and '2021_seqfreqs.csv'.
SCRIPT 'generate_consensus_MHC_assignments.py' Python code used to extract each sample's sequence variants that replicated between Illumina runs. This was applied to two folders of AmpliSAT-generated fasta files. The resulting variants were then aligned in MEGA-X and visually screened for sequencing error. After removal of artificial sequences, the resulting fasta files were stored in the 'consensus_fasta_files' folder.
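The extraction step described for this script amounts to intersecting the variants a sample shows in two runs. The sketch below is an illustration under stated assumptions, not the deposited script: the function names `parse_fasta` and `replicated_variants` are hypothetical, and variants are assumed to be matched by sequence rather than by name, since variant names may differ between files. The manual error screening in MEGA-X is not represented.

```python
def parse_fasta(text):
    """Parse fasta-formatted text into a dict of {sequence: header}."""
    records = {}
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records["".join(chunks).upper()] = header
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:  # flush the final record
        records["".join(chunks).upper()] = header
    return records


def replicated_variants(fasta_run1, fasta_run2):
    """Return the sorted sequences present in both runs for one sample."""
    run1 = parse_fasta(fasta_run1)
    run2 = parse_fasta(fasta_run2)
    return sorted(set(run1) & set(run2))


# Toy fasta text for one sample in two runs; the names deliberately differ.
run1_text = ">var1\nACGTAC\n>var2\nGGGTTT\n"
run2_text = ">a\nacgtac\n>b\nCCCAAA\n"
shared = replicated_variants(run1_text, run2_text)
```

To apply this to files on disk, the text of each fasta file could be read first, for example with `pathlib.Path(path).read_text()`.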
SCRIPT 'MHC_replicability_test.py' Python code to calculate replicability scores (RA) from pairs of AmpliSAT-generated fasta files. For each sample, the output shows the number of variants that replicated between runs, the number of variants appearing in each sequencing run, the number of singletons, and the average frequencies of these variants. This code produced the 'rep-test-out.csv' files described above.
Sharing/access Information
Raw NGS data may be uploaded to the GenBank SRA in the future. This document will be updated to reflect this.
Microsatellite data used in the study are forthcoming in a separate Dryad upload.
Methods
Further details of data analysis can be found in the publication manuscript. In brief, samples were collected from a wild population of mandrills (Mandrillus sphinx) in Lopé National Park, Gabon. The majority of the samples are non-invasively collected feces, but samples of blood and plucked hair were also collected from anesthetized animals. Each sample was PCR-amplified for a 157-base fragment of the DRB gene of the class II major histocompatibility complex, then sequenced on an Illumina MiSeq. An initial MiSeq run was performed in 2018, containing 192 fecal samples. In 2019, a MiSeq Nano run was performed, containing the 24 samples of blood and plucked hair along with 23 replicate fecal samples from the 2018 sequencing run. In 2021, the third sequencing run was performed using standard MiSeq, and this run included nine replicates of the blood and hair samples and 183 replicated fecal samples.
To quantify repeatability between runs, each Illumina run was processed separately. Pre-processing was performed using the AmpliSAT pipeline (Sebastian et al. 2016), and alleles were assigned to each sample in each run using the degree of change method (Lighten et al. 2014). For samples that were replicated between runs, a replicability score (RA) was calculated based on the proportion of variants that appeared in both runs. A custom Python script was used to compare allele assignments of samples in each pair of Illumina runs.
To generate consensus allele assignments for each sample, each Illumina run was reanalyzed in the AmpliSAT pipeline, using more relaxed parameters in order to minimize the chance of allelic dropout. Then, a custom Python script was used to extract sequence variants that replicated between runs for each sample. This set of replicated variants for each sample was then screened for errors such as replicated chimeric sequences. After discarding such sequence artifacts, the set of variants that remained was considered the individual's "true" allele assignment.
Usage notes
Most files can be opened using an open source text editor such as jEdit or Notepad++. The .csv files will be easiest to view in a spreadsheet program such as Excel or Google Sheets, but they can also be opened in a text editor if needed. Users may wish to open fasta files using genetics software such as MEGA-X.