Skip to main content
Dryad logo

16S V4 raw read count data; 16S reads metadata; new MHC class II allele sequences

Citation

Grieves, Leanne (2021), 16S V4 raw read count data; 16S reads metadata; new MHC class II allele sequences, Dryad, Dataset, https://doi.org/10.5061/dryad.4f4qrfj9t

Abstract

Pathogen-mediated selection at the major histocompatibility complex (MHC) is thought to promote MHC-based mate choice in vertebrates. Mounting evidence implicates odour in conveying MHC genotype, but the underlying mechanisms remain uncertain. MHC effects on odour may be mediated by odour-producing symbiotic microbes whose community structure is shaped by MHC genotype. In birds, preen oil is the primary source of body odour and similarity at MHC predicts similarity in preen oil composition. Hypothesizing that this relationship is mediated by symbiotic microbes, we characterized MHC genotype, preen gland microbial communities, and preen oil chemistry of song sparrows (Melospiza melodia). Consistent with the microbial mediation hypothesis, pairwise similarity at MHC predicted similarity in preen gland microbiota. Overall microbial similarity did not predict chemical similarity of preen oil, counter to this hypothesis. However, permutation testing identified a maximally predictive set of microbial taxa that best reflect MHC genotype, and another set of taxa that best predict preen oil chemical composition. The relative strengths of relationships between MHC and microbes, microbes and preen oil, and MHC and preen oil suggest that MHC may affect host odour both directly and indirectly. Thus, birds may assess MHC genotypes based on both host-associated and microbially-mediated odours.

Methods

We extracted bacterial DNA from swabs using DNeasy PowerSoil DNA isolation kits (Qiagen) with some modifications to the manufacturer’s recommended protocol (see supplementary material for detailed protocol). We amplified the V4 region of the bacterial 16S rRNA gene using the universal primers F518 [20] and R806 [21]. In addition to the priming sequences, each primer included an Illumina MiSeq adaptor sequence, four randomized nucleotides, and a unique ‘barcode’ of eight nucleotides. We performed PCR in a total volume of 25 µL, including 10 µL of 5PRIME HotMasterMix (Quantabio), 0.2 µM of each primer, and 2 µL of DNA template (mean concentration = 0.1 ng/µL). The thermocycling profile consisted of 2 min at 94 °C; 35 cycles of 45 s at 94 °C, 60 s at 50 °C and 90 s at 72 °C; and a 10 min final extension at 72 °C. Amplification was confirmed by running samples on a 2% agarose gel.

Sequencing and pipeline

We pooled PCR products into a library and sequenced with 250 nt paired-end reads on an Illumina MiSeq at the London Regional Genomics Centre. We used a pipeline [22] to collapse sequences into clusters of identical reads and assign sequences to individuals. We used a second pipeline [23] and the R package dada2 [24] to overlap reads, remove ambiguous reads, and filter out chimeras and singleton sequences (i.e., those appearing only once in the dataset) and those rarer than 0.1% in any sample. We assigned each unique sequence variant (SV) to bacterial taxon by clustering at ≥ 97% sequence identity (following [22]) using the naïve Bayesian Ribosomal Database Project (RDP) Classifier [24,25].

We used a compositional data analysis approach [26] that examines the read ratios between sequences. Most datasets do not actually contain all possible components; often, small values, including values below the detection limit of an instrument such as the Illumina MiSeq, are rounded off to zero. In such cases, zero counts reflect sampling or equipment limitations [27]. Accordingly, following [23], we used Bayesian-multiplicative replacement to impute values for zero count sequences using the R package zCompositions [27]. We then applied a centred log-ratio (clr) transformation to the zero-replaced data set, which renders the use of Euclidean distances meaningful in subsequent analyses [23,28].

Next, because rare sequences that occur in only a few samples are generally uninformative, and because samples with very low read counts are more likely to represent undersampling, we filtered sequences by the minimum proportion, minimum occurrence, and minimum sample count of reads. Thus, sequences found in fewer than 0.5% of reads, sequences found in fewer than 10% of samples, and samples with fewer than 5000 reads were removed from the initial dataset. The filtered dataset contained 47 SVs from across the 31 samples (Table S1). Detailed methods and quality control procedures are outlined in the supplementary materials.

MHC genetic analysis

We amplified the hypervariable second exon of MHC class II (~ 350 nt) using primers SospMHCint1f [15] and Int2r.1 [29]. In addition to the priming sequences, each primer included an Illumina MiSeq adaptor sequence, four randomized nucleotides, and a unique ‘barcode’ of eight nucleotides. Detailed PCR conditions and MHC sequencing methods are described elsewhere [6]. Briefly, we aligned amino acid sequences in MEGA v. 7.0 [30] and trimmed based on comparison to conspecific sequences in GenBank [31]. Trimming resulted in alleles of 73 – 86 amino acids, corresponding to most of exon 2. Next, we used a pipeline [22] to collapse sequences into clusters of identical reads and assign sequences to individuals, retaining sequences comprising at least 1% of an individual’s reads (mean ± SE retained reads per individual = 20 736 ± 1939).

We assigned each retained sequence to its corresponding protein sequence, removed any putative pseudogenes following [15], and applied Bayesian-multiplicative replacement and clr-transformation to the data to allow comparison to the microbial dataset. In some cases, different DNA sequence reads translated to the same amino acid sequence. For these, we calculated the average log-ratio value so that only unique protein sequences were included in further analysis. Across all 31 birds, we detected 151 unique amino acid alleles (mean ± SE alleles per individual = 16.23 ± 0.61).

Usage Notes

Detailed bacterial DNA extraction protocol

We extracted bacterial DNA from swabs using Qiagen DNeasy PowerSoil DNA isolation kits, with some modifications to the manufacturer’s recommended protocol.

Before starting the protocol, we added an initial saturation step in which we placed the swabs in the PowerBead tubes then bathed the swabs in the bead solution for 10 min before vortexing for 1 min, following [1]. Then, we aseptically removed the swab and proceeded with the protocol instructions by adding solution C1 (sodium dodecyl sulfate, SDS), following [1]. After adding solution C1, we inverted the sample tubes to mix them. Next, we vortexed the samples at maximum speed for 10 min, followed by centrifugation at 10 000 × g for 30 s. We then transferred the supernatant to a clean 2 mL collection tube.

Next, we modified the manufacturer protocol by combining steps 7 and 10, skipping steps 8 and 9. Specifically, we added solutions C2 and C3 (proprietary mixtures that contain inhibitor removal reagents that precipitate non-DNA organic and inorganic materials out of solution) at the same time rather than separately. After adding both solutions C2 and C3 to the collection tubes, we vortexed the tubes briefly, incubated them at 4 °C for 5 min, then centrifuged at 10 000 × g for 3 min. Avoiding the pellet, we then transferred up to 750 µL of the supernatant to a clean 2 mL collection tube.

After shaking thoroughly to mix Solution C4 (a proprietary mixture that is a high concentration salt solution containing guanidine hydrochloride and 2-propanol that precipitates DNA), we added 1200 µL to the supernatant and vortexed for 5 sec. We loaded 675 µL of the solution onto an MB Spin Column, centrifuged at 10 000 × g for 1 min, and discarded the flow through. This step was repeated twice, so that the entire sample was processed in this way.

Then, we added 500 µL of Solution C5 (an ethanol-based wash), centrifuged at 10 000 × g for 30 sec, and discarded the flow through before centrifuging again at 10 000 × g for 1 min. Following this, we placed the MB Spin Column into a clean 2 mL collection tube in a heat block held at 60 °C. Then, we modified the protocol again by adding 60 µL of 1X TE + 0.1 M EDTA (instead of Solution C6, an EDTA-free sterile 10 mM Tris elution buffer) to the centre of the filter membrane, and incubated the samples at 60 °C for 5 min before centrifuging at 10 000 × g for 1 min. Finally, we discarded the MB Spin Column and stored the DNA at -20 °C pending PCR amplification.

All centrifugation steps were carried out at room temperature (20 – 22 °C).

Detailed Methods

Study subjects and sample collection

The data used in this study were part of a larger study (Grieves et al. In Revision) of adult song sparrows captured and sampled at three breeding sites in southern Ontario, Canada: a southwest site on Western University property in London (43.008°N, 81.291°W; hereafter southwest), a central site at the rare Charitable Research Reserve in Cambridge (43.383ºN, 80.357ºW; hereafter central) and a northeast site on land owned by the Queen’s University Biological Station near Newboro (44.633°N, 76.330°W; hereafter northeast).

We captured and sampled 153 song sparrows from these three sites in 2017. Most birds were captured during the early breeding season (southwest: 8 birds [7 males, 1 females] between 2 – 5 May; central: 55 birds [32 males, 23 females] between 3 April – 1 May; northeast: 56 birds [38 males, 18 females] between 8 April – 3 May), although 34 birds (23 males, 11 females) were captured and sampled at the southwest site after the breeding season, between 8 August – 1 September. Of these 153 birds, data from 31 birds were used in the present study: 19 from the southwest site (captured in August 2017) and 12 from the central site (captured in April 2017). We selected this subset of birds for the present study because we had preen gland swabs, preen oil samples, and, critically, MHC genetic data for all of these birds.

Due to overlapping timing of sampling at the three sites, each bird was handled and sampled (see below) by one of three different researchers, R1 – R3. R1 collected breeding and postbreeding samples from the southwest site and breeding samples from the central site; R2 collected breeding samples from the southwest site and breeding samples from the northeast site; and R3 collected breeding samples from the northeast site. Researcher identity was noted for each sample.

Swabbing procedure and storage

We dipped a sterile medical grade cotton swab into sterile molecular grade water then rubbed the swab firmly around the gland. Swabbing was done using a continuous motion three times in each of the following directions: clockwise, counterclockwise and up and down along the rostral/caudal axis. The swab was cut to fit inside a sterile 1.5 mL tube, kept on ice for 1 – 4 h in the field, then stored at -20 ºC pending analysis. Each bird was handled using a fresh pair of nitrile gloves to minimize contamination.

Bacterial DNA extraction and 16S amplification

Extractions were carried out in 14 batches, each consisting of 23 preen gland swabs plus one extraction from a fresh sterile swab that served as a negative control (swab-only negative controls). The 14 batches of samples included 38 additional preen gland swabs that were part of a separate study. To screen for DNA contamination in the ‘sterile’ water used in swabbing, we also carried out DNA extractions from four fresh swabs dipped in this water after the end of the field season (water negative controls). We also extracted DNA from 3 swabs collected from the palms of R1 – R3 for comparison with preen gland and control swabs. In total, we extracted bacterial DNA from 209 swabs (153 + 38 preen gland samples + 14 swab-only negative controls + 4 water negative controls + 3 palm swabs). To minimize potential batch effects, we assigned swabs from the three sampling sites haphazardly to batch, such that roughly equal numbers of samples were extracted from each site in each batch. After completing all extractions, we used a Qubit Fluorometer to measure the DNA concentration of 14 samples, haphazardly selecting one sample from each batch.

We amplified the V4 region of the bacterial 16S rRNA gene using the universal primers F518 [2] and R806 [3]. In addition to the priming sequences, each primer included an Illumina MiSeq adaptor sequence, four randomized nucleotides and a unique ‘barcode’ of eight nucleotides, to allow assignment of recovered sequences to individuals. We performed PCR in a total volume of 25 µL, including 10 µL of 5PRIME HotMasterMix (Quantabio), 0.2 µM of each primer and 2 µL of DNA template (mean concentration = 0.1 ng/µL, range = 0.01 – 0.12 ng/µL). The thermocycling profile consisted of 2 min at 94 °C; 35 cycles of 45 s at 94 °C, 60 s at 50 °C and 90 s at 72 °C; and a 10 min final extension at 72 °C.

We confirmed amplification by running samples on a 2% agarose gel. Seven of the 14 swab-only negative controls and all four of the water negative controls showed a band of the expected product size (approx. 300 bp) of comparable or weaker intensity to our samples, indicating some source of contamination. Thus, we sequenced these 11 contaminated controls along with the 3 palm swabs and 191 preen gland samples (N = 205 swabs) so we could subtract likely contaminant sequences from subsequent stages of processing (see below).

Sequencing and pipeline

We pooled PCR products into a library and sequenced with 250 nt paired-end reads on an Illumina MiSeq at the London Regional Genomics Centre. We used a pipeline [4] to collapse sequences into clusters of identical reads and assign sequences to individuals. We used a second pipeline [5] and the R package dada2 [6] to overlap reads, remove ambiguous reads and filter out chimeras as well as singleton sequences (i.e. those appearing only once in the dataset) and those rarer than 0.1% in any sample. This pipeline resulted in an initial dataset containing 5243 unique sequences (i.e. sequence variants, SVs) from across all 205 samples. We assigned each SV to bacterial taxon by clustering at ≥ 97% sequence identity (following [4]) using the naïve Bayesian Ribosomal Database Project (RDP) Classifier [6,7].

High throughput sequencing generates relative abundance data that are thus compositional [8–10]. Accordingly, we used a compositional data analysis approach [11] that examines the read ratios between sequences. Most datasets do not actually contain all possible components; often, small values, including values below the detection limit of an instrument such as the Illumina MiSeq, are rounded off to zero. In such cases, zero counts reflect sampling or equipment limitations [12]. Accordingly, following [5], we used Bayesian-multiplicative replacement to impute values for zero count sequences using the R package zCompositions [12]. We then applied a centred log-ratio (clr) transformation to the zero-replaced data set, which renders the use of Euclidean distances meaningful in subsequent analyses [5,9].

Next, because rare sequences that occur in only a few samples are generally uninformative, and because samples with very low read counts are more likely to represent undersampling, we filtered sequences by the minimum proportion, minimum occurrence, and minimum sample count of reads. Thus, sequences found in fewer than 0.5% of reads, sequences found in fewer than 10% of samples, and samples with fewer than 5000 reads were removed from the initial dataset. The filtered dataset contained 47 unique sequences from 199 samples (149 from study birds, 10 negative controls, 2 palm swabs, and 38 preen gland swabs collected as part of a separate study).

Post-pipeline quality control

For our initial exploration of the dataset, we conducted a principal component analysis (PCA) of the clr-transformed data using zero-centered rotated variables and the ‘prcomp’ function in base R (following [9,13]). This permitted us to visually assess and identify any SVs likely to reflect contamination. We plotted the first and second principal components and examined the resultant biplot for SVs associated specifically with the contaminated control samples or the researcher palm samples. Finding no such SVs, we next plotted all possible pairwise combinations of the contaminated controls to further search for SVs that were shared among controls and thus likely to reflect contaminants. Here, we considered SVs that fell on or near the 1:1 line of each biplot as likely contaminants. However, no SVs were consistently common to contaminated control samples, so this approach identified no candidates for removal.

Finally, we examined the taxonomic assignment of SVs from each of the contaminated control samples to identify any obvious contaminants, such as bacterial taxa commonly found on human skin. We found no SVs meeting this criterion. While we cannot be certain that no contaminant SVs remained in our final preen gland swab dataset, we think it likely that the aggressive filtering steps we took removed the contaminating SVs, as these would have been orders of magnitude less abundant than the bacteria recovered from the birds.

As noted above, multiple researchers collected swabs, with different researchers sampling at different study sites. Thus, variation in swabbing technique or sample contamination with researcher-specific microbiota could cause samples to cluster spuriously by site. To address this possibility, we examined the PCA plot to determine if samples collected by the same researcher at different sites clustered together. Samples R1 collected from the southwest site did not cluster with those R1 collected from the central site, nor did samples R2 collected from the southwest site cluster with those R2 collected from the northeast site. Thus, researcher identity does not appear to have been an issue in this dataset.

Finally, we removed the 38 samples that were sequenced as part of a separate study, along with the 2 palm swabs and 10 contaminated control samples we sequenced, for a final data set of 47 SVs from 149 individual song sparrows (southwest: 42 [12 females, 30 males]; central: 51 [22 females, 29 males], northeast: 56 [18 females, 38 males], mean ± SE retained reads per individual = 8540 ± 1552). Of these 149 song sparrows, we retained data from the 31 birds for which we also had MHC genetic data (19 from southwest [11 males, 8 females] and 12 from central [8 males, 4 females]), as described in the manuscript methods.

Funding

Natural Sciences and Engineering Research Council of Canada