Phylogenetically under‐dispersed gut microbiomes are not correlated with host genomic heterozygosity in a genetically diverse reptile community

Holmes, Iris A.1 ; Grundler, Michael C.1

Published Nov 08, 2023 on Dryad. https://doi.org/10.5061/dryad.f7m0cfxzb

Abstract

We are providing semi-processed datasets relevant to the paper "Phylogenetically under-dispersed gut microbiomes across a range of host genetic diversity in a reptile community point to structuring by conserved host genes." Specifically, we include VCF files of RADseq data from host individuals, which are processed versions of the raw reads available at NCBI's Short Read Archive under PRJA744273. These data were processed for heterozygosity calculation using an adapted of the pipeline presented in Singhal et al. 2017, "Genetic diversity is largely unpredictable but scales with museum occurrences in a species-rich clade of Australian lizards."

https://doi.org/10.5061/dryad.f7m0cfxzb

In addition, we include a database of 16S sequences from gut microbiome amplicon sequencing from the same host animals. The raw reads are available at NCBI's Short Read Archive under PRJNA746253. The sequences accessioned here are a curated, cleaned set of reference reads to which we realigned reads from each individual host. We include hd-fin.fasta (a fasta file with a representative sequence for each OTU) hd_otu.tre (a phylogenetic tree of the otus), hd-fin.phylip.tre (same tree in phylip format), hd-fin.opti_mcc.list, hd-fin.opt_mcc.0.02.cons.taxonomy, and hd-fin.opti_mcc.shared (mothur output files that include a taxonomic designation for each OTU and the occurrence of the OTUs in the hosts). Our post-processing working files are hd_otu.fa (fasta of OTUs that pass our initial screen), hd_out.txt (a host-by-OTU matrix), and hd_otu.tax (a list of the taxonomy assigned to each OTU). In addition, we include the raw data from our negative control sequencing, and two host phylogenetic trees - liz2.tre, using in plotting, and tonini.tre, used in analyses.

We provide the code and all necessary intermediate files to replicate the analyses presented in our paper, title above.

Description of the data and file structure

All intermediate data files necessary to run the code are provided. These include hd_otu.fa, hd_otu.tax, and hd_otu.tre, which are a fasta of one representative sequence per otu, the taxonomic identity of the otus, and a tree built from the otus. Hd_otu.txt is a host-by-OTU matrix with elements showing the number of reads per OTU per host. The hd-fin files are output files for the otu tree from qiime2. The host vfc files are used to derive host heterozygosity. The neg-control fastas include the raw data from the negative control for the 16S run. Tonini.tre and liz2.tre are files used for calculating and plotting host phylogenetic distance, respectively.

Sharing/Access information

Data was derived from the following sources: Both 16S rRNA microbiome data and host RADseq data have been made available on NCBI's Short Read Archive, as BioProject PRJNA746253 and PRJNA744273 respectively.

Code/Software

HD_file_prep.R: processes mothur outputs (hd-fin.fasta, hi-fin.opti_mcc.0.03.cons.taxonomy, hd-fin.opti_mcc.list, hd-fin.phylip.tre, and hd-fin.opti_mcc.shared) into formats more convenient for further analysis (hd_otu.fa, hd_otu.tax, hd_otu.tre, and hd_otu.txt). Also includes code for supplementary figure 1, which calls the vcf files for the RADseq data from each host, provided as a zip file.

HD_stats_code.R: code for statistical methods not included in the code a specific figure. Also includes code for supplementary figure 3.
All files called by the file prep, stats, and figure codes are included in this upload. We also include our negative control output fastq file for reference.

The repository also contains code for each figure and the associated data analysis. The files called for each figure are listed in the first few lines of the script.

Table S1 contains metadata on individuals. The columns are:

sample = code for individual sample
genus, species, family = the taxonomic designations of the host
date = date of sampling
time = time of sampling, 24 hour, CST
lat and long = WGS 84 coordinates of host capture location
elevation = in meters, from GPS unit
mass = mass of host in grams
length = length of host in mm
sex = sex of host, determined in field from external morphology
age class = adult or juvenile, determined in field
heterozygosity = host heterozygosity based on proportion of heterozygous sites in RAD fragments
no_frag = number of fragments in RAD dataset
lag = number of fragments until heterozygosity values stabilized near the final measure
microbiome richness = number of OTUs in microbiome
microbiome Faith's diversity = phylogenetic diversity of OTUs in microbiome
mean randomized Faith's diversity = mean phylogenetic diversity calculated from equal-richness samples drawn at random
Faith's diversity p-value = proportion of random samples that have a Faith's diversity value smaller than the real samples

Laboratory methods

We extracted total DNA from tissue samples and cloacal swabs using DNeasy Blood and Tissue spin column kits from Qiagen. We prepared the tissue DNA according to a RADseq protocol from (Peterson, Weber, Kay, Fisher, & Hoekstra, 2012), using the restriction enzymes EcoR1 and Msp1. DNA fragments up to 400 bp in length were sequenced on an Illumina HiSeq platform using version 4 reagents over two paired end runs at the University of Michigan Sequencing Core Facility. For bacteria metabarcoding, we used an Illumina MiSeq platform at the University of Michigan Microbiome Core Facility to barcode a 252 bp sequence from the V4 region of the 16S rRNA DNA (Kozich et al., 2013). The library was prepared by the core and sequenced in a single lane with a negative and positive control.

Bioinformatic pipelines

For our RADseq data, we used a modification of the pipeline presented in (Singhal et al., 2017). This pipeline was tested against pyRAD (Eaton, 2014) and determined to provide more reliable estimates of heterozygosity (Singhal et al., 2017, electronic supplementary material). We first removed low-quality sequences and adapter sequences using Trimmomatic v0.33 (Bolger, Lohse, & Usadel, 2014) and assembled the reads using Rainbow v2.0.4 (Chong, Ruan, & Wu, 2012). We then generated a separate pseudogenome for each host individual by clustering the assembled reads using vsearch v1.0.2 (Rognes, Flouri, Nichols, Quince, & Mahé, 2016). Finally, we mapped the raw reads back to the pseudogenome using the GATK function haplotypeCaller (McKenna et al., 2010; Poplin et al., 2017). Assembling per-individual rather than per-species pseudogenomes allowed us to avoid biases introduced by variation in the number of fragments per individual and differences in sample size between species (Cariou et al., 2016). Using the GATK haplotypeCaller function allowed us to specify the ploidy of the hosts as an input variable for the program. Using published values, we specified that our unisexual species were triploid (Reeder et al., 2002), while the other hosts were diploid. We retained all sequenced DNA fragments with a read depth greater than 20 and called heterozygote bases when the rarer of the two alleles was represented in more than 40% of the reads for diploids or when the rarest of the three alleles was represented in more than 20% of the reads for triploids. We retained host individuals that had more than 100 fragments of a length greater than 200 bp that passed our filters (Table S1).

For the bacterial 16S rRNA metabarcode data from cloacal swabs, we used the program mothur v.1.48 (Schloss et al., 2009). For our 94 microbiome samples and a negative sequencing control, we made contigs from our paired-end sequences, selected fragments within 8 bp of our target 252 bp length, and removed reads with homopolymers over eight base-pairs long or with ambiguous base calls. We then aligned the sequences against the reference bacteria in the SILVA 16S rRNA database release 138 (Quast et al., 2012; Yilmaz et al., 2014) and removed the sequences that did not overlap the target alignment region on the reference dataset or aligned with less that 80% similarity. We removed chimeric sequences, then obtained the taxonomic classification of our remaining sequences by aligning them against the June 2020 RDP data release (Cole et al., 2014; Wang, Garrity, Tiedje, & Cole, 2007). We retained only those sequences that aligned most closely to bacteria references. We clustered sequences at a 97% similarity threshold to obtain OTUs using the OptiClust algorithm (Westcott & Schloss, 2017), then obtained a final taxonomic classification for our OTUs using the RDP dataset. Finally, we generated a phylogenetic tree using the ‘clearcut’ command (Sheneman, Evans, & Foster, 2006).

Using a custom script in R, we removed the 12 OTUs that occurred in the negative control from our dataset. Following visual inspection of the distribution of sample sizes, we retained the data from all hosts with more than 2,000 total sequences. We chose our threshold to optimize between sample retention and sequencing depth. The threshold of 2,000 reads occurred at a natural breakpoint in our histogram of sequences recovered per host (Figure S1 A) and came at a point at which the slopes of the rarefaction curves for most microbiomes were beginning to flatten (Figure S1 B). Because some of our downstream analyses are vulnerable to unbalanced sample sizes (N. J. Gotelli & Colwell, 2001), we rarefied our host-by-OTU matrix to 2,000 sequences per host using the ‘rrarefy’ command from the R package ‘vegan’ (Oksanen et al., 2018).

The largest sequencing depth for any OTU that occurred in the negative control was two. We therefore set all OTU calls in our dataset with sequencing depths less than four to zero. This procedure conservatively accounted for potential rates of index hopping that could have caused the OTUs to appear in our negative control. Finally, we removed any individuals from the dataset that did not meet our host heterozygosity threshold. We also modified mothur output files to generate a taxonomy file for the OTUs, a fasta file with one representative sequence per OTU taken from the first listed sequence name in the OptiClust output, and a phylogenetic tree of the OTUs.

Phylogenetically under‐dispersed gut microbiomes are not correlated with host genomic heterozygosity in a genetically diverse reptile community

Data files

Abstract

README: Phylogenetically under‐dispersed gut microbiomes are not correlated with host genomic heterozygosity in a genetically diverse reptile community

Description of the data and file structure

Sharing/Access information

Code/Software

Methods

Usage notes

Works referencing this dataset