
Data from: Pitfalls and pointers: an accessible guide to marker gene amplicon sequencing in ecological applications

Cite this dataset

Porath-Krause, Anita et al. (2021). Data from: Pitfalls and pointers: an accessible guide to marker gene amplicon sequencing in ecological applications [Dataset]. Dryad.


Next Generation Sequencing (NGS) is a powerful tool that has been rapidly adopted by many ecologists studying microbial communities. Despite the exciting demonstration of NGS technology as a tool for ecological research, cryptic pitfalls inherent to its use can obscure correct interpretation of NGS data. Here, we provide an accessible overview of an NGS process that uses marker gene amplicon sequences (MGAS). This overview will allow scientists, particularly community ecologists, to make appropriate methodological choices and to understand the limits on inference about community composition and diversity that can be drawn from MGAS data.

We describe the MGAS pipeline, focusing specifically on cryptic sources of variation that have received less emphasis in the ecological literature, but which may substantially impact inference about microbial community diversity and composition. By simulating communities from published microbiome data, we demonstrate how these sources of variation can generate inaccurate or misleading patterns.

We specifically highlight sample dilution without researcher awareness and lane-to-lane variability, two cryptic sources of variation arising during the MGAS pipeline. These sources of variation affect estimates of species presence and relative abundance, particularly for species with moderate to low abundances. Each of these sources of bias can lead to errors in the estimation of both absolute and relative abundance within, and turnover among, microbial communities.

Awareness and understanding of what happens and, specifically, why it happens during MGAS generation is key to generating a strong data set and building a robust community matrix. Requesting sample dilution information from the sequencing center, including technical replicates across sequencing lanes, and understanding how sampling intensity and community taxa distribution patterns shape the measurement of community richness, evenness, and diversity are critical for drawing correct ecological inferences using MGAS data.


Comparisons within and across communities commonly involve measures of diversity. Because diluting samples or placing samples on different lanes can exclude rare taxa and misrepresent species abundance and presence, building and working with the community matrix generated from MGAS data raises downstream concerns. Here we demonstrate how the community matrix can affect measures of diversity depending on the type of community (skew of its rank-abundance distribution) and sampling effort (i.e., depth of sequencing). Using published MGAS data, we evaluate the bias arising from sampling communities with different evenness (i.e., skew in rank-abundance distribution) at different intensities. Using ITS1 sequence data generated with Illumina MiSeq that represent foliar fungal endophyte community samples from about 120 prairie grass plants across 4 sites in the north central United States (Seabloom et al., 2019), we fit rank-abundance distributions to the OTUs using the vegan package (Oksanen et al., 2008) in R version 3.5.2 (R Core Team, 2013). We then simulated communities with a skew matching the empirical data (Fig. 3, center column), with the skew parameter doubled (left column), or with the skew halved (right column), to generate three versions of skew centered on the empirical data. These examples (Fig. 3) represent typical communities ranging from high evenness, such as dispersed plant litter (Albright & Martiny, 2018), to very low evenness (highly skewed), such as the gut microbiome of human infants (Pannaraj et al., 2017). All simulated communities contained 2,753 species, the number of species identified in the empirical data.
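The three skew scenarios can be illustrated with a minimal Python sketch. It is not the authors' code: the original analysis fit rank-abundance distributions with vegan in R, and the fitted model and its parameter values are not stated here. The sketch assumes a Zipf-like power law, in which relative abundance is proportional to rank raised to the negative skew, and uses a hypothetical base skew of 1.0. The function names `rank_abundance` and `pielou_evenness` are illustrative, and Pielou's evenness is used only as a convenient summary of how skew changes evenness.

```python
import math


def rank_abundance(n_species, skew):
    """Relative abundances under an ASSUMED Zipf-like form: p_i ~ rank**(-skew)."""
    weights = [rank ** (-skew) for rank in range(1, n_species + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def pielou_evenness(probs):
    """Pielou's J: Shannon entropy divided by its maximum, ln(number of species)."""
    shannon = -sum(p * math.log(p) for p in probs if p > 0)
    return shannon / math.log(len(probs))


N_SPECIES = 2753   # number of species stated in the text
BASE_SKEW = 1.0    # hypothetical value; the fitted empirical skew is not given

# Three scenarios centered on the (assumed) empirical-like skew.
scenarios = {
    "doubled skew": rank_abundance(N_SPECIES, 2 * BASE_SKEW),
    "empirical-like skew": rank_abundance(N_SPECIES, BASE_SKEW),
    "halved skew": rank_abundance(N_SPECIES, BASE_SKEW / 2),
}

for name, probs in scenarios.items():
    print(f"{name}: Pielou evenness = {pielou_evenness(probs):.3f}")
```

Under this assumed form, doubling the skew concentrates abundance in the top-ranked species and lowers evenness, while halving it spreads abundance more evenly across the 2,753 species.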

We simulated communities with multinomial distributions of 2,753 categories (representing the number of unique species from the empirical data), with probabilities of each category drawn from the three rank-abundance distributions. For each skew scenario, we simulated communities of 1 million, 13 million, or 26 million OTUs, to represent a range of microbial community sizes identified across natural systems and in mock communities (from aphids (Jousselin et al., 2016), plants (Seabloom et al., 2019), and mock communities (Bokulich et al., 2013), respectively). Then we sub-sampled each simulated ‘parent’ community by randomly drawing 100, 200, 500, 1000, 2500, 5000, 7500, or 10000 individuals (representing potential sample dilution and the depth of sequencing). For both the simulated parent communities (horizontal dashed lines in Fig. 3) and the sub-sampled communities (points with SE), we calculated richness, abundance-rarefied richness (rarefied to 1,000 individuals), and common measures of alpha diversity, including Shannon’s diversity index (Shannon, 1948) and inverse Simpson’s diversity (Simpson, 1949), also known as the Effective Number of Species based on the Probability of Interspecific Encounter (ENSPIE) (Chase & Knight, 2013). We also calculated the compositional similarity between the sample and the parent (full) community using abundance-based Bray-Curtis (Bray & Curtis, 1957) and incidence-based Jaccard (Jaccard, 1901) similarities. The simulations were replicated 100x, 20x, and 10x for the communities of 1 million, 13 million, and 26 million individuals, respectively.
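The sub-sampling procedure described above can be sketched in standard-library Python (3.9 or later, for the `counts` argument of `random.sample`). This is an illustration under stated assumptions, not the authors' R code. The parent community is reduced to a hypothetical 100,000 individuals to keep the example fast. The rank-abundance form is an assumed Zipf-like power law with a hypothetical skew of 1.0. Sub-samples are assumed to be drawn without replacement. Bray-Curtis similarity is assumed to be computed on relative abundances. Abundance-rarefied richness and replication are omitted for brevity, and all function names are illustrative.

```python
import math
import random
from collections import Counter


def zipf_probs(n_species, skew):
    """ASSUMED rank-abundance form: p_i proportional to rank**(-skew)."""
    weights = [rank ** (-skew) for rank in range(1, n_species + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_parent(probs, n_individuals, rng):
    """Multinomial draw: assign each individual to a species, then count."""
    return Counter(rng.choices(range(len(probs)), weights=probs, k=n_individuals))


def subsample(parent, depth, rng):
    """Draw `depth` individuals from the parent WITHOUT replacement (assumption)."""
    species = list(parent)
    counts = [parent[s] for s in species]
    return Counter(rng.sample(species, k=depth, counts=counts))


def relative(counts):
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def shannon(counts):
    return -sum(p * math.log(p) for p in relative(counts).values())


def inverse_simpson(counts):
    return 1.0 / sum(p * p for p in relative(counts).values())


def bray_curtis_similarity(a, b):
    """On relative abundances, 1 - Bray-Curtis dissimilarity = sum of minima."""
    pa, pb = relative(a), relative(b)
    return sum(min(p, pb.get(s, 0.0)) for s, p in pa.items())


def jaccard_similarity(a, b):
    """Shared species divided by all species present in either community."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


rng = random.Random(42)                    # fixed seed for reproducibility
parent = draw_parent(zipf_probs(2753, 1.0), 100_000, rng)  # hypothetical size
depths = [100, 200, 500, 1000, 2500, 5000, 7500, 10000]    # depths from the text

results = {}
for depth in depths:
    sample = subsample(parent, depth, rng)
    results[depth] = {
        "richness": len(sample),
        "shannon": shannon(sample),
        "inverse_simpson": inverse_simpson(sample),
        "bray_curtis": bray_curtis_similarity(sample, parent),
        "jaccard": jaccard_similarity(sample, parent),
    }
    print(depth, {k: round(v, 3) for k, v in results[depth].items()})

print("parent richness:", len(parent))
```

Because each sub-sample is a subset of its parent, the Jaccard similarity here reduces to sample richness divided by parent richness, which makes the loss of rare taxa at shallow depths directly visible.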


National Science Foundation, Award: DEB1241895