Metagenomics tools employed in microbiome research, 2018-2023, per peer-reviewed publications

Szymanski, Erika1; Kosmiski, Kalyn1; Myers, Katelyn2

Published Oct 07, 2024 on Dryad. https://doi.org/10.5061/dryad.31zcrjdw8

Data files

Oct 07, 2024 version files (380.24 KB total)

BlackBoxToolCategorizationSpreadsheet_092624.xlsx (376.03 KB)
README.md (4.21 KB)

Abstract

Bioinformatics tools for processing metagenomic data embed choices about how to correlate DNA sequences with the presence of microbial taxa. Because no single correct way to make these choices has been or can currently be established, tools may embed different choices, and thus different assumptions about what constitutes valid evidence of a microorganism. We set out to document how those assumptions varied across the range of microbiome bioinformatics tools in current use. However, we were unable to do so because bioinformatics methods are inconsistently and incompletely documented in the peer-reviewed literature. Those omissions are important to how methodological choices can be accounted for in interpreting results, and to the capacity for microbiome research to expand upon current understandings of how microorganisms exist. We advocate for more complete and transparent communication of bioinformatics choices in the published microbiome literature, for reasons concerning accessibility, education, data reusability, and standardization.

Description of the data and file structure

Bioinformatics tools for processing metagenomic data embed choices about how to correlate DNA sequences with the presence of microbial taxa. Because no single correct way to make these choices has been or can currently be established, tools may embed different choices, and thus different assumptions about what constitutes valid evidence of a microorganism. We set out to document how those assumptions varied across the range of microbiome bioinformatics tools in current use. We began this social scientific investigation by asking: what assumptions about microbial identity are embedded in the computational tools employed to process microbiome sequencing data? What do those tools—and therefore researchers who work with microbiome sequencing data—(explicitly or implicitly) assume a microbe is, and how can they tell when they’ve detected one? What is thrown out as junk rather than retained as a coherent sign of life? We found that these questions are not easily answered, at least not via peer-reviewed publications, because publications are not transparent about bioinformatics decisions.

Files and variables

File: BlackBoxToolCategorizationSpreadsheet070623.xlsx

Description: List and characterization of each metagenomics tool investigated in this study

Variables

Tool: Any software program employed in microbiome research for processing and interpreting sequencing data to identify microbes in a sample.
Reference 1: Example article in which the tool appears that we assessed
System 1: Study system (model, application, problem, case, etc.) in Reference 1
Research Q 1: Paraphrase or quote of question addressed in Reference 1
Discipline 1: General disciplinary umbrella for Reference 1
Used with other tools? 1: List of other tools employed with target tool in Reference 1
Date 1: Initial publication date of Reference 1
Reference 2: Example article in which the tool appears that we assessed
System 2: Study system (model, application, problem, case, etc.) in Reference 2
Research Q 2: Paraphrase or quote of question addressed in Reference 2
Discipline 2: General disciplinary umbrella for Reference 2
Used with other tools? 2: List of other tools employed with target tool in Reference 2
Date 2: Initial publication date of Reference 2
Reference 3: Example article in which the tool appears that we assessed
System 3: Study system (model, application, problem, case, etc.) in Reference 3
Research Q 3: Paraphrase or quote of question addressed in Reference 3
Discipline 3: General disciplinary umbrella for Reference 3
Used with other tools? 3: List of other tools employed with target tool in Reference 3
Date 3: Initial publication date of Reference 3
Reference 4: Example article in which the tool appears that we assessed (4th reference assessed only when warranted by the breadth of systems, RQs, and/or disciplines across which a given tool appeared to be used)
System 4: Study system (model, application, problem, case, etc.) in Reference 4
Research Q 4: Paraphrase or quote of question addressed in Reference 4
Discipline 4: General disciplinary umbrella for Reference 4
Used with other tools? 4: List of other tools employed with target tool in Reference 4
Date 4: Initial publication date of Reference 4
Reference 5: Example article in which the tool appears that we assessed (5th reference assessed only when warranted by the breadth of systems, RQs, and/or disciplines across which a given tool appeared to be used)
System 5: Study system (model, application, problem, case, etc.) in Reference 5
Research Q 5: Paraphrase or quote of question addressed in Reference 5
Discipline 5: General disciplinary umbrella for Reference 5
Used with other tools? 5: List of other tools employed with target tool in Reference 5
Date 5: Initial publication date of Reference 5
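
Because each tool's references are stored side by side in numbered columns, it can be convenient to reshape a row into one record per assessed reference. The following is a minimal sketch in plain Python, assuming the spreadsheet's column headers match the variable names listed above exactly (an assumption; check the file) and that each row has already been read into a dictionary by whatever spreadsheet reader you prefer. The function name `row_to_records` and the example row are hypothetical and are not taken from the dataset.

```python
# Per-reference fields from the "Variables" list; each appears with a suffix 1-5.
FIELDS = ["Reference", "System", "Research Q", "Discipline",
          "Used with other tools?", "Date"]


def row_to_records(row):
    """Turn one wide spreadsheet row (a dict) into one record per assessed reference."""
    records = []
    for slot in range(1, 6):  # reference slots 1 through 5
        reference = row.get(f"Reference {slot}")
        if reference is None or str(reference).strip() == "":
            continue  # slots 4 and 5 are filled only for some tools
        record = {"Tool": row.get("Tool"), "Slot": slot}
        for field in FIELDS:
            record[field] = row.get(f"{field} {slot}")
        records.append(record)
    return records


# Hypothetical row for illustration only; values are placeholders, not real data.
example_row = {
    "Tool": "ExampleTool",
    "Reference 1": "Example article A", "System 1": "soil",
    "Research Q 1": "placeholder question", "Discipline 1": "ecology",
    "Used with other tools? 1": "OtherTool", "Date 1": "2021",
    "Reference 2": "Example article B", "System 2": "human gut",
    "Research Q 2": "placeholder question", "Discipline 2": "medicine",
    "Used with other tools? 2": "", "Date 2": "2022",
    "Reference 3": "",  # empty slot is skipped
}

records = row_to_records(example_row)
```

Each resulting record carries the tool name and the slot number, so records from all tools can be pooled and grouped by discipline, system, or date without handling the numbered suffixes again.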

Code/software

Excel or any other spreadsheet-viewing application

Data was derived from the following sources:

PubMed, Google Scholar

This dataset is a catalog of common tools in current use as of 2023. We searched PubMed and Google Scholar for “microbiome” + “tool name,” deriving an initial list of tool names from recent review articles and then adding tools mentioned in association with other tools until we reached saturation. We excluded tools that were not cited after 2020 (not in current use), that were not cited more than three times or by more than one research group (not common), and statistical tools that were not specific to microbiome sequencing analysis.
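
The exclusion criteria above can be read as a simple filter. The sketch below is one possible reading, not the procedure the authors ran: the record type `ToolUsage`, its field names, and the function `keep_tool` are hypothetical, "cited after 2020" is assumed to mean a most recent citation year of 2021 or later, and a tool is assumed to count as common if it was cited more than three times or by more than one research group.

```python
from dataclasses import dataclass


@dataclass
class ToolUsage:
    """Hypothetical summary of how a candidate tool appears in search results."""
    name: str
    last_cited_year: int       # most recent year in which the tool was cited
    citation_count: int        # number of citing articles found
    research_groups: int       # number of distinct research groups citing it
    generic_statistical: bool  # statistical tool not specific to microbiome sequencing


def keep_tool(tool: ToolUsage) -> bool:
    """Apply the three exclusion criteria under the assumptions stated above."""
    in_current_use = tool.last_cited_year > 2020
    is_common = tool.citation_count > 3 or tool.research_groups > 1
    return in_current_use and is_common and not tool.generic_statistical


# Placeholder candidates for illustration only; not real tools or counts.
candidates = [
    ToolUsage("ToolA", 2022, 10, 4, False),  # kept
    ToolUsage("ToolB", 2019, 50, 9, False),  # excluded: not cited after 2020
    ToolUsage("ToolC", 2023, 2, 1, False),   # excluded: not common
    ToolUsage("ToolD", 2023, 30, 7, True),   # excluded: generic statistical tool
]
catalog = [tool.name for tool in candidates if keep_tool(tool)]
```

If the commonness criterion is instead read as requiring both more than three citations and more than one research group, the `or` in `is_common` becomes `and`.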

We then characterized each tool in our catalog, organizing them into groups with similar functions. For each tool, we searched for “microbiome + [tool name]” as keywords in PubMed and Google Scholar, choosing 3-5 articles published between 2012 and 2022 that encompassed the diversity of topics or subdisciplines represented in the search results. We searched these publications for model system or application area; disciplinary orientation (generalizing from the publication, author affiliations, and system); research question; timeframe; and other tools used together with the tool that we were characterizing. We then sorted tools into categories on the basis of their data processing role. Because we observed category names being used inconsistently in the literature, we constructed and named categories that could be defined and distinguished on the basis of our observations (grounded coding, a common qualitative social science method). When categorizing a tool on the basis of our initially selected publications was difficult, we returned to our search results for additional detail, and we consulted any openly available documentation provided by the tool developer(s).

After the initial data collection step, in which tools and their uses in specific research articles were documented, we began to classify tools into the categories listed in Table 1. To understand the function of a tool, we searched for the tool within each article and recorded its input and output. If the data transformation performed by the tool was clear, we categorized the tool based on the definitions included in Table 1, matching the tool to the category with which it most closely aligned. For example, metagenomic sequencing data is input into MetaGeneMark, and predicted genes are output, so MetaGeneMark was classified as a protein prediction tool. While some tools shared a function across all documented research papers, others were used less consistently. When the classification of a tool was not clear from its uses in the documented research papers, we consulted documentation provided by the original tool developers (often tool announcements) to determine the intended use of the tool.
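
The input/output matching described above can be pictured as a lookup from a recorded data transformation to a category. This is a minimal sketch only: the table name `CATEGORY_BY_TRANSFORMATION` and the function `categorize` are hypothetical, the single entry is the MetaGeneMark example from the text, and the full category definitions in Table 1 are not reproduced here. In practice the matching was a qualitative judgment, not an exact string lookup.

```python
# Hypothetical lookup keyed by (input, output). Only the entry below comes from
# the text (the MetaGeneMark example); Table 1 would supply the rest.
CATEGORY_BY_TRANSFORMATION = {
    ("metagenomic sequencing data", "predicted genes"): "protein prediction",
}


def categorize(tool_input, tool_output):
    """Return the category for a recorded transformation, or None if unclear.

    None stands for the case in which the articles do not settle the category
    and the tool developers' documentation has to be consulted.
    """
    key = (tool_input.strip().lower(), tool_output.strip().lower())
    return CATEGORY_BY_TRANSFORMATION.get(key)


metagenemark_category = categorize("Metagenomic sequencing data", "Predicted genes")
unclear_category = categorize("metagenomic sequencing data", "unknown output")
```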

Many tools that appeared to fulfill the same purpose were described in different terms in different papers. For example, “binning” is used to describe both supervised and unsupervised taxonomy algorithms, encompassing steps that might be characterized more specifically as “taxonomic assignment/classification” or “clustering,” respectively. Common biology terms such as “gene,” “genome,” and “microbiome” itself routinely and unproblematically have multiple meanings as established knowledge shifts over time. However, in this case, context does not always clear up terminological ambiguity, and that ambiguity matters to authors’ abilities to readily recognize how inferences are being made about microbial identity in any given study. Consequently, we constructed and named categories that could be defined and distinguished on the basis of our observations (grounded coding, a common qualitative social science method).