Dryad

Data and code from: Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data

Data files

Jun 10, 2024 version files 2.26 GB
Sep 19, 2024 version files 439.45 MB

Abstract

We use open-source human gut microbiome data to learn a microbial “language” model by adapting techniques from Natural Language Processing (NLP). Our microbial “language” model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial species and the common compositional patterns in microbial communities. The learned model produces contextualized taxa representations that allow a single bacterial species to be represented differently according to the specific microbial environment it appears in. The model further provides a sample representation by collectively interpreting the different bacterial species in the sample and their interactions as a whole. We show that, compared to baseline representations, our sample representation consistently leads to improved performance on multiple prediction tasks, including predicting Irritable Bowel Disease (IBD) and diet patterns. Coupled with a simple ensemble strategy, it produces a highly robust IBD prediction model that generalizes well to microbiome data independently collected from different populations with substantial distribution shift.
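To make the self-supervised setup concrete, here is a minimal sketch of the data-preparation step common to masked-language-model training, applied to a microbiome sample treated as a sequence of taxa tokens. The function name, mask token, and 15% masking rate are illustrative assumptions, not details taken from the released code.

```python
import random

MASK = "<mask>"  # placeholder token; the actual vocabulary/token is an assumption


def mask_sample(taxa, mask_prob=0.15, rng=None):
    """Randomly mask taxa tokens in a sample, masked-LM style.

    Returns (inputs, targets): inputs has MASK at hidden positions,
    targets holds the original token there and None elsewhere, so the
    model is trained only on the masked positions.
    """
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for tok in taxa:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


# Example: one sample is the list of taxa observed in a gut microbiome profile.
sample = ["Bacteroides", "Faecalibacterium", "Roseburia", "Akkermansia"]
inputs, targets = mask_sample(sample, mask_prob=0.5)
```

A model trained to recover the masked taxa from their community context learns, without labels, which species tend to co-occur — the "common compositional patterns" described above.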

We visualize the contextualized taxa representations and find that they exhibit meaningful phylum-level structure, even though the model was never exposed to such a signal during training. Finally, we apply an interpretation method to highlight bacterial species that are particularly influential in driving our model’s predictions for IBD.
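One simple family of interpretation methods that fits this description is occlusion (leave-one-out) scoring: remove each taxon from a sample and measure how much the model's prediction changes. The sketch below is an assumption about the general technique, not the authors' specific method; the toy model and taxon names are illustrative.

```python
def occlusion_importance(predict, taxa):
    """Score each taxon by the drop in predicted probability
    when it is removed from the sample (leave-one-out)."""
    base = predict(taxa)
    scores = {}
    for i, tok in enumerate(taxa):
        reduced = taxa[:i] + taxa[i + 1:]
        scores[tok] = base - predict(reduced)
    return scores


# Toy stand-in for a trained IBD classifier: predicted probability rises
# with the count of a hypothetical "marker" taxon.
def toy_predict(taxa):
    return min(1.0, 0.2 + 0.3 * taxa.count("Escherichia"))


sample = ["Escherichia", "Bacteroides", "Escherichia"]
scores = occlusion_importance(toy_predict, sample)
```

Taxa whose removal causes a large drop in the predicted probability are flagged as influential for the prediction.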