Data from: Early life microbial succession in the gut follows common patterns in humans across the globe
Data files
Jul 08, 2025 version, 7.85 GB in total
- AgeModel_FullCV_Results.jld2 (4.61 GB)
- ecs_metadata.csv (16.11 KB)
- ecs.arrow (815.66 MB)
- filtered_taxonomic_inputs.csv (2.83 MB)
- full_taxonomic_inputs.csv (15.41 MB)
- inputs_with_testdata.csv (2.88 MB)
- README.md (7.44 KB)
- regression_Age_LeaveCMDOut.jld (276.23 MB)
- regression_Age_LeaveCombineOut.jld (353.40 MB)
- regression_Age_LeaveEchoOut.jld (366.50 MB)
- regression_Age_LeaveEnnisOut.jld (390.57 MB)
- regression_Age_LeaveGerminaOut.jld (291.45 MB)
- regression_Age_LeaveKhulaOut.jld (345.98 MB)
- regression_Age_LeaveM4EFADOut.jld (381.74 MB)
- visit_annotation_glossary.xlsx (9.60 KB)
Abstract
Characterizing the dynamics of microbial community succession in the infant gut microbiome is crucial for understanding child health and development, but no normative model currently exists. Here, we estimate child age using gut microbial taxonomic relative abundances from metagenomes, with high temporal resolution (±3 months) for the first 1.5 years of life. Using 3,154 samples from 1,827 infants across 12 countries, we trained a random forest model, achieving a root mean square error of 2.61 months. We identified key taxonomic predictors of age, including declines in Bifidobacterium spp. and increases in Faecalibacterium prausnitzii and Lachnospiraceae. Microbial succession patterns are conserved across infants from diverse human populations, suggesting universal developmental trajectories. Functional analysis confirmed trends in key microbial genes involved in feeding transitions and dietary exposures. This model provides a normative benchmark of “microbiome age” for assessing early gut maturation that can be used alongside other measures of child development.
https://doi.org/10.5061/dryad.dbrv15f9z
This dataset contains:
- Pre-formatted selected taxonomic profiles from stool metagenomic sequencing, obtained from a combination of:
  - 8 previously published and publicly available studies: Asnicar, F. et al. (2017) [1], Bäckhed, F. et al. (2015) [2], Kostic, A. D. et al. (2015) [3], Pehrsson, E. et al. (2016) [4], Shao, Y. et al. (2019) [5], Vatanen, T. et al. (2016) [6], Yassour, M. et al. (2018) [7], Bonham, K. et al. (2023) [8]; and
  - 4 studies from the Wellcome Leap 1kD program: Fatori, D. et al. (2024) [9], Hemmingway, A. et al. (2020) [10], O'Sullivan, J. et al. (2024) [11], and the publication accompanying this dataset, Bottino, G. et al. (2024), which introduces the samples from the Khula study, Zieff, M. et al. (2024) [12]
- Pre-formatted selected functional EC profiles for the samples from the Khula study
Description of the data and file structure
This dataset contains the following files:
full_taxonomic_inputs.csv is the file containing all taxonomic profiles prior to prevalence filtering and removal of samples with no reads assigned to prevalence-filtered taxa. filtered_taxonomic_inputs.csv is the post-filter version of the same file, which contains only the samples and features used to train the machine learning models and to perform downstream analysis. These files have the following structure:
- study_name: String, identified by publication, e.g. BonhamK_2023.
- datagroup: String, meta-datagroup of samples. Options are ECHO, LEAP and CMD.
- site: String, ISO 3166-1 alpha-3 three-letter code for country of origin of samples.
- datacolor: String, manuscript-wide color adopted for data source.
- datasource: String, major data source for the sample. Options are ECHO-RESONANCE, CMD-DIABIMMUNE, CMD-OTHER, 1kDLEAP-KHULA, 1kDLEAP-GERMINA, 1kDLEAP-COMBINE, 1kdLEAP-M4EFAD. CMD-DIABIMMUNE and CMD-OTHER are combined into the single CMD meta category in most downstream analyses.
- visit: String, nominal timepoint of stool sample collection, when available, in the original annotation format of each study. For a better understanding of acronyms in this column, please refer to the visit_annotation_glossary.xlsx file.
- westernized_cat: String, categorization of site as a westernized country.
- subject_id: String, unique study participant identifier.
- ageMonths: Numeric, age at stool sample collection, in months.
- sample: String, unique sample identifier.
- Multiple columns of taxa abundances, encoded with species name in the form Genus_species, e.g. Escherichia_coli or Klebsiella_pneumoniae. Values are relative abundances on the scale 0.0:100.0.
- Shannon_index: alpha-diversity of the sample measured as the Shannon index.
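The column layout described above can be handled with a short script. The following is a minimal sketch, not the authors' loading code (that is distributed separately on Zenodo). It assumes that every column not named in the list above is a taxon abundance column, and it recomputes a Shannon index using the natural logarithm on renormalized proportions. The log base and feature set behind the stored Shannon_index column are not stated here, so recomputed values may differ from the stored ones. The helper names and the example table are hypothetical.

```python
import csv
import io
import math

# Metadata columns named in the file description; every other column is
# treated as a taxon abundance column (an assumption of this sketch).
METADATA_COLUMNS = {
    "study_name", "datagroup", "site", "datacolor", "datasource", "visit",
    "westernized_cat", "subject_id", "ageMonths", "sample", "Shannon_index",
}


def split_row(row):
    """Separate one CSV row into metadata (strings) and taxon abundances (floats)."""
    meta = {k: v for k, v in row.items() if k in METADATA_COLUMNS}
    taxa = {k: float(v) for k, v in row.items() if k not in METADATA_COLUMNS}
    return meta, taxa


def shannon(abundances):
    """Shannon index (natural log) after renormalizing abundances to proportions."""
    values = list(abundances)
    total = sum(values)
    if total <= 0:
        return 0.0
    props = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in props)


# Tiny made-up table with the documented layout (values are illustrative only).
EXAMPLE_CSV = """study_name,datagroup,site,subject_id,ageMonths,sample,Escherichia_coli,Klebsiella_pneumoniae,Shannon_index
ExampleA_2000,CMD,SWE,subj1,4.0,s1,50.0,50.0,0.693
ExampleA_2000,CMD,SWE,subj2,9.5,s2,100.0,0.0,0.0
"""

results = []
for row in csv.DictReader(io.StringIO(EXAMPLE_CSV)):
    meta, taxa = split_row(row)
    # Relative abundances are documented to be on a 0.0 to 100.0 scale.
    assert all(0.0 <= v <= 100.0 for v in taxa.values())
    results.append((meta["sample"], float(meta["ageMonths"]), shannon(taxa.values())))
```

To run this against the real data, the in-memory example can be swapped for a file handle such as open("filtered_taxonomic_inputs.csv", newline="").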
ecs.arrow is the file containing all species-stratified functional EC profiles for the 426 prevalence-filtered Khula samples from the previously described filtered_taxonomic_inputs.csv.
- Columns are samples and rows are ECs with the format X.X.X.X: Function|Genus_species, e.g. 1.1.1.88: Hydroxymethylglutaryl-CoA reductase|s__Staphylococcus_aureus
- Metadata can be obtained by joining on ecs_metadata.csv, with column names matching the sample column
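The row label format described above can be split into its parts with plain string operations. This is a minimal sketch under stated assumptions: the ECLabel type and parse_ec_label function are hypothetical names, and stripping a leading s__ prefix is a convenience choice based on the single example given, not a documented rule of the file.

```python
from typing import NamedTuple, Optional


class ECLabel(NamedTuple):
    ec_number: str          # e.g. "1.1.1.88"
    function: str           # human-readable enzyme name
    taxon: Optional[str]    # Genus_species, or None if the row is unstratified


def parse_ec_label(label: str) -> ECLabel:
    """Split 'X.X.X.X: Function|Genus_species' into its three components."""
    head, sep, taxon = label.partition("|")
    ec_number, _, function = head.partition(": ")
    taxon_clean: Optional[str] = None
    if sep:
        taxon_clean = taxon.strip()
        # The documented example carries an 's__' prefix; drop it if present.
        if taxon_clean.startswith("s__"):
            taxon_clean = taxon_clean[3:]
    return ECLabel(ec_number.strip(), function.strip(), taxon_clean)


example = parse_ec_label(
    "1.1.1.88: Hydroxymethylglutaryl-CoA reductase|s__Staphylococcus_aureus"
)
```

Grouping parsed labels by ec_number would then allow summing the species-level rows into one total per function, if community-level values are needed.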
AgeModel_FullCV_Results.jld2 contains the results of the model training experiments with cross-validation, for full cross-cohort CV with hyperparameter grid search. Additional Leave-One-Source-Out cross-validation results are also given in .jld format, for each of the data sources mentioned in the main text: regression_Age_LeaveCMDOut.jld, regression_Age_LeaveCombineOut.jld, regression_Age_LeaveEchoOut.jld, regression_Age_LeaveGerminaOut.jld, regression_Age_LeaveKhulaOut.jld, regression_Age_LeaveM4EFADOut.jld; to be read with the JLD2.jl library and the classes from the Leap.jl Julia package available on Zenodo.
- We have also provided inputs_with_testdata.csv, which is a copy of full_taxonomic_inputs.csv with the held-out independent test samples (Ennis, 2024) [13] appended at the end. We similarly made available regression_Age_LeaveEnnisOut.jld as the test data results.
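For readers unfamiliar with the Leave-One-Source-Out scheme behind the regression_Age_Leave*Out.jld files, the splitting logic can be sketched as follows. This is an illustration of the general technique only, not the authors' Julia implementation. The function names are hypothetical, and merging CMD-DIABIMMUNE and CMD-OTHER into a single CMD group is an assumption based on the datasource description above.

```python
from collections import defaultdict


def merge_cmd(source):
    """Collapse the two CMD sub-sources into one group (assumed from the README)."""
    return "CMD" if source.startswith("CMD-") else source


def leave_one_source_out(sources):
    """Yield (held_out, train_idx, test_idx), holding out one source per fold."""
    by_source = defaultdict(list)
    for i, s in enumerate(sources):
        by_source[s].append(i)
    for held_out in sorted(by_source):
        test_idx = by_source[held_out]
        train_idx = [i for i, s in enumerate(sources) if s != held_out]
        yield held_out, train_idx, test_idx


# Illustrative per-sample datasource labels (not taken from the real files).
labels = ["ECHO-RESONANCE", "CMD-DIABIMMUNE", "CMD-OTHER",
          "1kDLEAP-KHULA", "1kDLEAP-KHULA"]
groups = [merge_cmd(s) for s in labels]
folds = list(leave_one_source_out(groups))
```

Each fold trains on every sample outside the held-out source and evaluates on that source alone, which tests how well the age model transfers to a cohort it has never seen.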
Sharing/Access information
Code for loading the data and performing analyses is available on Zenodo.
Code/Software
See the Readme and setup instructions for using the code available on Zenodo.
1. Asnicar, F. et al. Studying vertical microbiome transmission from mothers to infants by strain-level metagenomic profiling. mSystems 2, (2017).
2. Bäckhed, F. et al. Dynamics and stabilization of the human gut microbiome during the first year of life. Cell Host Microbe 17, 690–703 (2015).
3. Kostic, A. D. et al. The dynamics of the human infant gut microbiome in development and in progression toward type 1 diabetes. Cell Host Microbe 17, 260–273 (2015).
4. Pehrsson, E. C. et al. Interconnected microbiomes and resistomes in low-income human habitats. Nature 533, 212–216 (2016).
5. Shao, Y. et al. Stunted microbiota and opportunistic pathogen colonization in caesarean-section birth. Nature 574, 117–121 (2019).
6. Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).
7. Yassour, M. et al. Natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability. Sci. Transl. Med. 8, 343ra81 (2016).
8. Bonham, K. S. et al. Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children. Sci. Adv. 9, eadi0497 (2023).
9. Fatori, D. et al. Identifying biomarkers and trajectories of executive functions and language development in the first 3 years of life: design, methods and initial findings of the Germina cohort study. Preprint at https://doi.org/10.31219/osf.io/ed4fb (2024).
10. Hemmingway, A. et al. A detailed exploration of early infant milk feeding in a prospective birth cohort study in Ireland: combination feeding of breast milk and infant formula and early breast-feeding cessation. Br. J. Nutr. 124, 440–449 (2020).
11. O’Sullivan, J. et al. Alterations in gut microbiota composition, plasma lipids, and brain activity, suggest inter-connected pathways influencing malnutrition-associated cognitive and neurodevelopmental changes. Preprint at https://doi.org/10.21203/rs.3.rs-4115616/v1 (2024).
12. Zieff, M. R. et al. Characterizing developing executive functions in the first 1000 days in South Africa and Malawi: The Khula Study. Wellcome Open Research 9, (2024).
13. Ennis, D., Shmorak, S., Jantscher-Krenn, E. & Yassour, M. Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition. Nat. Commun. 15, 894 (2024).
Raw metagenomic sequence reads were processed using tools from the bioBakery suite, following already-established protocols [1]. Initially, KneadData v0.10.0 was employed with default settings to trim low-quality reads and eliminate human sequences, using the hg37 reference database. Subsequently, MetaPhlAn v3.1.0, utilizing the mpa_v31_CHOCOPhlAn_201901 database, was applied with default parameters to map microbial marker genes and generate taxonomic profiles. The taxonomic profiles, along with the same reads obtained in the initial step, were then processed with HUMAnN v3.7 to produce stratified functional profiles.