Data from: Early life microbial succession in the gut follows common patterns in humans across the globe
Data files
Jul 08, 2025 version files 7.85 GB
- AgeModel_FullCV_Results.jld2 (4.61 GB)
- ecs_metadata.csv (16.11 KB)
- ecs.arrow (815.66 MB)
- filtered_taxonomic_inputs.csv (2.83 MB)
- full_taxonomic_inputs.csv (15.41 MB)
- inputs_with_testdata.csv (2.88 MB)
- README.md (7.44 KB)
- regression_Age_LeaveCMDOut.jld (276.23 MB)
- regression_Age_LeaveCombineOut.jld (353.40 MB)
- regression_Age_LeaveEchoOut.jld (366.50 MB)
- regression_Age_LeaveEnnisOut.jld (390.57 MB)
- regression_Age_LeaveGerminaOut.jld (291.45 MB)
- regression_Age_LeaveKhulaOut.jld (345.98 MB)
- regression_Age_LeaveM4EFADOut.jld (381.74 MB)
- visit_annotation_glossary.xlsx (9.60 KB)
Abstract
Characterizing the dynamics of microbial community succession in the infant gut microbiome is crucial for understanding child health and development, but no normative model currently exists. Here, we estimate child age using gut microbial taxonomic relative abundances from metagenomes, with high temporal resolution (±3 months) for the first 1.5 years of life. Using 3,154 samples from 1,827 infants across 12 countries, we trained a random forest model, achieving a root mean square error of 2.61 months. We identified key taxonomic predictors of age, including declines in Bifidobacterium spp. and increases in Faecalibacterium prausnitzii and Lachnospiraceae. Microbial succession patterns are conserved across infants from diverse human populations, suggesting universal developmental trajectories. Functional analysis confirmed trends in key microbial genes involved in feeding transitions and dietary exposures. This model provides a normative benchmark of “microbiome age” for assessing early gut maturation that can be used alongside other measures of child development.
https://doi.org/10.5061/dryad.dbrv15f9z
This dataset contains:
- Pre-formatted selected taxonomic profiles from stool metagenomic sequencing, obtained from a combination of:
  - 8 previously published and publicly available studies: Asnicar, F. et al. (2017) [1], Backhed, F. et al. (2015) [2], Kostic, A. D. et al. (2015) [3], Pehrsson, E. et al. (2016) [4], Shao, Y. et al. (2019) [5], Vatanen, T. et al. (2016) [6], Yassour, M. et al. (2018) [7], Bonham, K. et al. (2023) [8]; and
  - 4 studies from the Wellcome Leap 1kD program: Fatori, D. et al. (2024) [9], Hemmingway, A. et al. (2020) [10], O'Sullivan, J. et al. (2024) [11] and the publication accompanying this dataset, Bottino, G. et al. (2024), which introduces the samples from the Khula study, Zieff, M. et al. (2024) [12]
- Pre-formatted selected functional EC profiles for the samples from the Khula study
Description of the data and file structure
This dataset contains four groups of files:
- `full_taxonomic_inputs.csv` contains all taxonomic profiles prior to prevalence filtering and removal of samples with no reads assigned to prevalence-filtered taxa. `filtered_taxonomic_inputs.csv` is the post-filter version of the same file, containing only the samples and features used to train the machine learning models and to perform downstream analysis. These files have the following structure:
  - `study_name`: String, study identifier by publication, e.g. `BonhamK_2023`.
  - `datagroup`: String, meta-group of samples. Options are `ECHO`, `LEAP` and `CMD`.
  - `site`: String, ISO 3166-1 alpha-3 three-letter code for the country of origin of the samples.
  - `datacolor`: String, manuscript-wide color adopted for the data source.
  - `datasource`: String, major data source for the sample. Options are `ECHO-RESONANCE`, `CMD-DIABIMMUNE`, `CMD-OTHER`, `1kDLEAP-KHULA`, `1kDLEAP-GERMINA`, `1kDLEAP-COMBINE`, `1kdLEAP-M4EFAD`. `CMD-DIABIMMUNE` and `CMD-OTHER` are combined into the single `CMD` meta category in most downstream analyses.
  - `visit`: String, nominal timepoint of stool sample collection, when available, in the original annotation format of each study. For the meaning of the acronyms in this column, refer to the `visit_annotation_glossary.xlsx` file.
  - `westernized_cat`: String, categorization of `site` as a westernized country.
  - `subject_id`: String, unique study participant identifier.
  - `ageMonths`: Numeric, age at stool sample collection, in months.
  - `sample`: String, unique sample identifier.
  - Multiple columns of taxa abundances, named by species in the form `Genus_species`, e.g. `Escherichia_coli` or `Klebsiella_pneumoniae`. Values are relative abundances on the scale `0.0:100.0`.
  - `Shannon_index`: alpha-diversity of the sample, measured as the Shannon index.
- `ecs.arrow` contains all species-stratified functional EC profiles for the 426 prevalence-filtered Khula samples from the previous `filtered_taxonomic_inputs.csv`.
  - Columns are samples and rows are ECs with the format `X.X.X.X: Function|Genus_species`, e.g. `1.1.1.88: Hydroxymethylglutaryl-CoA reductase|s__Staphylococcus_aureus`.
  - Metadata can be obtained by joining on `ecs_metadata.csv`, with column names matching the `sample` column.
- `AgeModel_FullCV_Results.jld2` contains the results of the model training experiments with cross-validation, for full cross-cohort CV with hyperparameter grid search. Additional leave-one-source-out cross-validation results are also given in `.jld` format, one for each of the data sources mentioned in the main text: `regression_Age_LeaveCMDOut.jld`, `regression_Age_LeaveCombineOut.jld`, `regression_Age_LeaveEchoOut.jld`, `regression_Age_LeaveGerminaOut.jld`, `regression_Age_LeaveKhulaOut.jld`, `regression_Age_LeaveM4EFADOut.jld`. These files are to be read with the `JLD2.jl` library and the classes from the `Leap.jl` Julia package available on Zenodo.
- We have also provided `inputs_with_testdata.csv`, which is a copy of `full_taxonomic_inputs.csv` with the held-out independent test samples (Ennis, 2024) [13] appended at the end. We similarly made available `regression_Age_LeaveEnnisOut.jld` as the test data results.
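The column layout described above can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' code, which is the Julia code on Zenodo. The helper names (`split_row`, `shannon_index`, `parse_ec_label`) and the toy CSV row are hypothetical and exist only for illustration. The sketch assumes that every column not listed as metadata is a taxon abundance column with no missing values. The natural-log base in `shannon_index` is also an assumption, so its output may not match the `Shannon_index` column exactly.

```python
import csv
import io
import math

# Metadata columns as listed in the file description; every other column
# is assumed to hold a taxon relative abundance (scale 0.0 to 100.0).
METADATA_COLUMNS = {
    "study_name", "datagroup", "site", "datacolor", "datasource", "visit",
    "westernized_cat", "subject_id", "ageMonths", "sample", "Shannon_index",
}


def split_row(row):
    """Split one CSV row (a dict) into metadata and taxon abundances."""
    meta = {k: v for k, v in row.items() if k in METADATA_COLUMNS}
    taxa = {k: float(v) for k, v in row.items() if k not in METADATA_COLUMNS}
    return meta, taxa


def shannon_index(abundances):
    """Shannon index of a sample; the natural-log base is an assumption."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


def parse_ec_label(label):
    """Parse 'X.X.X.X: Function|s__Genus_species' into its three parts."""
    ec_and_function, _, taxon = label.partition("|")
    ec_number, _, function = ec_and_function.partition(": ")
    if taxon.startswith("s__"):
        taxon = taxon[3:]
    return ec_number.strip(), function.strip(), taxon


# Toy row with made-up values, used only to exercise the helpers.
TOY_CSV = (
    "sample,subject_id,ageMonths,datasource,"
    "Escherichia_coli,Klebsiella_pneumoniae\n"
    "toy_sample_1,toy_subject_1,6.0,CMD-OTHER,50.0,50.0\n"
)

rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
meta, taxa = split_row(rows[0])
diversity = shannon_index(list(taxa.values()))
ec_parts = parse_ec_label(
    "1.1.1.88: Hydroxymethylglutaryl-CoA reductase|s__Staphylococcus_aureus"
)
```

To use this on the real files, replace `io.StringIO(TOY_CSV)` with an open handle to `filtered_taxonomic_inputs.csv`. The `parse_ec_label` helper applies to the row labels of `ecs.arrow`, which needs an Arrow reader outside the standard library.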
Sharing/Access information
Code for loading the data and performing analyses is available on Zenodo.
Code/Software
See the Readme and setup instructions for using the code available on Zenodo.
1. Asnicar, F. et al. Studying vertical microbiome transmission from mothers to infants by strain-level metagenomic profiling. mSystems 2, (2017).
2. Bäckhed, F. et al. Dynamics and stabilization of the human gut microbiome during the first year of life. Cell Host Microbe 17, 690–703 (2015).
3. Kostic, A. D. et al. The dynamics of the human infant gut microbiome in development and in progression toward type 1 diabetes. Cell Host Microbe 17, 260–273 (2015).
4. Pehrsson, E. C. et al. Interconnected microbiomes and resistomes in low-income human habitats. Nature 533, 212–216 (2016).
5. Shao, Y. et al. Stunted microbiota and opportunistic pathogen colonization in caesarean-section birth. Nature 574, 117–121 (2019).
6. Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).
7. Yassour, M. et al. Natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability. Sci. Transl. Med. 8, 343ra81 (2016).
8. Bonham, K. S. et al. Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children. Sci Adv 9, eadi0497 (2023).
9. Fatori, D. et al. Identifying biomarkers and trajectories of executive functions and language development in the first 3 years of life: design, methods and initial findings of the Germina cohort study. Preprint at https://doi.org/10.31219/osf.io/ed4fb (2024).
10. Hemmingway, A. et al. A detailed exploration of early infant milk feeding in a prospective birth cohort study in Ireland: combination feeding of breast milk and infant formula and early breast-feeding cessation. Br. J. Nutr. 124, 440–449 (2020).
11. O’Sullivan, J. et al. Alterations in gut microbiota composition, plasma lipids, and brain activity, suggest inter-connected pathways influencing malnutrition-associated cognitive and neurodevelopmental changes. Preprint at https://doi.org/10.21203/rs.3.rs-4115616/v1 (2024).
12. Zieff, M. R. et al. Characterizing developing executive functions in the first 1000 days in South Africa and Malawi: The Khula Study. Wellcome Open Research 9, (2024).
13. Ennis, D., Shmorak, S., Jantscher-Krenn, E. & Yassour, M. Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition. Nat. Commun. 15, 894 (2024).
Raw metagenomic sequence reads were processed using tools from the bioBakery suite, following already-established protocols [1]. Initially, KneadData v0.10.0 was employed with default settings to trim low-quality reads and eliminate human sequences, using the hg37 reference database. Subsequently, MetaPhlAn v3.1.0, utilizing the mpa_v31_CHOCOPhlAn_201901 database, was applied with default parameters to map microbial marker genes and generate taxonomic profiles. The taxonomic profiles, along with the same reads obtained in the initial step, were then processed with HUMAnN v3.7 to produce stratified functional profiles.
