Data from: Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children

Cite this dataset

Bonham, Kevin; Klepac-Ceraj, Vanja; Fahur Bottino, Guilherme (2024). Data from: Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children [Dataset]. Dryad.


The gastrointestinal tract, its resident microorganisms, and the central nervous system are connected by biochemical signaling, also known as the "microbiome-gut-brain-axis." Both the human brain and the gut microbiome have critical developmental windows in the first years of life, raising the possibility that their development is co-occurring and likely co-dependent. Emerging evidence implicates gut microorganisms and microbiota composition in cognitive outcomes and neurodevelopmental disorders (e.g., autism and anxiety), but the influence of gut microbial metabolism on typical neurodevelopment has not been explored in detail. We investigated the relationship of the microbiome with the neuroanatomy and cognitive function of 381 healthy children, demonstrating that differences in gut microbial taxa and gene functions are associated with overall cognitive function and with differences in the size of multiple brain regions. Using a combination of multivariate linear and machine learning (ML) models, we showed that many species, including Alistipes obesi and Blautia wexlerae, were associated with higher cognitive function, while some species such as Ruminococcus gnavus were more commonly found in children with low cognitive scores after controlling for sociodemographic factors. Microbial genes for enzymes involved in the metabolism of neuroactive compounds, particularly short-chain fatty acids such as acetate and propionate, were also associated with cognitive function. In addition, ML models were able to use microbial taxa to predict the volume of brain regions, and many taxa that were identified as important in predicting cognitive function also dominated the feature importance metric for individual brain regions, and for specific subscales of cognitive function. For example, B. wexlerae was the most important species in models predicting the size of the parahippocampal region in both the left and right hemispheres and was among the top predictors of gross motor and expressive language performance. Several species from the phylum Bacteroidetes, including GABA-producing Bacteroides ovatus, were important for predicting the size of the left accumbens area, but not the right. These findings provide potential biomarkers of neurocognition and brain development and may lead to the future development of targets for early detection and early intervention.

README: Data associated with "Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children"

Taxonomic and functional profiles from stool metagenomic sequencing from RESONANCE, an accelerated longitudinal cohort of child cognitive development

Description of the data and file structure

These data tables are used as inputs for the analysis in "Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children", published in Science Advances.

Files are in comma-separated value (.csv) or Apache Arrow (.arrow) format. Julia code for loading and analyzing these data can be found in the accompanying GitHub repository (archived on Zenodo).

Microbial data

 All *.arrow files have the following headers:

  • sample (String): unique identifier for a particular stool sample. Can be joined to subject/timepoint metadata in complete_filtered_dataset.csv using the latter's omni column (see below).
  • sidx (Int): sample (column) index used for populating a sparse matrix representation
  • feature (String): microbial feature (e.g., taxon or UniRef90 ID)
  • fidx (Int): feature (row) index used for populating a sparse matrix representation
  • value (Float): relative abundance value

The individual files are:

  • taxa.arrow - Taxonomic profiles generated by MetaPhlAn; features are taxa. Every taxonomic level from kingdom to species is included, and the abundances at each level should sum to 100.0 for each sample.
  • genefamilies.arrow - Functional profiles generated by HUMAnN; features are UniRef90s.
  • ecs.arrow, pfams.arrow, kos.arrow - Stratified functional profiles regrouped from genefamilies.arrow into Enzyme Commission (EC), Protein Families (Pfam), or KEGG Orthology (KO) labels, respectively, using the humann_regroup_tables utility from the HUMAnN software.
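
The long-format columns described above are enough to rebuild a feature-by-sample abundance matrix. The following is a minimal sketch in Python, not the authors' Julia code. It assumes the rows of an .arrow file have already been read into dictionaries (reading Arrow files needs a third-party library, which is not shown), and the sample and feature names are hypothetical. It stores the matrix as a dictionary keyed by (fidx, sidx), so it makes no assumption about whether the indices start at 0 or 1.

```python
from collections import defaultdict


def build_sparse(rows):
    """Build a dictionary-of-keys sparse matrix from long-format rows.

    Each row is a mapping with the keys sample, sidx, feature, fidx, value.
    Returns (matrix, features, samples): matrix maps (fidx, sidx) to the
    abundance, while features and samples map each index to its name.
    """
    matrix, features, samples = {}, {}, {}
    for row in rows:
        fidx, sidx = int(row["fidx"]), int(row["sidx"])
        features[fidx] = row["feature"]
        samples[sidx] = row["sample"]
        matrix[(fidx, sidx)] = float(row["value"])
    return matrix, features, samples


def column_sums(matrix):
    """Sum the stored abundances for each sample (column) index."""
    totals = defaultdict(float)
    for (_, sidx), value in matrix.items():
        totals[sidx] += value
    return dict(totals)


# Hypothetical rows, for illustration only (not taken from the dataset).
rows = [
    {"sample": "S1", "sidx": 1, "feature": "taxon_A", "fidx": 1, "value": 60.0},
    {"sample": "S1", "sidx": 1, "feature": "taxon_B", "fidx": 2, "value": 40.0},
    {"sample": "S2", "sidx": 2, "feature": "taxon_A", "fidx": 1, "value": 100.0},
]
matrix, features, samples = build_sparse(rows)
totals = column_sums(matrix)
```

For taxa.arrow, rows would first need to be restricted to a single taxonomic level before a per-sample sum of 100.0 can be expected. How levels are encoded in the feature column is not stated in this README.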

Other data

CSV files contain additional metadata. Missing values are represented by blank cells, which are read by CSV.jl as missing values. See below for additional details.
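
Readers not using CSV.jl need to handle blank cells themselves. The sketch below shows one way to do so with Python's standard csv module, mapping blank cells to None. The column names come from this README, but the values are hypothetical.

```python
import csv
import io


def read_csv_with_missing(handle):
    """Read CSV rows as dictionaries, mapping blank cells to None."""
    reader = csv.DictReader(handle)
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical content, for illustration only; the second row has a blank omni cell.
text = "subject,timepoint,omni\n1,1,SEQ0001\n2,1,\n"
records = read_csv_with_missing(io.StringIO(text))
```

With a real file, `io.StringIO(text)` would be replaced by an open file handle, for example `open("complete_filtered_dataset.csv", newline="")`.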

Subject and biospecimen metadata

The file complete_filtered_dataset.csv contains subject- and timepoint-associated metadata for all stool samples.

Subjects are referred to by unique IDs (Integers) in the subject column, and each individual visit is referred to in the timepoint column. Together, these two fields can be used to uniquely join to other data tables (e.g., the brain data below).

ECHO-coded timepoints, which refer to the children's developmental stage (e.g., EC02 for the second "early childhood" stage), are found in the ECHOTPCoded column.

Columns beginning with filter_ are boolean columns that identify the rows used in each cohort analysis in the manuscript. For example, subsetting on rows where filter_00to120 is true will provide all of the rows used in the cohort of children from birth to 120 months.
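
Subsetting on a filter_ column can be sketched as follows in Python. This assumes the boolean cells are written as the text "true" or "false", which is an assumption about the file encoding and not something this README states. Blank cells are treated as false, and the example rows are hypothetical.

```python
def is_true(cell):
    """Interpret a CSV cell as a boolean; blank or missing cells count as False.

    Assumes true values are spelled "true" (case-insensitive).
    """
    return cell is not None and cell.strip().lower() == "true"


def subset_cohort(records, filter_column):
    """Keep only the rows whose filter column is true."""
    return [row for row in records if is_true(row.get(filter_column))]


# Hypothetical rows, for illustration only.
records = [
    {"subject": "1", "timepoint": "1", "filter_00to120": "true"},
    {"subject": "2", "timepoint": "1", "filter_00to120": "false"},
    {"subject": "3", "timepoint": "2", "filter_00to120": ""},
]
cohort = subset_cohort(records, "filter_00to120")
```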

The omni column contains biospecimen IDs (they were collected in OmniGene buffer) which can be used to map metadata rows to microbial data found in the arrow files referred to above.
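
The mapping between the omni column and the sample column of the arrow files can be sketched as a dictionary lookup. The Python below is illustrative only: the rows are hypothetical, and it assumes both tables have already been read into lists of dictionaries.

```python
def index_by_omni(metadata):
    """Map each non-missing omni ID to its metadata row."""
    return {row["omni"]: row for row in metadata if row.get("omni")}


def attach_metadata(microbial_rows, metadata):
    """Add subject and timepoint to each microbial row with a matching omni ID.

    Microbial rows without a matching metadata row are dropped (inner join).
    """
    by_omni = index_by_omni(metadata)
    joined = []
    for row in microbial_rows:
        meta = by_omni.get(row["sample"])
        if meta is not None:
            joined.append(
                {**row, "subject": meta["subject"], "timepoint": meta["timepoint"]}
            )
    return joined


# Hypothetical rows, for illustration only.
metadata = [
    {"subject": "1", "timepoint": "1", "omni": "SEQ0001"},
    {"subject": "2", "timepoint": "1", "omni": None},
]
microbial_rows = [
    {"sample": "SEQ0001", "feature": "taxon_A", "value": 60.0},
    {"sample": "SEQ0009", "feature": "taxon_A", "value": 10.0},
]
joined = attach_metadata(microbial_rows, metadata)
```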

This table contains many additional columns. Please refer to the code repository (below) for descriptions of the remaining headers and how they are used in the analysis.

Neuroimaging data

The file brain_normalized.csv contains the neuroimaging data used in the manuscript: brain volume segmentation data normalized to total brain volume (TBV). It contains the following headers:

Subject / timepoint info:

  • subject
  • timepoint

Non-hemisphere-specific brain regions:

  • 3rd-ventricle
  • 4th-ventricle
  • Brain-stem
  • CSF
  • cerebellar-vermal-lobules-I-V
  • cerebellar-vermal-lobules-VI-VII
  • cerebellar-vermal-lobules-VIII-X
  • Gray-matter
  • White-matter

Hemisphere-specific brain regions. Each of the following appears twice, prefixed with right- or left-:

  • lateral-ventricle
  • inferior-lateral-ventricle
  • cerebellum-exterior
  • cerebellum-white-matter
  • thalamus-proper
  • caudate
  • putamen
  • pallidum
  • hippocampus
  • amygdala
  • accumbens-area
  • ventral-DC
  • basal-forebrain
  • caudal-anterior-cingulate
  • caudal-middle-frontal
  • cuneus
  • entorhinal
  • fusiform
  • inferior-parietal
  • inferior-temporal
  • isthmus-cingulate
  • lateral-occipital
  • lateral-orbitofrontal
  • lingual
  • medial-orbitofrontal
  • middle-temporal
  • parahippocampal
  • paracentral
  • pars-opercularis
  • pars-orbitalis
  • pars-triangularis
  • pericalcarine
  • postcentral
  • posterior-cingulate
  • precentral
  • precuneus
  • rostral-anterior-cingulate
  • rostral-middle-frontal
  • superior-frontal
  • superior-parietal
  • superior-temporal
  • supramarginal
  • transverse-temporal
  • insula
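
As an illustration of how these headers fit together, the Python sketch below builds the left and right column names for a region and joins brain rows to metadata rows on the (subject, timepoint) pair. The values are hypothetical, the exact spelling of the prefixed column names (for example left-hippocampus) is inferred from the description above, and both tables are assumed to have been read into lists of dictionaries.

```python
def hemisphere_columns(region):
    """Return the left and right column names for a hemisphere-specific region."""
    return [f"left-{region}", f"right-{region}"]


def join_on_visit(metadata, brain):
    """Inner-join two lists of row dictionaries on the (subject, timepoint) pair."""
    brain_by_visit = {(row["subject"], row["timepoint"]): row for row in brain}
    joined = []
    for row in metadata:
        match = brain_by_visit.get((row["subject"], row["timepoint"]))
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical values, for illustration only.
metadata = [
    {"subject": "1", "timepoint": "1", "omni": "SEQ0001"},
    {"subject": "2", "timepoint": "1", "omni": "SEQ0002"},
]
brain = [
    {
        "subject": "1",
        "timepoint": "1",
        "left-hippocampus": "0.003",
        "right-hippocampus": "0.003",
    },
]
joined = join_on_visit(metadata, brain)
columns = hemisphere_columns("hippocampus")
volumes = [float(joined[0][name]) for name in columns]
```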

Sharing/Access information

Code for loading the data and performing analyses is available on GitHub and archived on Zenodo:


See the README and setup instructions on GitHub for using the code.


Office of the Director, Award: UG/H3OD023313