Multiomics characterization of acute child illness and mortality in Africa and South Asia
Data files
Jan 06, 2026 version files (97.30 KB)
- packages.csv (3.33 KB)
- README.md (8.89 KB)
- scripts.zip (85.08 KB)
Feb 10, 2026 version files (97.33 KB)
- packages.csv (3.36 KB)
- README.md (8.89 KB)
- scripts.zip (85.08 KB)
Abstract
Childhood illnesses from infectious diseases in low- and middle-income countries contribute substantially to global under-five mortality. Many hospitalised children experience incomplete recovery, readmission, and post-discharge mortality despite guideline-directed care. However, targeted interventions remain elusive due to a limited understanding of underlying mechanisms. In this work, we employ multiomic profiling and multivariate modeling to investigate biological drivers of inpatient and post-discharge mortality in 3,101 acutely ill children across nine sites in sub-Saharan Africa and South Asia. In a nested case-cohort (N=1,008), we generate plasma proteomics, serum metabolomics and lipidomics, stool metagenomics, and fecal pathogen data at admission and discharge. Additionally, we profile 270 geographically matched community children for biological baselines. We identify a generalizable mortality signature marked by immune, inflammatory, and metabolic dysregulation with gut dysbiosis. We show that mortality-associated signals persist from admission through discharge, indicating unresolved disease, and that malnourished children show greater baseline perturbations, explaining elevated risk. We also find that some children with low clinical severity display high predicted mortality risk from targeted biomarkers. Finally, we distill predictive models to a clinically feasible biomarker panel and validate our findings in an independent cohort (N=100). By linking inpatient and post-discharge mortality to specific biological mechanisms, our findings highlight why current care can fail and demonstrate how biomarker-guided risk stratification can identify vulnerable children currently missed by clinical assessments, enabling targeted interventions to reduce mortality in low- and middle-income countries.
Data Accessibility Statement
Due to the sensitive nature of the data collected for this study, access to the underlying data is restricted and overseen by the relevant Data Governance Committee. The dataset and detailed access instructions are available via the Harvard Dataverse (https://doi.org/10.7910/DVN/X6FAGX). The code files included in this submission are compatible with the CC0 waiver required by Dryad for publication; however, the input files required to run the code are not included due to data sensitivity constraints. If there are any questions or concerns regarding data or code availability, readers are encouraged to contact the corresponding author of this dataset.
Scripts
The scripts used to analyze the data from this study and to generate all the results and figures presented in the manuscript are contained in the scripts.zip file and are described below. The scripts should be run in the sequence presented.
packages.csv
List of packages and version numbers for all packages needed for reproduction of the analyses in the manuscript.
figure_0.1_install_packages.R
Sets up the project environment using the renv package, ensuring Bioconductor version 3.17 and any specific package versions needed for the analysis.
figure_0.2_helper_functions.R
A set of custom helper functions needed across the analysis, including model training with XGBoost, pathway analysis, model performance assessment, etc.
figure_1.1_dataset_description.R
Generates descriptive statistics for the number of patients per site and their relationship with malnutrition strata. Also generates descriptive statistics for the primary clinical and multiomic feature sets.
figure_1.2_plotting.R
The visualization script for Figure 1. This script loads and processes multi-cohort patient and omics data, then generates bar plots and alluvial diagrams to visualize participant composition, outcomes, malnutrition strata, and dataset size/modularity across cohorts.
figure_2.1_survival_model.R
Executes the predictive workflow for the admission cohort (A0), training individual modality models and an integrated multiomic model while extracting "Gain" scores to rank feature importance. Calculates performance metrics for models trained.
figure_2.1.1_survival_model_site_analysis.R
Evaluates model generalizability through leave-one-site-out cross-testing and uses mixed-effects modeling and meta-regression to identify features with consistent effects across all nine study sites.
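The leave-one-site-out idea can be illustrated with a minimal base R sketch. This is not the authors' script: the data are simulated, the column names (`site`, `biomarker`, `died`) are hypothetical, and a logistic regression stands in for the XGBoost models so that the example runs without extra packages.

```r
set.seed(1)
# Hypothetical data: 9 sites, one simulated biomarker, binary mortality outcome.
n <- 450
dat <- data.frame(site = rep(paste0("site", 1:9), each = 50),
                  biomarker = rnorm(n))
dat$died <- rbinom(n, 1, plogis(-1 + 1.2 * dat$biomarker))

# AUROC computed from the rank-sum (Mann-Whitney) identity.
auroc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hold out each site in turn, train on the other eight, test on the held-out site.
loso_auc <- sapply(unique(dat$site), function(s) {
  train <- dat[dat$site != s, ]
  test  <- dat[dat$site == s, ]
  # Assumption: logistic regression is used here only as a stand-in for XGBoost.
  fit <- glm(died ~ biomarker, data = train, family = binomial)
  auroc(predict(fit, newdata = test, type = "response"), test$died)
})

print(round(loso_auc, 2))
```

A similar AUROC across held-out sites would suggest that the signal is not driven by any single site.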
figure_2.1.2_survival_model_across_time.R
Slices survival data into Early, Mid, and Late windows (0-3d, 3-14d, >14d) to determine if admission biological signals remain predictive of late-occurring mortality.
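As an illustration of the windowing step, the sketch below bins hypothetical times to death with base R's `cut()`. The variable names are invented, and the handling of boundary days (day 3 counted as Early, day 14 as Mid) is an assumption, since the description does not say which window the boundaries belong to.

```r
# Hypothetical example data: days from admission to death (NA = survived follow-up).
days_to_death <- c(1, 3, 7, 14, 30, NA)

# Assumption: intervals are right-closed, so day 3 is "Early" and day 14 is "Mid".
death_window <- cut(
  days_to_death,
  breaks = c(0, 3, 14, Inf),
  labels = c("Early", "Mid", "Late"),
  include.lowest = TRUE  # keeps day 0 inside the first window
)

print(table(death_window, useNA = "ifany"))
```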
figure_2.1.3_survival_model_external_factors.R
Correlates multiomic risk scores with "domain scores" such as Caregiver Characteristics and Access to Healthcare, and calculates Hedges' g for specific clinical signs to see how biological risk aligns with physical symptoms.
figure_2.2_feature_deviations.R
Identifies time-specific biological markers by running Wilcoxon tests and Hedges' g calculations specifically for patients who died within discrete timeframes (0-3d, 3-14d, >14d).
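The two statistics named above can be sketched in base R as follows. The `hedges_g` helper and the simulated groups are hypothetical and are not taken from scripts.zip. The sketch assumes an unpaired (rank-sum) Wilcoxon test and the common approximate small-sample correction for Hedges' g.

```r
# Hedges' g: standardized mean difference with a small-sample bias correction.
hedges_g <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  d <- (mean(x) - mean(y)) / pooled_sd        # Cohen's d
  correction <- 1 - 3 / (4 * (nx + ny) - 9)   # approximate correction factor
  d * correction
}

set.seed(1)
# Hypothetical feature values: deaths within one time window vs. survivors.
died_in_window <- rnorm(20, mean = 1)
survivors      <- rnorm(60, mean = 0)

g <- hedges_g(died_in_window, survivors)
# Assumption: groups are independent, so the unpaired rank-sum test is used.
p <- wilcox.test(died_in_window, survivors)$p.value

print(c(hedges_g = g, wilcoxon_p = p))
```

Repeating this per feature and per window (0-3d, 3-14d, >14d) would give one effect size and one p-value for each feature-window pair. Correcting the p-values for multiple testing would be a natural next step, though the description does not specify one.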
figure_2.3_omic_distances.R
Quantifies biological "dysregulation" by scaling patient data against a community baseline (CP) and calculating the L2 norm (Euclidean distance) for each patient’s omic profile relative to healthy community children.
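A minimal sketch of this distance calculation is given below. It assumes that "scaling" means z-scoring each feature with the community mean and standard deviation, which the description does not state explicitly. All matrices and names are simulated placeholders.

```r
# Hypothetical matrices: rows are children, columns are omic features.
set.seed(1)
feature_names <- paste0("feat", 1:4)
community <- matrix(rnorm(50 * 4), nrow = 50,
                    dimnames = list(NULL, feature_names))
patients  <- matrix(rnorm(10 * 4, mean = 1), nrow = 10,
                    dimnames = list(NULL, feature_names))

# Assumption: each feature is z-scored using community (CP) statistics only.
cp_mean <- colMeans(community)
cp_sd   <- apply(community, 2, sd)
z <- sweep(sweep(patients, 2, cp_mean, "-"), 2, cp_sd, "/")

# L2 norm of each patient's z-score vector: distance from the community centroid.
omic_distance <- sqrt(rowSums(z^2))

print(round(omic_distance, 2))
```

Under this scaling, a child whose profile equals the community mean has distance 0, and larger values indicate greater departure from the community baseline.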
figure_2.4_pathway_analysis.R
Prepares data for biological interpretation, including proteomic, lipidomic, and metabolomic pathway analysis.
figure_2.4.1_metabolomics_validation.R
Performs a direct comparison between the discovery cohort's metabolomic fold-changes and a secondary study (Wen et al., 2022) to confirm the robustness of metabolic mortality signals.
figure_2.5_minimal_model.R
Uses a bootstrapping approach and XGBoost importance rankings to identify a minimal set of features (10 omic markers plus anthropometry) that retain the predictive power of the full multiomic set.
figure_2.6_plotting.R
The visualization script for Figure 2. This script evaluates and visualizes multi-omic survival prediction models, comparing performance across datasets, sites, time windows, and model types using ROC/PR curves, effect sizes, and pathway enrichment analyses.
figure_3.1_clinical_vs_omics.R
Defines risk subgroups based on the "error" (difference) between integrated multiomic scores and clinical scores and characterizes these subgroups.
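The subgrouping logic might look roughly like the sketch below. The scores, the cutoff of 0.3, and the labels "Clinical Risk Not Biologically Reflected" and "Concordant" are all hypothetical. Only "Biological Risk Not Clinically Reflected" is a subgroup name mentioned in this documentation, and the actual thresholds used by the script are not described here.

```r
# Hypothetical risk scores on a common 0-1 scale.
omic_score     <- c(0.90, 0.20, 0.50, 0.10)
clinical_score <- c(0.30, 0.25, 0.45, 0.70)

# "Error" = integrated multiomic score minus clinical score.
score_error <- omic_score - clinical_score

# Assumption: the cutoff is illustrative only, not the value used in the study.
cutoff <- 0.3
subgroup <- ifelse(
  score_error > cutoff, "Biological Risk Not Clinically Reflected",
  ifelse(score_error < -cutoff, "Clinical Risk Not Biologically Reflected",
         "Concordant")
)

print(data.frame(omic_score, clinical_score, score_error, subgroup))
```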
figure_3.2_minimal_model.R
Trains specialized predictive models to distinguish between biological and clinical risk categories, identifying the specific biomarkers that signal high risk when clinical exams appear normal.
figure_3.3_omic_discrepancy.R
Computes an "Omic Discrepancy" score (standard deviation across modality scores per patient) to identify individuals with highly discordant biological signals across different systems and characterizes these subgroups.
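The score described above reduces to a row-wise standard deviation. The sketch below uses invented per-modality scores and assumes they are already on a comparable scale.

```r
# Hypothetical per-modality mortality risk scores (rows = patients).
scores <- data.frame(
  proteomics   = c(0.80, 0.10, 0.50),
  metabolomics = c(0.75, 0.15, 0.10),
  lipidomics   = c(0.85, 0.05, 0.90),
  row.names = c("pt1", "pt2", "pt3")
)

# Omic discrepancy: standard deviation across modality scores within each patient.
omic_discrepancy <- apply(scores, 1, sd)

print(omic_discrepancy)
```

Here pt1 and pt2 have consistent scores across modalities (one uniformly high, one uniformly low), while pt3 has strongly discordant scores and therefore the largest discrepancy.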
figure_3.4_plotting.R
The visualization script for Figure 3. This script analyzes clinical and multi-omic risk prediction in an admission cohort by comparing clinical, integrated, and minimal models. It generates figures showing risk score agreement, survival curves, model performance (AUROC/AUPRC), omic discrepancy, and pathway enrichment. It also identifies key clinical and omic features associated with outcomes using statistical testing and heatmap visualizations.
figure_4.1_survival_model_discharge.R
Replicates the integrated modeling and biological distance calculations for the discharge (D0) cohort to predict post-discharge mortality. Calculates performance metrics for models trained.
figure_4.2_clinical_vs_omics_discharge.R
Defines risk subgroups based on the difference between integrated multiomic scores and clinical scores at discharge and characterizes these subgroups.
figure_4.3_omic_discrepancy_discharge.R
Computes an "Omic Discrepancy" score to identify individuals at discharge with highly discordant biological signals across different systems and characterizes these subgroups.
figure_4.4_admission_vs_discharge_model.R
Tests model reciprocity by training an admission model to predict discharge outcomes and vice versa to assess the stability of the mortality signature over time.
figure_4.5_hospitalization_model.R
Focuses on longitudinal change by calculating "Delta" values (D0 - A0) for all multiomic features and testing their association with mortality with predictive modeling.
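The delta calculation can be sketched as below, assuming admission (A0) and discharge (D0) measurements are stored as matrices with patient IDs as row names. The matrices are toy placeholders. The alignment step matters because subtracting matrices whose rows are in different orders would silently pair the wrong patients.

```r
# Hypothetical feature matrices: same patients and features, different row order.
A0 <- matrix(c(1.0, 2.0, 3.0, 4.0), nrow = 2,
             dimnames = list(c("pt1", "pt2"), c("featA", "featB")))
D0 <- matrix(c(1.5, 1.0, 3.0, 5.0), nrow = 2,
             dimnames = list(c("pt2", "pt1"), c("featA", "featB")))

# Align discharge rows and columns to the admission matrix before subtracting.
D0_aligned <- D0[rownames(A0), colnames(A0), drop = FALSE]

# Delta = discharge minus admission, per patient and feature.
delta <- D0_aligned - A0

print(delta)
```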
figure_4.6_plotting.R
The visualization script for Figure 4.
figure_5.1_malnutrition_strata_analysis.R
Investigates whether malnutrition-related mortality risk is mediated by the same multiomic signals or whether different biological drivers exist for children at different levels of malnutrition.
figure_5.2_age_group_analysis.R
Investigates whether age-related risk is mediated by the same multiomic signals or whether different biological drivers exist for children of different ages.
figure_5.3_plotting.R
The visualization script for Figure 5. This script examines how age and malnutrition modify multi-omic survival signals by performing mediation, stratified performance, and discrepancy analyses across admission and discharge cohorts. It visualizes how these characteristics influence model performance, omic distances, and feature effect variability.
figure_6.1_survival_model_validation.R
Validates the trained multiomic and minimal models on the independent validation (V0) cohort.
figure_6.2_clinical_vs_omics_validation.R
Validates the risk subgrouping logic (e.g., Biological Risk Not Clinically Reflected) in the validation cohort.
figure_6.3_plotting.R
The visualization script for Figure 6. This script evaluates and validates proteomic and clinical survival risk models across discovery and validation cohorts, comparing predictive performance, feature consistency, and omic distances. It generates figures assessing model discrimination, calibration, and biological agreement between cohorts.
Requirements
- R version 4.3.1
Human subjects data
All caregivers provided written informed consent for their child to participate in the study and for the data produced to be accessible after de-identification. Caregivers were assured that any report or publications on this study would not use participants’ names or identities. They agreed to information being shared with other researchers in ways that do not reveal individual participants’ identities. Any information that could identify people, such as their names and general locations, was replaced with number codes.
The study population consisted of a nested case-cohort (NCC) study (the discovery cohort) within the Childhood Acute Illness and Nutrition (CHAIN) Network cohort. The CHAIN study recruited 3,101 children from nine sites across six countries: Bangladesh, Burkina Faso, Kenya, Malawi, Pakistan, and Uganda. Children were stratified by nutritional status using mid-upper arm circumference (MUAC) during enrolment at hospital admission, and followed up for 180 days after discharge. Geographically-matched community participants were included as a comparison group. The study was approved by the institutional review boards of all partner sites. The discovery cohort consisted of a random 24% sub-cohort of children stratified by site, including 658 survivors (non-cases) and 109 deaths (cases). Additionally, all remaining deaths (241 cases) not included in the random sub-cohort were added, resulting in a total of 350 cases. Another 30 randomly selected community participants from each site (a total of 270) were also included.
Collection and processing of all sample types were performed according to harmonized operating procedures at all study sites. Samples were collected at admission, discharge, and follow-up, and included stool, fecal swabs, whole blood, serum, plasma, and dry blood spots. Sample processing occurred under cold-chain conditions before transfer to the KEMRI/Wellcome Trust Research Programme biorepository in Kilifi, Kenya. Proteomic features were generated using the SomaScan aptamer-based assay in plasma. Serum metabolomic and lipidomic features were generated using targeted and untargeted mass spectrometry techniques, respectively. Metagenomic features were generated through sequencing of DNA from stool samples. TaqMan Array Card (TAC) features were generated from nucleic acids extracted from fecal swabs.
Changes after Jan 6, 2026:
Updated packages.csv with two additional packages that were used to generate Figure 1A.
