Discovery of sparse, reliable omic biomarkers with Stabl

Hédou, Julien; Marić, Ivana; Bellan, Grégoire; Einhaus, Jakob; Gaudillière, Brice

Published Oct 12, 2023 on Dryad. https://doi.org/10.5061/dryad.stqjq2c7d

Data files

Oct 12, 2023 version, 19.91 MB in total:

data.zip (19.90 MB)
README.md (7.82 KB)

Abstract

Adoption of high-content omic technologies in clinical studies, coupled with computational methods, has yielded an abundance of candidate biomarkers. However, translating such findings into bona fide clinical biomarkers remains challenging. To facilitate this process, we introduce Stabl, a general machine learning framework that identifies a sparse, reliable set of biomarkers by integrating noise injection and a data-driven signal-to-noise threshold into multivariable predictive modeling. Evaluation of Stabl on synthetic datasets and five independent clinical studies demonstrates improved biomarker sparsity and reliability compared to commonly used sparsity-promoting regularization methods while maintaining predictive performance; it distills datasets containing 1,400 to 35,000 features down to 4 to 34 candidate biomarkers. Stabl extends to multi-omic integration tasks, enabling biological interpretation of complex predictive models, as it homes in on a shortlist of proteomic, metabolomic, and cytometric events predicting labor onset, microbial biomarkers of preterm birth, and a pre-operative immune signature of post-surgical infections.

This is a scikit-learn compatible Python implementation of Stabl, together with helper functions and example notebooks for rerunning the analyses on the different use cases. The use-case data are located in the Sample Data folder of the code library and in the data.zip archive of this repository.
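The two ingredients named in the abstract, noise injection and a data-driven threshold, can be illustrated with a deliberately simplified sketch. This is not the Stabl implementation or its API. The base selector is a top-k absolute-correlation filter standing in for a sparse regression model, the artificial features are plain column permutations, the threshold criterion (1 + number of artificial features selected) / (number of original features selected) is an assumed surrogate for the false discovery proportion, and every name (`selection_frequencies`, `pick_threshold`) is hypothetical.

```python
import numpy as np

def selection_frequencies(X, y, n_boot=50, top_k=5, seed=0):
    """Selection frequency of each original column and of its permuted copy."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Noise injection: permuting a column breaks its link to the outcome.
    X_noise = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    X_all = np.hstack([X, X_noise])
    counts = np.zeros(2 * p)
    for _ in range(n_boot):
        idx = rng.choice(n, size=n // 2, replace=False)  # random half-sample
        Xs = X_all[idx] - X_all[idx].mean(axis=0)
        ys = y[idx] - y[idx].mean()
        # Stand-in selector (assumption): top_k columns by |correlation| with y.
        denom = np.linalg.norm(Xs, axis=0) * np.linalg.norm(ys) + 1e-12
        score = np.abs(Xs.T @ ys) / denom
        counts[np.argsort(score)[-top_k:]] += 1
    freqs = counts / n_boot
    return freqs[:p], freqs[p:]

def pick_threshold(freq_real, freq_noise):
    """Frequency threshold minimizing (1 + #noise kept) / max(1, #real kept)."""
    grid = np.linspace(0.05, 1.0, 20)
    ratios = [
        (1 + np.sum(freq_noise >= t)) / max(1, np.sum(freq_real >= t))
        for t in grid
    ]
    return float(grid[int(np.argmin(ratios))])

# Synthetic demo: only the first three of 30 features carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(size=200)

freq_real, freq_noise = selection_frequencies(X, y)
threshold = pick_threshold(freq_real, freq_noise)
selected = np.flatnonzero(freq_real >= threshold)
print("threshold:", threshold, "selected columns:", selected)
```

The point of the permuted columns is that they are known to be uninformative, so how often they get selected gives an empirical reference for where to cut the selection frequencies of the real features.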

Requirements

Python version: 3.7 through 3.10

Python packages:

joblib == 1.1.0
tqdm == 4.64.0
matplotlib == 3.5.2
numpy == 1.23.1
cmake == 3.27.1
knockpy == 1.2
scikit-learn == 1.1.2
seaborn == 0.12.0
groupyr == 0.3.2
pandas == 1.4.2
statsmodels == 0.14.0
openpyxl == 3.0.7
adjustText == 0.8
scipy == 1.10.1
julia == 0.6.1
osqp == 0.6.2
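Because the pins above are exact, it can help to compare them against the active environment before running the notebooks. The following is a minimal sketch using only the standard library. The helper names are hypothetical, the pin list is truncated for brevity, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata

# A few of the pins listed above; extend with the remaining lines as needed.
PINNED = """
joblib == 1.1.0
numpy == 1.23.1
scikit-learn == 1.1.2
pandas == 1.4.2
"""

def parse_pins(text):
    """Turn 'name == version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        if "==" in line:
            name, version = (part.strip() for part in line.split("==", 1))
            pins[name] = version
    return pins

def find_mismatches(pins):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(parse_pins(PINNED)).items():
        print(f"{name}: pinned {wanted}, found {installed}")
```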

Julia packages for noise generation (Julia version 1.9.2):

Bigsimr == 0.8.7
Distributions == 0.25.98
PyCall == 1.96.1

Installation

Julia installation

To install Julia, please follow these instructions:

Download Julia from here.
Follow the instructions for your operating system here.
Install the required Julia packages:

julia -e 'using Pkg; Pkg.add(name="Bigsimr", version="0.8.7"); Pkg.add(name="Distributions", version="0.25.98"); Pkg.add(name="PyCall", version="1.96.1"); Pkg.add("IJulia")'

Finally, install the julia package for Python:

pip install julia
python -c "import julia; julia.install()"

CMake installation

To install the Python libraries required to generate the noise, CMake must be installed first (v3.27.4 on macOS).

You can install it either by:

using the default system package manager, like on this website
following the instructions on CMake.

Python installation (>= 3.7 and < 3.11)

Install directly from GitHub:

pip install git+https://github.com/gregbellan/Stabl.git
pip install numpy==1.23.2

Or download Stabl and install it locally:

git clone https://github.com/gregbellan/Stabl.git

Install requirements and Stabl:

cd Stabl
pip install .
pip install numpy==1.23.2

Installation typically takes less than 10 seconds and has been tested on macOS and Linux.

NOTE: The Julia library has two known quirks:

You can run the scripts in a notebook, but you need to run the import block twice. The first run throws an error and the second completes the import.
If the library was installed with conda, the scripts cannot be run from the command line.

To work around this, either install the library without conda or run the scripts in a notebook. If Julia still causes problems in a notebook, run the following in the first cell of the notebook:

from julia.api import Julia
jl = Julia(compiled_modules=False)

Use of the library

To use the library and the associated benchmarks in the Notebook examples folder, download the repository and unpack the sample data:

git clone https://github.com/gregbellan/Stabl.git
cd Stabl/
unzip Sample\ Data/data.zip -d Sample\ Data/
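Where the unzip command is not available (for example on Windows), the extraction step can be done with Python's standard library instead. This is a minimal sketch: the helper name `extract_archive` is hypothetical, and the paths assume the script is run from the root of the cloned repository.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()

if __name__ == "__main__":
    archive_path = Path("Sample Data") / "data.zip"
    if archive_path.exists():
        for name in extract_archive(archive_path, "Sample Data"):
            print(name)
    else:
        print(f"{archive_path} not found; run this from the Stabl/ directory")
```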

Benchmarks

Tutorial Notebook.ipynb: Tutorial on how to use the library
run_cv_*.py: Python scripts to run the sample data in cross-validation
run_val_*.py: Python scripts to run the sample data in training-validation
run_synthetic_*.py: Python scripts to run the synthetic benchmarks

NOTE: The scripts may take some time to start because of the Julia dependency. Once they have started, run time should return to normal.

Input data

When using your own data, you need to provide:

The preprocessed input data matrix (preferably a pandas DataFrame with column names)
The outcomes (preferably a pandas Series with a name)

The input data and the outcomes should have the same indices.
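The requirements above can be checked before fitting anything. The sketch below builds a toy DataFrame and Series and validates them. The helper `check_inputs` is hypothetical and not part of the library, and it is stricter than the README's "preferably": it also assumes the indices must match in order.

```python
import numpy as np
import pandas as pd

def check_inputs(X, y):
    """Raise ValueError unless X and y are labeled and share the same index."""
    if not isinstance(X, pd.DataFrame) or not isinstance(y, pd.Series):
        raise ValueError("expected a pandas DataFrame and a pandas Series")
    if y.name is None:
        raise ValueError("the outcome Series should have a name")
    if not X.index.equals(y.index):
        raise ValueError("X and y must have identical indices, in the same order")
    return X, y

# Toy data: 4 samples, 3 named features, one named binary outcome.
rng = np.random.default_rng(0)
samples = ["P1", "P2", "P3", "P4"]
X = pd.DataFrame(
    rng.normal(size=(4, 3)), index=samples, columns=["feat_a", "feat_b", "feat_c"]
)
y = pd.Series([0, 1, 0, 1], index=samples, name="outcome")

check_inputs(X, y)  # passes silently; a mismatch would raise ValueError
```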

Sample Data

NB: In every CSV file, the first column corresponds to the patient ID.
data.zip contains the data for the following use cases:
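Since the first column holds identifiers rather than measurements, it should become the row index when a file is loaded. The sketch below uses a hypothetical two-row file held in memory. With the real data, pass the path of a CSV extracted from data.zip in place of the `StringIO` object.

```python
import io
import pandas as pd

# Hypothetical miniature file standing in for one of the CSVs in data.zip.
csv_text = "patient_id,feat_a,feat_b\nP1,0.1,1.5\nP2,0.4,0.9\n"

# index_col=0 makes the first column (the ID) the row index,
# so it is not mistaken for a feature.
X = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(X)
```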

Onset of Labor

more information at doi: 10.1126/scitranslmed.abd9898

Training

Outcome (DOS.csv): Days before Labor – 150 samples – 53 patients – negative continuous data
Patient ID (ID.csv): Corresponding patient ID for each sample – 150 samples – discrete data (53 distinct IDs)
Proteomics (Proteomics.csv): 150 samples – 1317 biomarkers – continuous data
CyTOF (CyTOF.csv): 150 samples – 1502 biomarkers – continuous data
Metabolomics (Metabolomics.csv): 150 samples – 3529 biomarkers – continuous values
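With 150 samples drawn from 53 patients, several samples belong to the same person. One likely use of ID.csv is to keep all samples of a patient on the same side of a train/test split, so that performance estimates are not inflated by within-patient similarity. The sketch below shows that idea with NumPy only. The function name is hypothetical and this is not how the benchmark scripts are known to do it. scikit-learn's `GroupKFold` provides the same guarantee.

```python
import numpy as np

def grouped_folds(groups, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)  # randomize which patients end up in which fold
    for held_out in np.array_split(unique, n_splits):
        test_mask = np.isin(groups, held_out)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical IDs: 6 samples from 3 patients.
patient_ids = ["A", "A", "B", "B", "C", "C"]
for train_idx, test_idx in grouped_folds(patient_ids, n_splits=3):
    print("train:", train_idx, "test:", test_idx)
```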

Validation

Outcome (DOS_validation.csv): Days before Labor – 27 samples – 10 patients – negative continuous data
Proteomics (Proteomics_validation.csv): 21 samples – 1317 biomarkers – continuous data
CyTOF (CyTOF_validation.csv): 27 samples – 1502 biomarkers – continuous data
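The validation files do not all cover the same samples (21 for proteomics versus 27 for the outcome and CyTOF), so a multi-omic analysis needs to restrict every table to the samples they share. The sketch below does this by index intersection. It assumes each table is indexed by the same sample identifier, and the helper name and toy values are hypothetical.

```python
import pandas as pd

def align_on_shared_samples(y, **layers):
    """Restrict the outcome and every omic layer to the samples they all share."""
    shared = y.index
    for frame in layers.values():
        shared = shared.intersection(frame.index)
    # Reindex everything with the same label list so rows line up exactly.
    aligned = {name: frame.loc[shared] for name, frame in layers.items()}
    return y.loc[shared], aligned

# Toy example: proteomics is missing sample "s2".
y = pd.Series([-30.0, -12.0, -3.0], index=["s1", "s2", "s3"], name="DOS")
cytof = pd.DataFrame({"c1": [1.0, 2.0, 3.0]}, index=["s1", "s2", "s3"])
proteomics = pd.DataFrame({"p1": [0.5, 0.7]}, index=["s3", "s1"])

y_shared, layers = align_on_shared_samples(y, cytof=cytof, proteomics=proteomics)
print(len(y_shared), {name: frame.shape for name, frame in layers.items()})
```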

COVID-19

Training

more information at doi: 10.1016/j.xcrm.2022.100680

Outcome (Mild&ModVsSevere.csv): Mild/Moderate (43) Vs. Severe (25) Covid-19 cases – Categorical binary values (0=mild/moderate, 1=severe)
Proteomics (Proteomics.csv): 68 samples – 1463 biomarkers – Continuous data

Validation

more information at doi: 10.1016/j.xcrm.2021.100287

Outcome (Validation_outcome(WHO.0≥5).csv): Mild/Moderate (125) Vs. Severe (659) – Categorical binary values (0=mild/moderate, 1=severe)
Proteomics (Validation_Proteomics.csv): 784 samples – 1420 biomarkers – Continuous data

CFRNA (cell-free RNA data to predict preeclampsia)

more information at doi: 10.1038/s41586-022-04410-z and doi: 10.1016/j.patter.2022.100655

Training

Outcome (all_outcomes.csv): Control (63) Vs. Preeclampsia (96) – 48 patients – Categorical binary values (False=control, True=preeclampsia)
Patient ID (ID.csv): Corresponding patient ID for each sample – 159 samples – discrete data (48 distinct IDs)
CFRNA (cfrna_dataFINAL.csv): 159 samples – 37184 biomarkers – Continuous data

Surgical Site Infections (SSI)

Data were extracted from a clinical study in which patients undergoing nonurgent major abdominal colorectal surgery were prospectively enrolled between 07/11/2018 and 11/11/2020 at Stanford University Hospital, after approval by the Institutional Review Board of Stanford University and after written informed consent was obtained (IRB-46978).

Training

Outcome (outcome.csv): Control (77) Vs. SSI (16) – Categorical binary values (0=control, 1=patient with SSI)
CyTOF (CyTOF.csv): 93 samples – 1125 biomarkers – Continuous data
Proteomics (Proteomics.csv): 91 samples – 721 biomarkers – Continuous data

Dream (data from the DREAM challenge)

more information at doi: 10.1101/2023.03.07.23286920

Training

Outcome (Preterm.csv): Preterm (609) Vs. Non-preterm (960) – 580 patients – Categorical binary values (False=Term, True=Preterm)
Patient ID (Patients_id.csv): Corresponding patient ID for each sample – 1569 samples – discrete data (580 distinct IDs)
Taxonomy (Taxonomy.csv): 1569 samples – 3725 biomarkers – Continuous data
Phylotype (Phylotype.csv): 1569 samples – 5468 biomarkers – Continuous data
