Predictive models for secondary epilepsy in patients with acute ischemic stroke within one year
Abstract
Objective: Post-stroke epilepsy (PSE) is a critical complication that worsens both prognosis and quality of life in patients with ischemic stroke. An interpretable machine learning model was developed to predict PSE using medical records from four hospitals in Chongqing.
Methods: Medical records, imaging reports, and laboratory test results from 21,459 ischemic stroke patients were collected and analyzed. Univariable and multivariable statistical analyses identified key predictive factors. The dataset was split into a 70% training set and a 30% testing set. To address class imbalance, the Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors was employed. Nine widely used machine learning algorithms were evaluated using relevant prediction metrics, with SHAP (SHapley Additive exPlanations) used to interpret the model and assess the contributions of different features.
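As a rough illustration of the oversampling idea behind SMOTE, the sketch below creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, standard-library sketch and not the study's code: the Edited Nearest Neighbors cleaning step is omitted, and the function name `smote_oversample`, the toy values, and the parameters `n_new`, `k`, and `seed` are all assumptions made for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour
        u = rng.random()
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical, already-scaled feature values)
minority = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.10], [0.12, 0.18]]
new_points = smote_oversample(minority, n_new=6, k=2)
print(len(new_points))
```

Because each synthetic point lies on a segment between two real minority samples, it always stays inside the range spanned by the minority class. In practice a library implementation that also performs the Edited Nearest Neighbors step would be the more sensible choice.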
Results: Regression analyses revealed that complications such as hydrocephalus, cerebral hernia, and deep vein thrombosis, as well as specific brain regions (frontal, parietal, and temporal lobes), significantly contributed to PSE. Factors such as age, gender, NIH Stroke Scale (NIHSS) scores, and laboratory results like WBC count and D-dimer levels were associated with increased PSE risk. Tree-based methods like Random Forest, XGBoost, and LightGBM showed strong predictive performance, achieving an AUC of 0.99.
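For readers less familiar with the AUC metric reported here, the following minimal sketch computes it directly from its definition: the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. The labels and scores are hypothetical and unrelated to the study data, and `roc_auc` is an illustrative helper, not a function from the repository.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = PSE) and predicted risks
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]
auc = roc_auc(y_true, scores)
print(auc)
```

In this toy example, 8 of the 9 positive-negative pairs are ranked correctly. The pairwise loop is quadratic in the number of samples, so a library routine is preferable for a dataset of this size.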
Conclusion: The model accurately predicts PSE risk, with tree-based models demonstrating superior performance. NIHSS score, WBC count, and D-dimer were identified as the most crucial predictors.
README: Lasso-ML: Machine Learning Code and Data for Stroke Study
Overview
This repository contains machine learning code, data and relevant materials focused on a stroke study using Lasso regression techniques. It includes data preprocessing, model training, cross-validation, and statistical analysis.
Contents
data.csv: the study data. It contains the variables listed below. For categorical variables, 0 represents negative and 1 represents positive:
second_epilepsy - Secondarily acquired epilepsy, the dependent variable of this study
Complications
uremia - A condition characterized by high levels of urea in the blood, typically due to kidney failure
dvt - Deep vein thrombosis
fatty_liver - Fatty liver disease or steatosis
diabetes - Diabetes mellitus
hypertension - High blood pressure
coronary_disease - Coronary artery disease
atrial_fibrillation - A type of irregular heartbeat
cerebral_hernia - Cerebral herniation, a condition where the brain is displaced from its normal position
hydrocephalus - An excess of cerebrospinal fluid in the brain
hyperuricemia - High levels of uric acid in the blood
hyperlipidaemia - High levels of lipids in the blood, also known as hyperlipidemia
hypoproteinemia - Low levels of protein in the blood
Affected regions and vessels identified by radiology results
frontal_lobe - Frontal lobe of the brain
parietal_lobe - Parietal lobe of the brain
temporal_lobe - Temporal lobe of the brain
occipital_lobe - Occipital lobe of the brain
insular_lobe - Insular cortex of the brain
range_lobe - the summary of frontal_lobe, parietal_lobe, temporal_lobe, occipital_lobe and insular_lobe
basal_ganglia - Basal ganglia, a group of nuclei in the brain
capsula_interna - Internal capsule, a part of the brain
brainstem - Brainstem, the lower part of the brain
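epencephalon - Epencephalon, an older term that is rarely used in modern terminology; it generally denotes the hindbrain region containing the cerebellum
paraventricular - Paraventricular region, the tissue adjacent to the ventricles (the term can also refer to the paraventricular nucleus of the hypothalamus)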
centrum_semiovale - Semioval center, a part of the brain
thalamus - Thalamus, a part of the brain
aca - Anterior cerebral artery
mca - Middle cerebral artery
pca - Posterior cerebral artery
va - Vertebral artery
ba - Basilar artery
cca_plaque - Common carotid artery plaque
ica_plaque - Internal carotid artery plaque
eca_plaque - External carotid artery plaque
subcortex_lobe - Subcortical lobe, referring to areas beneath the cortex
ant_circle - Anterior circulation
post_circle - Posterior circulation
large_ves_as - Large vessel disease
Laboratory results:
- plt - Platelet count - x10^9/L (cells/L)
- wbc - White blood cell count - x10^9/L (cells/L)
- rbc - Red blood cell count - x10^12/L (cells/L)
- hba1c - Hemoglobin A1c - %
- crp - C-reactive protein - mg/L
- tg - Triglycerides - mg/dL
- ldl - Low-density lipoprotein - mg/dL
- hdl - High-density lipoprotein - mg/dL
- ast - Aspartate aminotransferase - U/L
- alt - Alanine aminotransferase - U/L
- bilirubin - Bilirubin - µmol/L
- albumin - Albumin - g/L
- urea - Urea - mmol/L
- creatinine - Creatinine - µmol/L
- bua - Blood uric acid - µmol/L
- pt - Prothrombin time - seconds
- aptt - Activated partial thromboplastin time - seconds
- tt - Thrombin time - seconds
- inr - International normalized ratio - ratio
- d_dimer - D-dimer - ng/mL
- fibrinogen - Fibrinogen - g/L
- ck - Creatine kinase - U/L
- ck_mb - Creatine kinase MB - U/L
- ldh - Lactate dehydrogenase - U/L
- hbdh - Hydroxybutyrate dehydrogenase - U/L
- ima - Ischemia-modified albumin - absorbance units
- lactate - Lactate - mmol/L
- anion_gap - Anion gap - mmol/L
- tco2 - Total carbon dioxide - mmol/L
age - Age of the patients - years
nihss - National Institutes of Health Stroke Scale, a scale used to measure stroke severity
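A minimal sketch of reading a file with this layout and separating the dependent variable from the predictors is given below. It uses only the standard library and an in-memory excerpt with hypothetical values; the column names come from the list above, while the handling of empty cells is an assumption, since the README does not say whether data.csv still contains missing values.

```python
import csv
import io

# Hypothetical three-row excerpt using column names from the list above;
# in practice you would open data.csv instead of this in-memory string.
sample = io.StringIO(
    "second_epilepsy,dvt,hydrocephalus,wbc,d_dimer,age,nihss\n"
    "0,0,0,6.8,310,64,3\n"
    "1,1,0,11.2,1250,71,14\n"
    "0,0,1,7.5,,58,6\n"
)

TARGET = "second_epilepsy"


def parse_value(text):
    """Convert a CSV cell to float; empty cells become None (missing)."""
    return float(text) if text.strip() else None


rows = list(csv.DictReader(sample))
# Dependent variable: 1 = secondary epilepsy, 0 = none
y = [int(row[TARGET]) for row in rows]
# Predictors: every column except the target
X = [{k: parse_value(v) for k, v in row.items() if k != TARGET} for row in rows]

missing = sum(v is None for features in X for v in features.values())
print(y, missing)
```

With pandas, which is listed in the requirements, the same separation could be done by reading the file into a DataFrame and dropping the `second_epilepsy` column from the predictors.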
Code.zip can be decompressed into the following files:
- externalTest.ipynb: Code for external testing of the model.
- fillDropThings.py: Script for data preprocessing tasks; fills missing data using RF.
- lasso_ml.ipynb: Main notebook implementing Lasso regression, model construction, and SHAP analysis.
- lasso_ml_cross_validation.ipynb: Notebook implementing the 5-fold cross-validation version of lasso_ml.
- statistics_analysis.ipynb: Notebook for statistical analysis of the results.
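To illustrate what the 5-fold cross-validation mentioned above involves, the sketch below assigns sample indices to five folds and then iterates over train/test partitions. It is a standard-library sketch, not the notebook's code: the stratification by class label, the function name `stratified_folds`, and the toy label vector are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample index to a fold, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices out round-robin so each fold gets a similar share
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


# Hypothetical labels: 10 positives among 100 patients
labels = [1] * 10 + [0] * 90
fold_of = stratified_folds(labels)

for fold in range(5):
    # Each fold serves once as the held-out test set
    test_idx = [i for i, f in enumerate(fold_of) if f == fold]
    train_idx = [i for i, f in enumerate(fold_of) if f != fold]
    positives = sum(labels[i] for i in test_idx)
    print(fold, len(train_idx), len(test_idx), positives)
```

Dealing each class out separately keeps the rare positive class represented in every fold, which matters when the outcome is as imbalanced as it typically is for PSE.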
Requirements
- Python 3.x
- Jupyter Notebook
- Required libraries: numpy, pandas, scikit-learn, matplotlib, seaborn
Installation
- Clone the repository:
  git clone https://github.com/conanan/lasso-ml.git
Methods
This is a large dataset of stroke patients extracted from the databases of several hospitals in Chongqing. We collected nearly all medical records, radiology results, and laboratory results. The original data contain information generated by a PostgreSQL database. If you and your team are working on a similar stroke study, please contact us to collaborate.