Predictive modeling for clinical features associated with Neurofibromatosis Type 1

Payne, Philip 1 ; Morris, Stephanie1; Gupta, Aditi1; Kim, Seunghwan1; Foraker, Randi1; Gutmann, David1

Published Mar 10, 2022 on Dryad. https://doi.org/10.5061/dryad.nvx0k6drn

Data files

Mar 10, 2022 version files 11.95 KB

NEUROCLINCPRACT-2020-062901_DataSheet1.csv

2.28 KB
NEUROCLINCPRACT-2020-062901_DataSheet2.csv

1.99 KB
NEUROCLINCPRACT-2020-062901_DataSheet3.csv

1.26 KB
Payne_dataset_readme.txt

6.43 KB

Abstract

Objective: Perform a longitudinal analysis of clinical features associated with Neurofibromatosis Type 1 (NF1) based on demographic and clinical characteristics, and to apply a machine learning strategy to determine feasibility of developing exploratory predictive models of optic pathway glioma (OPG) and attention-deficit/hyperactivity disorder (ADHD) in a pediatric NF1 cohort.

Methods: Using NF1 as a model system, we perform retrospective data analyses utilizing a manually-curated NF1 clinical registry and electronic health record (EHR) information, and develop machine-learning models. Data for 798 individuals were available, with 578 comprising the pediatric cohort used for analysis.

Results: Males and females were evenly represented in the cohort. White children were more likely to develop OPG (OR: 2.11, 95%CI: 1.11-4.00, p=0.02) relative to their non-white peers. Median age at diagnosis of OPG was 6.5 years (1.7-17.0), irrespective of sex. Males were more likely than females to have a diagnosis of ADHD (OR: 1.90, 95%CI: 1.33-2.70, p<0.001), and earlier diagnosis in males relative to females was observed. The gradient boosting classification model predicted diagnosis of ADHD with an AUROC of 0.74, and predicted diagnosis of OPG with an AUROC of 0.82.

Conclusions: Using readily available clinical and EHR data, we successfully recapitulated several important and clinically-relevant patterns in NF1 semiology specifically based on demographic and clinical characteristics. Naïve machine learning techniques can be potentially used to develop and validate predictive phenotype complexes applicable to risk stratification and disease management in NF1.

Patients and Data Description

This study was performed using retrospective clinical data extracted from two sources within the Washington University Neurofibromatosis (NF) Center. First, data were extracted from an existing longitudinal clinical registry that was manually curated using clinical data obtained from patients followed in the Washington University NF Clinical Program at St. Louis Children’s Hospital. All individuals included in this database had a clinical diagnosis of NF1 based on current National Institutes of Health Consensus Development Conference diagnostic criteria,9 and had been assessed over multiple visits from 2002 to 2016 for the presence of clinical features associated with NF1. Data points in this registry included demographic information, such as age, race, and sex, in addition to NF1-related clinical features and associated conditions, such as café-au-lait macules, skinfold freckling, cutaneous neurofibromas, Lisch nodules, OPG, hypertension, ADHD, and cognitive impairment. These data were maintained in a semi-structured format containing textual and binary fields, capturing each individual’s data over multiple clinical visits. From these data, clinical features and phenotypes were extracted using data manipulation, imputation, and text mining techniques. Data obtained from this NF1 clinical registry were converted to data tables, which captured each patient visit and the presence/absence of specific clinical features at each visit. Clinical features which were once marked as present were assumed to be present for all future visits, and missing data were assumed absent for that specific visit. Categorical variables are reported as frequencies and proportions, and compared using odds ratios (ORs). Continuously distributed traits, adhering to both conventional normality assumptions and homogeneity of variances, are reported as mean and standard deviations, and compared using analysis of variance methods. Non-parametric equivalents were used for data with non-normative distributions.

Clinical Feature Extraction from Clinical Registry and EHR

The NF1 Clinical Registry comprised string-based clinical feature values, such as ADHD, OPG, and asthma. From these data, we extracted 27 unique clinical features in addition to longitudinal data on the development of NF1-related clinical features and associated diagnoses. For each clinical feature, age at initial presentation and/or diagnosis was computed, and median age of occurrence was calculated for each sex. The exact age of presentation and/or diagnosis could not be definitively ascertained for any feature that was present at a child’s initial clinic visit. As such, we computed the age of diagnosis only for those clinical features for which we have at least one visit documenting feature absence prior to the manifestation of that feature.

Diagnosis codes from the EHR-derived data set were also extracted. Diagnosis codes were recorded as 15,890 unique ICD 9/10 codes. Given the large number of ICD 9/10 codes, a consistent, concept-level “roll up” of relevant codes to a single phenotype description was created by mapping the extracted ICD 9/10 values to phenome-wide association (PheWAS) codes called Phecodes, which have been demonstrated to better align with clinical disease compared to individual ICD codes.

Machine Learning Analyses

Using a combination of clinical features obtained from the NF1 Clinical Registry and EHR-derived data sets, we developed prediction models using a gradient boosting platform for identifying patients with specific NF1-related diagnoses to establish the usefulness of clinical history and documentation of clinical findings in predicting phenotypic variability of NF1. Initial analyses used a state-of-the-art classification algorithm, gradient boosting model, which uses a tree-based algorithm to produce a predictive model from an ensemble of weak predictive models. Gradient boosting model was selected, as it supports identifying importance of features used in the final prediction model. Subsequent analyses employed training each model for three different feature sets: (1) demographic features for all patients, including race, sex, and family history of NF1 [5 features]; (2) clinical features associated with NF1 [27 features] extracted from the NF1 Clinical Registry; and (3) diagnosis codes extracted from the EHR data, which were reduced to 50 Phecodes. Four-fold cross validation was then applied for the three models, and comparisons for the prediction accuracies of each model determined. Positive predictive value (PPV), F1 score and the area under the receiver operator characteristic (AUROC) curve were used as evaluation metrics. Scikit Learn, a machine learning library in Python, was employed to implement all analyses.

Standard Protocol Approvals, Registrations, and Patient Consents

The NF1 Clinical Registry is an existing longitudinal clinical registry that was manually curated using clinical data obtained from patients followed in the Washington University NF Clinical Program at St. Louis Children’s Hospital. All individuals included in this database have a clinical diagnosis of NF1 based on current National Institutes of Health criteria and have provided informed consent for participation in the clinical registry. All data collection, usage and analysis for this study were approved by the Institutional Review Board (IRB) at the Washington University School of Medicine.