Data from: Natural language processing and recurrent network models for identifying genomic mutation-associated cancer treatment change from patient progress notes

Guan, Meijian; Cho, Samuel; Petro, Robin; Zhang, Wei; Pasche, Boris; Topaloglu, Umit

Published Feb 06, 2019 on Dryad. https://doi.org/10.5061/dryad.f9m8217

Data files

Feb 06, 2019 version files 95.07 KB

Abstract

Objectives: Natural language processing (NLP) and machine learning approaches were used to build classifiers that identify genomic-related treatment changes in the free-text visit progress notes of cancer patients.

Methods: We obtained 5,889 de-identified progress reports (2,439 words on average) for 755 cancer patients who had undergone clinical next-generation sequencing (NGS) testing at Wake Forest Baptist Comprehensive Cancer Center. An NLP system was implemented to process the free-text data and extract NGS-related information. Three types of recurrent neural network (RNN), namely gated recurrent unit (GRU), long short-term memory (LSTM), and bidirectional LSTM (LSTM_Bi), were applied to classify documents into treatment-change and no-treatment-change groups. We further compared the performance of the RNNs with that of five machine learning algorithms: Naive Bayes (NB), K-nearest Neighbor (KNN), Support Vector Machine for classification (SVC), Random Forest (RF), and Logistic Regression (LR).

Results: The results suggested that, overall, RNNs outperformed traditional machine learning algorithms, and that LSTM_Bi showed the best performance among the RNNs in terms of accuracy, precision, recall, and F1 score. In addition, pre-trained word embeddings can improve the accuracy of LSTM by 3.4% and reduce the training time by more than 60%.

Discussion and Conclusion: NLP and RNN-based text mining solutions have demonstrated advantages in information retrieval and document classification tasks for unstructured clinical progress notes.
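The four measures named in the abstract (accuracy, precision, recall, and F1 score) can be illustrated for a two-class task with a minimal sketch. The function name `binary_metrics`, the encoding of treatment-change as label 1, and the toy labels are assumptions made for illustration; they are not taken from the dataset files or from the authors' code.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task.

    Hypothetical helper: `positive` marks the treatment-change class,
    which is assumed here to be encoded as 1.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up): 1 = treatment-change, 0 = no-treatment-change.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels there are three true positives, one false positive, one false negative, and three true negatives, so all four measures come out to 0.75.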

Usage notes

NLP_Report_Process_clean

NonDL_models_clean

LSTM_keras_03012018_clean

visualizeModels_clean
