
Data from: Learning relevance models for patient cohort retrieval

Cite this dataset

Goodwin, Travis R.; Harabagiu, Sanda M. (2019). Data from: Learning relevance models for patient cohort retrieval [Dataset]. Dryad.


OBJECTIVE We explored how judgments provided by physicians can be used to learn relevance models that enhance the quality of patient cohorts retrieved from electronic health record (EHR) collections.

METHODS A very large number of features were extracted from patient cohort descriptions as well as from EHR collections. Specifically, we investigated retrieving (1) neurology-specific patient cohorts from the Temple University Hospital EEG Corpus and (2) the more general cohorts evaluated in the TREC Medical Records Track (TRECMed) from the de-identified hospital records provided by the University of Pittsburgh Medical Center. The features informed a Learning Relevance Model (LRM) that took advantage of relevance judgments provided by physicians. The LRM implements a pairwise learning-to-rank framework, which enables our learning patient cohort retrieval (L-PCR) system to learn from physicians’ feedback.

RESULTS AND DISCUSSION We evaluated the L-PCR system against state-of-the-art traditional patient cohort retrieval systems and observed a 27% improvement when operating on EEGs and a 53% improvement when operating on TRECMed EHRs, showing the promise of the L-PCR system. We also performed extensive feature analyses to reveal the most effective strategies for representing cohort descriptions as queries, encoding EHRs, and measuring relevance.

CONCLUSION The L-PCR system has significant promise for reliably retrieving patient cohorts from EHRs in multiple settings when trained with relevance judgments. When provided with additional cohort descriptions, the L-PCR system will continue to learn, thus offering a potential solution to the performance barriers of current cohort retrieval systems.
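For readers unfamiliar with pairwise learning-to-rank, the general idea can be sketched as follows. This is a minimal illustration only, not the authors' LRM: the description above does not state the model family or loss, so the sketch assumes a linear scorer trained with a pairwise logistic loss. The function names (`make_pairs`, `train_pairwise`, `rank`), the two-dimensional feature vectors, and the graded judgments are all hypothetical.

```python
import math

def make_pairs(features, judgments):
    """Build (preferred, other) pairs: the first record has the higher judgment."""
    pairs = []
    for i, ji in enumerate(judgments):
        for k, jk in enumerate(judgments):
            if ji > jk:
                pairs.append((features[i], features[k]))
    return pairs

def train_pairwise(pairs, n_features, epochs=200, lr=0.1):
    """Fit linear weights by SGD on the pairwise logistic loss log(1 + exp(-margin))."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # The negative gradient of the loss is sigmoid(-margin) * diff.
            g = 1.0 / (1.0 + math.exp(margin))
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w

def rank(w, features):
    """Return record indices sorted by descending learned score."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    return sorted(range(len(features)), key=lambda i: score(features[i]), reverse=True)

# Made-up data: four records for one cohort description, two features each,
# with hypothetical physician judgments (2 = relevant, 1 = partial, 0 = not).
features = [[1.0, 0.2], [0.6, 0.1], [0.1, 0.9], [0.0, 0.5]]
judgments = [2, 1, 0, 0]

pairs = make_pairs(features, judgments)
w = train_pairwise(pairs, n_features=2)
ranking = rank(w, features)
```

Training on pairs rather than on individual records means the model only needs to learn which of two records a physician preferred for a given cohort description, which is why additional judgments can be folded in as further pairs.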

Usage notes