Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets

Cite this dataset

Newman-Griffis, Denis et al. (2021). Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets [Dataset]. Dryad.


Objective: Normalizing mentions of medical concepts to standardized vocabularies is a fundamental component of clinical text analysis. Ambiguity—words or phrases that may refer to different concepts—has been extensively researched as part of information extraction from biomedical literature, but less is known about the types and frequency of ambiguity in clinical text. This study characterizes the distribution and distinct types of ambiguity exhibited by benchmark clinical concept normalization datasets, in order to identify directions for advancing medical concept normalization research.

Materials and Methods: We identified ambiguous strings in datasets derived from the two available clinical corpora for concept normalization, and categorized the distinct types of ambiguity they exhibited. We then compared observed string ambiguity in the datasets to potential ambiguity in the Unified Medical Language System (UMLS), to assess how representative available datasets are of ambiguity in clinical language.

Results: We observed twelve distinct types of ambiguity, distributed unequally across the available datasets. However, less than 15% of the strings were ambiguous within the datasets, while over 50% were ambiguous in the UMLS, indicating only partial coverage of clinical ambiguity.

Discussion: Existing datasets are not sufficient to cover the diversity of clinical concept ambiguity, limiting both training and evaluation of normalization methods for clinical text. Additionally, the UMLS offers important semantic information for building and evaluating normalization methods.

Conclusion: Our findings identify three opportunities for concept normalization research, including a need for ambiguity-specific clinical datasets and leveraging the rich semantics of the UMLS in new methods and evaluation measures for normalization.


These data are derived from benchmark datasets released for Medical Concept Normalization research focused on Electronic Health Record (EHR) narratives. Data included in this release are derived from:

  • SemEval-2015 Task 14 (Publication DOI: 10.18653/v1/S15-2051, data accessed through release at
  • CUILESS2016 (Publication DOI: 10.1186/s13326-017-0173-6, data accessed through release at

These datasets consist of EHR narratives with annotations including: (1) the portion of a narrative referring to a medical concept, such as a problem, treatment, or test; and (2) one or more Concept Unique Identifiers (CUIs) derived from the Unified Medical Language System (UMLS), identifying the reification of the medical concept being mentioned.

The data were processed using the following procedure:

  1. All medical concept mention strings were preprocessed with lowercasing and removal of determiners ("a", "an", "the").
  2. All medical concept mentions were analyzed to identify strings that met the following conditions: (1) the string occurred more than once in the dataset, and (2) the string was annotated with at least two distinct CUIs when aggregating across dataset samples. Strings meeting both conditions were considered "ambiguous strings".
  3. Ambiguous strings were reviewed by article authors to determine (1) the category and subcategory of ambiguity exhibited (derived from an ambiguity typology described in the accompanying article); and (2) whether the semantic differences in CUI annotations were reflected by differences in textual meaning (strings not meeting this criterion were termed "arbitrary").

For more details, please see the accompanying article (DOI: 10.1093/jamia/ocaa269).

Usage notes

CSV files are provided for ambiguous strings in each of SemEval-2015 and CUILESS2016. The attached README.txt file explains each column and how to parse CSV file contents.
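A minimal sketch of parsing one of the CSVs with the Python standard library is shown below. The column names ("string", "cuis") and the "|" delimiter between CUIs are assumptions made for illustration; consult the bundled README.txt for the actual column layout.

```python
import csv
import io

# Inline sample standing in for a downloaded CSV file; in practice you
# would pass an open file handle for the SemEval-2015 or CUILESS2016 CSV.
sample = io.StringIO("string,cuis\ncold,C0009443|C0009264\n")

rows = []
for row in csv.DictReader(sample):
    # Assumed: one row per ambiguous string, CUIs joined with "|".
    rows.append((row["string"], row["cuis"].split("|")))

print(rows)  # [('cold', ['C0009443', 'C0009264'])]
```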

Please note that only the ambiguous strings and their CUIs are included in the CSV files; no mentions or unambiguous strings are provided. To analyze string occurrences in context and review unambiguous strings, use the following steps:

  1. Register and request access to the datasets (including signing appropriate Data Usage Agreements), using the links provided above.
  2. Download datasets for analysis; the code we used for our analyses is available at


Funding

National Institutes of Health

United States Social Security Administration