Exploiting hierarchy in medical concept embedding

Cite this dataset

Finch, Anthony et al. (2021). Exploiting hierarchy in medical concept embedding [Dataset]. Dryad.



To construct and publicly release a set of medical concept embeddings for codes following the ICD-10 coding standard which explicitly incorporate hierarchical information from medical codes into the embedding formulation.

Materials and Methods

We trained concept embeddings using several new extensions to the Word2Vec algorithm on a dataset of approximately 600,000 patients from a major integrated healthcare organization in the Mid-Atlantic US. Our concept embeddings included additional entities to account for the medical categories assigned to codes by the Clinical Classifications Software Refined (CCSR). We compared these results to sets of publicly released pretrained embeddings and alternative training methodologies.


We found that Word2Vec models which included hierarchical data outperformed ordinary Word2Vec alternatives on tasks which compared naïve clusters to canonical ones provided by CCSR. Our Skip-Gram model with both codes and categories achieved 61.4% Normalized Mutual Information with canonical labels in comparison to 57.5% with traditional Skip-Gram. In models operating on two different outcomes we found that including hierarchical embedding data improved classification performance 96.2% of the time. When controlling for all other variables, we found that co-training embeddings improved classification performance 66.7% of the time. We found that all models outperformed our competitive benchmarks.


We found significant evidence that our proposed algorithms can express the hierarchical structure of medical codes more fully than ordinary Word2Vec models, and that this improvement carries forward into classification tasks. As part of this publication, we have released several sets of pretrained medical concept embeddings using the ICD-10 standard which significantly outperform other well-known pretrained vectors on our tested outcomes.


This dataset includes trained medical concept embeddings for 5428 ICD-10 codes and 394 Clinical Classifications Software Refined (CCSR) categories.  We include several different sets of concept embeddings, each trained using a slightly different set of hyperparameters and algorithms.

To train our models, we employed data from the Kaiser Permanente Mid-Atlantic States (KPMAS) medical system.  KPMAS is an integrated medical system serving approximately 780,000 members in Maryland, Virginia, and the District of Columbia.  KPMAS has a comprehensive Electronic Medical Record system which includes data from all patient interactions with primary or specialty caregivers, from which all data is derived. Our embeddings training set included diagnoses allocated to all adult patients in calendar year 2019.

For each code, we also recovered an associated category, as assigned by the Clinical Classifications Software Refined (CCSR).

We trained 12 sets of embeddings using classical Word2Vec models with settings differing across three parameters.  Our first parameter was the selection of training algorithm, where we trained both CBOW and SG models.  Each model was trained using dimension k of 10, 50, and 100.  Furthermore, each model-dimension combination was trained with categories and codes trained separately and together (referred to hereafter as ‘co-trained embeddings’ or ‘co-embeddings’).  Each model was trained for 10 iterations.  We employed an arbitrarily large context window (100), since all codes necessarily occurred within a short period (1 year).
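The 12 Word2Vec configurations described above can be enumerated as a small grid.  The sketch below is illustrative only.  The keyword names (sg, vector_size, window, epochs) are those accepted by gensim.models.Word2Vec in Gensim 4.x, and assuming that version is a guess on our part.  The word2vec_grid helper and the co_trained flag are hypothetical names introduced here, and the way category tokens were combined with codes during co-training is not shown.

```python
from itertools import product

# Settings taken from the description above.
ALGORITHMS = {"CBOW": 0, "SG": 1}   # value is Gensim's `sg` flag (0 = CBOW, 1 = Skip-Gram)
DIMENSIONS = (10, 50, 100)
TRAINING_MODES = ("SEP", "CO")      # trained separately vs. co-trained with categories


def word2vec_grid():
    """Return one settings dict per embedding set (hypothetical helper)."""
    grid = []
    for (name, sg), dim, mode in product(ALGORITHMS.items(), DIMENSIONS, TRAINING_MODES):
        grid.append({
            "algorithm": name,
            "co_trained": mode == "CO",
            # Keyword arguments as named in Gensim 4.x (assumed version).
            "gensim_kwargs": {"sg": sg, "vector_size": dim, "window": 100, "epochs": 10},
        })
    return grid


configs = word2vec_grid()
print(len(configs))  # 12
```

Each gensim_kwargs dictionary could then be unpacked into a Word2Vec constructor call alongside the training sequences, which are not part of this dataset.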

We also trained a set of validation embeddings only on ICD-10 codes using the Med2Vec architecture as a comparison.  We trained the Med2Vec model on our data using its default settings, including the default vector size (200) and a training regime of 10 epochs.  We grouped all codes occurring on the same calendar date as Med2Vec ‘visits.’  Our Med2Vec model benchmark did not include categorical entities or other novel innovations.
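As a rough illustration of the visit grouping described above, the sketch below collects all codes recorded for a patient on the same calendar date into one visit.  The record layout (patient identifier, date, code), the group_into_visits helper, and the sample values are assumptions made for this example.  They do not reproduce the KPMAS data or the exact input format expected by the Med2Vec code.

```python
from collections import defaultdict
from datetime import date


def group_into_visits(records):
    """Group (patient_id, service_date, code) tuples into per-patient visits.

    Returns {patient_id: [[codes on earliest date], [codes on next date], ...]},
    with visits ordered by calendar date.
    """
    by_patient = defaultdict(lambda: defaultdict(list))
    for patient_id, service_date, code in records:
        by_patient[patient_id][service_date].append(code)
    return {
        patient_id: [visits[day] for day in sorted(visits)]
        for patient_id, visits in by_patient.items()
    }


# Made-up records for illustration only.
records = [
    ("p1", date(2019, 3, 1), "I10"),
    ("p1", date(2019, 3, 1), "E11.9"),
    ("p1", date(2019, 1, 15), "J45.909"),
    ("p2", date(2019, 6, 2), "I10"),
]
visits = group_into_visits(records)
print(visits["p1"])  # [['J45.909'], ['I10', 'E11.9']]
```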

Word2Vec embeddings were generated using the Gensim package in Python.  Med2Vec embeddings were generated using the Med2Vec code published by Choi.  The JSON files in this repository were generated using Python's json package.

Usage notes

This dataset contains pretrained embeddings for ICD-10 codes and Clinical Classifications Software Refined (CCSR) categories.

The core file in this dataset is the associated embeddings.json file.  This file contains all embeddings described in our article, entitled Exploiting Hierarchy in Medical Concept Embeddings.

Embeddings follow either the ICD-10 coding standard or the CCSR category labelling defined in the 2021.1 version of that software.  To download maps from ICD-10 codes to CCSR categories, please visit the HCUP downloads page.

This dataset is structured as a multilevel JSON file.


The first level of keys indicates the relevant set of embeddings.  Keys are structured as:

<Algorithm>_<CO-trained or SEParately-trained><_cat if the set is of CCSR categories instead of ICD-10 codes>_embeddings_<Embedding dimension>

For example:

* CBOW_SEP_embeddings_10 indicates that the set under this key contains ICD-10 code embeddings (no _cat_ tag) trained using the Continuous Bag-of-Words (CBOW) algorithm without co-training the CCSR categories (SEP).  The embeddings are 10-dimensional (_10).

* SG_CO_cat_embeddings_100 indicates that the set under this key contains CCSR category embeddings (_cat_ tag) trained using the Skip-Gram (SG) algorithm and co-trained with the ICD-10 codes (CO).  The embeddings are 100-dimensional (_100).

The final key is Med2Vec_embeddings_200.  This set contains 200-dimensional ICD-10 embeddings trained using the Med2Vec model without co-training the CCSR categories.  These embeddings were used as a comparison dataset only.
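The key convention above can be decoded programmatically.  The following is a minimal sketch, assuming the keys match the pattern exactly as written.  The parse_set_key helper and the field names it returns are hypothetical and are not part of the dataset.

```python
import re

# <Algorithm>_<CO|SEP>[_cat]_embeddings_<dimension>
_W2V_KEY = re.compile(
    r"^(?P<algorithm>CBOW|SG)_(?P<mode>CO|SEP)(?P<cat>_cat)?_embeddings_(?P<dim>\d+)$"
)
# The Med2Vec comparison set has no CO/SEP or _cat component.
_M2V_KEY = re.compile(r"^Med2Vec_embeddings_(?P<dim>\d+)$")


def parse_set_key(key):
    """Split a first-level key into its components (hypothetical helper)."""
    match = _W2V_KEY.match(key)
    if match:
        return {
            "algorithm": match["algorithm"],
            "co_trained": match["mode"] == "CO",
            "categories": match["cat"] is not None,
            "dimension": int(match["dim"]),
        }
    match = _M2V_KEY.match(key)
    if match:
        return {
            "algorithm": "Med2Vec",
            "co_trained": False,
            "categories": False,
            "dimension": int(match["dim"]),
        }
    raise ValueError(f"Unrecognized embedding set key: {key!r}")


print(parse_set_key("SG_CO_cat_embeddings_100"))
# {'algorithm': 'SG', 'co_trained': True, 'categories': True, 'dimension': 100}
```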


Under each key is a secondary JSON dictionary.  These dictionaries each contain either 5428 ICD-10 codes or 394 CCSR categories, depending on whether the set in question is a code set or a category set.


Under each code or category key there is a list of k elements, where k is the dimensionality of the embedding space (e.g. 10 for CBOW_CO_embeddings_10).  Elements are numbered 1-k, and each holds a number that represents the corresponding coordinate in embedding space.  Please note that coordinate values for Word2Vec models are stored in standard decimal notation as strings wrapped in double quotation marks, whereas Med2Vec values are stored in scientific notation (e.g. -3.858267374185924e-48) without string markers.  This formatting was generated automatically by the Python json package, which reads both forms, although the quoted values must be converted to floats after loading.
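One way to read a set into numeric vectors is sketched below.  Because the description leaves open whether the coordinates are stored as a JSON list or as an object keyed 1 to k, the to_vector helper accepts both, and it converts quoted and unquoted values alike.  The helper names, the file path argument, and the two-coordinate sample values are assumptions made for illustration.

```python
import json


def to_vector(raw):
    """Convert one stored embedding to a list of floats.

    Accepts a JSON list or a JSON object keyed "1".."k" (both layouts are
    handled because the exact layout is an assumption), with values stored
    either as quoted decimal strings or as bare numbers.
    """
    if isinstance(raw, dict):
        raw = [raw[k] for k in sorted(raw, key=int)]
    return [float(value) for value in raw]


def load_embedding_set(path, set_key):
    """Load one embedding set from the JSON file as {code: [floats]}."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return {code: to_vector(raw) for code, raw in data[set_key].items()}


# Tiny made-up sample mimicking the two storage styles (truncated to 2 coordinates).
sample = json.loads(
    '{"CBOW_CO_embeddings_10": {"I10": ["0.25", "-1.5"]},'
    ' "Med2Vec_embeddings_200": {"I10": [-3.858267374185924e-48, 2.0]}}'
)
print(to_vector(sample["CBOW_CO_embeddings_10"]["I10"]))  # [0.25, -1.5]
```

With the real file, a call such as load_embedding_set("embeddings.json", "Med2Vec_embeddings_200") would be expected to return one vector per ICD-10 code, assuming the file sits in the working directory.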

For additional detail on using this dataset, please refer to the README file.