On embedding-based automatic mapping of clinical classification system: Handling linguistic variations and granular inconsistencies
Data files
Jan 13, 2026 version files 507.91 MB
- data.tar.xz (507.90 MB)
- README.md (7.63 KB)
Abstract
Objectives: Mapping clinical classification systems, such as the International Classification of Diseases (ICD), is essential yet challenging. While the manual mapping method remains labour-intensive and lacks scalability, existing embedding-based automatic mapping methods, particularly those leveraging transformer-based pre-trained encoders, encounter two persistent challenges: (1) linguistic variation and (2) varying granular details in clinical conditions.
Materials and methods: We introduce an automatic mapping method that combines the representational power of pre-trained encoders with the reasoning capability of large language models (LLMs). For each ICD code, we generate (1) hierarchy-augmented (HA) and (2) LLM-generated (LG) descriptions to capture rich semantic nuances, addressing linguistic variation. Furthermore, we introduce a prompting framework (PR) that leverages LLM reasoning to handle granularity mismatches, including source-to-parent mappings.
Results: Chapter-wise mappings were performed between ICD versions (ICD-9-CM↔ICD-10-CM and ICD-10-AM↔ICD-11) using multiple LLMs. The proposed approach consistently outperformed the baseline across all ICD pairs and chapters. For example, combining hierarchy-augmented descriptions with Qwen3-8B–generated descriptions yielded an average Top-1 accuracy improvement of 6.67% across the mapping cases. A small-scale pilot study further indicated that HA+LG remains effective in more challenging one-to-many mappings.
Discussion and conclusions: Our findings demonstrate that integrating the representational power of pre-trained encoders with LLM reasoning offers a robust, scalable strategy for automatic ICD mapping.
This dataset (data.tar.xz) comprises chapter-wise International Classification of Diseases (ICD) codes across four ICD versions: ICD-9-CM, ICD-10-CM, ICD-10-AM, and ICD-11. It covers three clinical chapters—Diseases of the Digestive System (Dig), Infectious and Parasitic Diseases (Inf), and Diseases of the Respiratory System (Resp). In addition, the dataset includes ground-truth mappings for multiple ICD version pairs. To support reproducibility and downstream research, we also provide the LLM-generated descriptions and the raw embeddings.
Description of the data and file structure
File Structure Overview
./
├── *.csv <- ICD code files (root level)
├── gt/ <- Ground-truth mappings
├── summary/ <- LLM-generated summaries
├── emb/ <- Raw embeddings
├── testset.json <- Test cases used in rule-based prompting (RP)
└── README.md
Root Level Files:
- {icd_version}_{chapter}.csv - ICD codes organized by version and chapter
  - Example: icd9cm_dig.csv refers to ICD-9-CM codes for the Diseases of the Digestive System chapter
gt/ (Ground-truth Mappings):
- {source}_{target}.pkl - Mapping files between ICD versions
  - Example: icd9cm_icd10cm.pkl refers to ground-truth mappings for ICD-9-CM to ICD-10-CM.
summary/ (LLM Summaries):
- {model_name}/summary_run_{n}.tsv - LLM-generated summaries grouped by LLM model and run
  - Example: Qwen3-8B/summary_run_1.tsv means summaries generated by Qwen3-8B in run #1 (i.e., first run)
emb/ (Embeddings):
Embeddings are organized under two main approaches; both use the same sentence-transformer (SBERT) to obtain the final embeddings:
- terms-only/{encoder_model}/{icd_version}_{chapter}.pkl - Embeddings using only ICD code descriptions
  - Example: terms-only/clinicalbert/icd9cm_dig.pkl refers to the embeddings generated by ClinicalBERT for ICD-9-CM codes (Dig chapter).
- sbert-all-mpnet-base-v2/{approach}/{model}/run_{n}/{icd_version}_{chapter}.pkl
  - ha/ - using hierarchy-augmented descriptions
  - lg/ - using LLM-generated descriptions
  - Example: sbert-all-mpnet-base-v2/lg/qwen3-8B/run_1/icd9cm_dig.pkl represents the embeddings obtained using the Qwen3-8B generated descriptions in run #1 for ICD-9-CM (Dig chapter).
Data Structures
ICD codes (e.g., ./icd9cm_dig.csv):
Comma-Separated Values (CSV) file with the following columns:
- code: ICD code (e.g., 5200)
- code_desc: Corresponding code description (e.g., Anodontia)
- p_1 to p_n: Categorical labels, with p_n being the root node.
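As a rough illustration of how the ICD code files could be read, the sketch below loads a CSV with every column kept as text (so a code such as 0010 keeps its leading zeros) and joins the p_* labels into a single hierarchy string. The join order (root p_n first, then down to p_1), the " > " separator, the helper names, and the sample row are assumptions made for illustration; the exact template used for the hierarchy-augmented descriptions is not specified here.

```python
import io

import pandas as pd


def load_icd_codes(path_or_buffer):
    """Read an ICD code file, keeping all columns as strings."""
    # dtype=str preserves leading zeros; keep_default_na=False turns
    # empty cells into "" instead of NaN.
    return pd.read_csv(path_or_buffer, dtype=str, keep_default_na=False)


def hierarchy_text(row, sep=" > "):
    """Join p_n ... p_1 and code_desc into one string (assumed template)."""
    p_cols = sorted(
        (c for c in row.index if c.startswith("p_")),
        key=lambda c: int(c[2:]),
        reverse=True,  # p_n (root) first
    )
    labels = [row[c] for c in p_cols if row[c]]  # skip empty levels
    return sep.join(labels + [row["code_desc"]])


# Illustrative row only; it is not copied from the dataset.
sample = io.StringIO(
    "code,code_desc,p_1,p_2\n"
    "5200,Anodontia,Disorders of tooth development and eruption,"
    "Diseases of the digestive system\n"
)
codes = load_icd_codes(sample)
codes["ha_text"] = codes.apply(hierarchy_text, axis=1)
print(codes.loc[0, "ha_text"])
```

With a real file, `load_icd_codes("icd9cm_dig.csv")` would replace the in-memory sample.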
Summary data files (e.g., ./summary/qwen3-8b/summary_run_1.tsv):
Tab-Separated Values (TSV) file with the following columns:
- code: ICD code
- code_desc: Corresponding code description
- summary: LLM-generated summary
Ground-truth files:
Python pickle file containing a dictionary object:
- key: Source ICD code (string)
- value: List of corresponding target ICD codes (list of strings)
- Example: {"0010": ["A000"]}
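A minimal sketch of loading a ground-truth file and checking how many source codes map to one versus several targets. The function names and the second toy entry (SRC1, TGT1, TGT2) are hypothetical; only the dictionary layout comes from the description above. Pickle files can execute code when loaded, so only unpickle files from a source you trust.

```python
import pickle
from collections import Counter


def load_ground_truth(path):
    """Load a {source_code: [target_codes]} dictionary from a pickle file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def mapping_cardinality(gt):
    """Count source codes by the number of distinct targets they map to."""
    counts = Counter()
    for targets in gt.values():
        n = len(set(targets))
        if n == 0:
            counts["unmapped"] += 1
        elif n == 1:
            counts["one-to-one"] += 1
        else:
            counts["one-to-many"] += 1
    return dict(counts)


# Toy dictionary in the documented layout; the second entry is hypothetical.
toy_gt = {"0010": ["A000"], "SRC1": ["TGT1", "TGT2"]}
print(mapping_cardinality(toy_gt))
```

For the real data, `load_ground_truth("gt/icd9cm_icd10cm.pkl")` would supply the dictionary.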
Embeddings:
Python pickle file containing a dictionary object:
- key: ICD code (string)
- value: 768-dimensional embedding vector (torch.Tensor)
- Requires PyTorch to load and manipulate
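One way such embeddings could be used is nearest-neighbour matching between a source and a target chapter. The sketch below assumes cosine similarity as the matching score and counts a Top-1 hit when the best-ranked target appears in the ground-truth list; both are illustrative assumptions, not a statement of the authors' evaluation code. It works on plain arrays so it can run without PyTorch, although unpickling the real files still requires PyTorch. The toy vectors are 3-dimensional and the codes S1, T1, etc. are hypothetical.

```python
import numpy as np


def to_matrix(emb):
    """Stack a {code: vector} dict into (codes, L2-normalised matrix)."""
    codes = list(emb)
    # For torch tensors loaded from the pickles, convert first, e.g.
    # {k: v.detach().cpu().numpy() for k, v in emb.items()}.
    mat = np.stack([np.asarray(emb[c], dtype=np.float64) for c in codes])
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return codes, mat / np.clip(norms, 1e-12, None)


def top_k_targets(source_emb, target_emb, k=1):
    """Return the k most cosine-similar target codes for every source code."""
    s_codes, s = to_matrix(source_emb)
    t_codes, t = to_matrix(target_emb)
    sims = s @ t.T  # cosine similarity, since rows are unit length
    order = np.argsort(-sims, axis=1)[:, :k]
    return {sc: [t_codes[j] for j in row] for sc, row in zip(s_codes, order)}


def top1_accuracy(predictions, gt):
    """Share of source codes whose best match is in the ground-truth list."""
    scored = [s for s in predictions if s in gt]
    if not scored:
        return 0.0
    hits = sum(predictions[s][0] in gt[s] for s in scored)
    return hits / len(scored)


# Toy 3-dimensional vectors; the real vectors are 768-dimensional.
source = {"S1": [1.0, 0.0, 0.0], "S2": [0.0, 1.0, 0.0]}
target = {"T1": [0.9, 0.1, 0.0], "T2": [0.0, 1.0, 0.1], "T3": [0.0, 0.0, 1.0]}
preds = top_k_targets(source, target, k=1)
print(preds, top1_accuracy(preds, {"S1": ["T1"], "S2": ["T3"]}))
```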
testset.json:
JSON file containing a list of test cases used for rule-based prompting (RP) evaluation.
Structure:
- Format: JSON array of objects
- Each object contains:
  - anchor: Source ICD code description (string)
  - options: Dictionary of candidate target codes with their descriptions
    - Key: Target ICD code (string)
    - Value: Code description (string)
  - gt: Ground-truth mapping code (string)
  - pred: Prediction output from the prompting framework (string or null; null becomes None once loaded in Python)
Example:
[
  {
    "anchor": "Chronic meningococcemia",
    "options": {
      "1C1C20": "acute meningococcaemia",
      "1C1C2Y": "other specified meningococcaemia",
      "1C1C2Z": "meningococcaemia, unspecified"
    },
    "gt": "1C1C2Y",
    "pred": null
  }
]
Note: testset.json is used in a case study that analyzes the effect of adding manually created rules to the prompts, i.e., rule-based prompting (RP). It contains some of the incorrect instances where the simple prompting framework produced either an invalid selection or a rejection.
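The test cases could be tallied along the lines of the sketch below. The four outcome labels and their definitions are assumptions for illustration: a missing prediction (JSON null, None in Python) is counted as a rejection, and a prediction that is not among the listed options is counted as an invalid selection. The function name is hypothetical.

```python
import json


def summarise_testset(cases):
    """Tally prediction outcomes over a list of test-case dictionaries."""
    summary = {"correct": 0, "wrong": 0, "invalid": 0, "rejected": 0}
    for case in cases:
        pred = case.get("pred")
        if pred is None:
            summary["rejected"] += 1   # no prediction returned
        elif pred == case["gt"]:
            summary["correct"] += 1
        elif pred not in case["options"]:
            summary["invalid"] += 1    # code outside the candidate list
        else:
            summary["wrong"] += 1      # valid candidate, but not the gt
    return summary


# The example case from above, written as valid JSON.
raw = """
[
  {
    "anchor": "Chronic meningococcemia",
    "options": {
      "1C1C20": "acute meningococcaemia",
      "1C1C2Y": "other specified meningococcaemia",
      "1C1C2Z": "meningococcaemia, unspecified"
    },
    "gt": "1C1C2Y",
    "pred": null
  }
]
"""
cases = json.loads(raw)
print(summarise_testset(cases))
```

For the real file, `json.load(open("testset.json", encoding="utf-8"))` would replace the inline string.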
Data Source
The dataset uses publicly available ICD releases provided by several organizations. The table below lists each ICD system with its version, provider, and the resource URL to access it.
| ICD System | Version | Provider | Resource URL |
|---|---|---|---|
| ICD-9-CM | Version 32 | Centers for Medicare & Medicaid Services (CMS) | https://www.cms.gov/medicare/coding-billing/icd-10-codes |
| ICD-10-CM | FY22 Release | Centers for Disease Control and Prevention (CDC) | https://www.cdc.gov/nchs/icd/icd-10-cm/files.html |
| ICD-10-AM | 12th Edition | Independent Health and Aged Care Pricing Authority (IHACPA) | https://www.ihacpa.gov.au/resources/icd-10-amachiacs-twelfth-edition |
| ICD-11 | Release ID: 25-01 (WHO API 2.5) | World Health Organization | https://icd.who.int/icdapi |
For ground-truth, we used the General Equivalence Mappings (GEMs) for ICD-9-CM <-> ICD-10-CM, which can be downloaded from https://www.cms.gov/medicare/coding-billing/icd-10-codes/icd-10-cm-icd-10-pcs-gem-archive.
For ICD-10-AM <-> ICD-11, we used a sequential mapping approach with ICD-10 as the reference version. For instance, we generated ICD-10-AM -> ICD-11 mappings by following ICD-10-AM -> ICD-10 -> ICD-11. For ICD-10-AM <-> ICD-10, we used the mapping files provided by IHACPA (available at: https://www.ihacpa.gov.au/resources/icd-10-am-and-achi-mapping-tables), and for ICD-10 <-> ICD-11 we used the mappings provided by the WHO, available at: https://icd.who.int/browse/2025-01/mms/en.
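The sequential approach amounts to composing two mapping dictionaries through the reference version. The sketch below is one plausible way to do this, assuming both mappings use the {source: [targets]} layout of the ground-truth files. The function name and all codes in the toy example are hypothetical placeholders, and keeping sources with no reachable target as empty lists is a design choice of this sketch, not something stated in the source.

```python
def compose_mappings(first, second):
    """Chain source -> reference and reference -> target mappings."""
    composed = {}
    for source, mids in first.items():
        targets = []
        for mid in mids:
            for target in second.get(mid, []):
                if target not in targets:  # de-duplicate, keep order
                    targets.append(target)
        composed[source] = targets
    return composed


# Hypothetical placeholder codes, not real ICD codes.
am_to_icd10 = {"AM1": ["X1", "X2"], "AM2": ["X3"]}
icd10_to_icd11 = {"X1": ["T1"], "X2": ["T1", "T2"]}
print(compose_mappings(am_to_icd10, icd10_to_icd11))
```

Sources that end up with an empty list can be filtered out afterwards if only mapped codes are wanted.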
Code/Software
We provide a Jupyter notebook (./script.ipynb) that demonstrates how to load the different data files. We ran the scripts using Python 3.12.3. Required packages include pandas and PyTorch.
