Data from: High-fidelity parameter-efficient fine-tuning for joint recognition and linking of diagnoses to ICD-10 in non-standard primary care notes
Data files
Oct 09, 2025 version files (4.97 GB total):
- bestmodels.zip.001 (786.43 MB)
- bestmodels.zip.002 (786.43 MB)
- bestmodels.zip.003 (786.43 MB)
- bestmodels.zip.004 (786.43 MB)
- bestmodels.zip.005 (786.43 MB)
- bestmodels.zip.006 (786.43 MB)
- bestmodels.zip.007 (246.99 MB)
- README.md (2.94 KB)
- scripts.zip (20.91 KB)
Abstract
Joint recognition and ICD-10 linking of diagnoses in bilingual, non-standard Spanish and Catalan primary-care notes is challenging. We evaluate Parameter-Efficient Fine-Tuning (PEFT) techniques as a resource-conscious alternative to full fine-tuning (FFT) for multi-label clinical text classification. On a corpus of 21,812 Catalan and Spanish clinical notes from Catalonia, we compared the PEFT techniques LoRA, DoRA, LoHA, LoKR, and QLoRA applied to multilingual transformers (BERT, RoBERTa, DistilBERT, mDeBERTa). FFT delivered the best strict Micro-F1 (63.0), but BERT-QLoRA scored 62.2, only 0.8 points lower, while reducing trainable parameters by 67.5% and memory by 33.7%. Training on combined bilingual data consistently improved generalization across individual languages. The small FFT margin was confined to rare labels, indicating limited benefit from updating all parameters. Among PEFT techniques, QLoRA offered the strongest accuracy–efficiency balance; LoRA and DoRA were competitive, whereas LoHA and LoKR incurred larger losses. Adapter rank mattered: ranks below 128 sharply degraded Micro-F1. The substantial memory savings enable deployment on commodity GPUs while delivering performance very close to FFT. PEFT, particularly QLoRA, supports accurate and memory-efficient joint entity recognition and ICD-10 linking in multilingual, low-resource clinical settings.
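The parameter savings and rank sensitivity reported in the abstract can be sanity-checked with simple arithmetic. The sketch below is illustrative only (it is not drawn from the deposited code): it counts trainable parameters for a rank-r LoRA adapter on a single weight matrix versus fully fine-tuning that matrix.

```python
# Illustrative only: trainable-parameter count for a rank-r LoRA adapter
# on one d_out x d_in weight, versus fully fine-tuning that weight.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA freezes the base weight W and learns a low-rank update B @ A,
    # with B of shape (d_out, r) and A of shape (r, d_in).
    return r * (d_out + d_in)

def full_params(d_in: int, d_out: int) -> int:
    return d_in * d_out

# Example: one 768x768 projection (BERT-base hidden size) at rank 128
print(full_params(768, 768))       # 589824
print(lora_params(768, 768, 128))  # 196608
```

Because adapter size grows linearly in the rank, dropping from r = 128 to a much smaller rank shrinks the adapter proportionally, which is consistent with the sharp Micro-F1 degradation reported for ranks below 128.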
Dataset DOI: 10.5061/dryad.7m0cfxq8b
Description of the data and file structure
This dataset contains supplementary materials supporting the article “High-Fidelity Parameter-Efficient Fine-Tuning for Joint Recognition and Linking of Diagnoses to ICD-10 in Non-Standard Primary Care Notes” (JAMIA Open, 2025). The files include trained model checkpoints and the corresponding training and evaluation scripts used in the study. These resources were generated through extensive experiments on a corpus of Spanish and Catalan primary care clinical notes. Raw clinical data are not included due to privacy and legal restrictions. The deposited materials enable reproducibility of the reported results, facilitate inspection of model architectures and hyperparameters, and provide code templates that can be adapted to similar datasets.
Files and variables
bestmodels.zip.00*: Contains the checkpoints of the best-performing models, selected according to the training technique (Full Fine-Tuning, LoRA, QLoRA, DoRA, LoHA, LoKr).
bestmodels.zip.001: First part of the split archive containing the best-performing models.
bestmodels.zip.002: Second part of the split archive containing the best-performing models.
bestmodels.zip.003: Third part of the split archive containing the best-performing models.
bestmodels.zip.004: Fourth part of the split archive containing the best-performing models.
bestmodels.zip.005: Fifth part of the split archive containing the best-performing models.
bestmodels.zip.006: Sixth part of the split archive containing the best-performing models.
bestmodels.zip.007: Seventh part of the split archive containing the best-performing models.
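The seven `bestmodels.zip.00*` parts must be concatenated back into a single archive before extraction. A minimal sketch using only the standard library (the `join_parts` helper is illustrative and not part of `scripts.zip`; file names are assumed from the listing above):

```python
# Illustrative helper (not part of scripts.zip): concatenate numbered split
# parts (.001, .002, ...) back into a single zip archive for extraction.
from pathlib import Path

def join_parts(prefix: str, out_path: str, directory: str = ".") -> int:
    """Concatenate <prefix>.001, <prefix>.002, ... into out_path.

    Returns the number of bytes written; sorted() keeps parts in order.
    """
    parts = sorted(Path(directory).glob(prefix + ".0*"))
    total = 0
    with open(out_path, "wb") as dst:
        for part in parts:
            data = part.read_bytes()
            dst.write(data)
            total += len(data)
    return total

# Usage, after downloading all seven parts into the current directory:
#   join_parts("bestmodels.zip", "bestmodels.zip")
# then extract the result with `unzip` or Python's zipfile module.
```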
scripts.zip: Includes training and validation scripts, optional flags for experimentation, usage examples, and the list of required libraries.
Code/software
Software requirements
The supplementary materials can be used with Python 3.9.13. The following open-source libraries were used in our experiments and are required to reproduce the workflows:
bitsandbytes==0.42.0
cie==0.208
peft==0.14.0
torch==2.1.2+cu118
transformers==4.48.0
All packages are freely available via PyPI or Hugging Face.
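Before running the scripts, it can help to confirm that the pinned versions are actually installed. The sketch below uses only the standard library; the `check_pins` helper and the subset of pins shown are illustrative, not part of the deposited scripts (note that `torch==2.1.2+cu118` carries a CUDA local-version tag, so compare it accordingly):

```python
# Illustrative (not part of scripts.zip): report pinned packages that are
# missing or installed at a different version than required.
from importlib.metadata import version, PackageNotFoundError

PINS = {
    "bitsandbytes": "0.42.0",
    "peft": "0.14.0",
    "transformers": "4.48.0",
}

def check_pins(pins: dict) -> dict:
    """Return {package: installed_version_or_None} for each unsatisfied pin."""
    problems = {}
    for pkg, wanted in pins.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[pkg] = installed
    return problems

# An empty dict means every listed pin is satisfied:
#   check_pins(PINS)
```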
Workflow:
- Install Python 3.9.13 and the listed dependencies.
- Unpack the scripts.zip archive.
- Follow the included README to run training or evaluation scripts.
- Use the bestmodels.zip checkpoints with the provided scripts for inference or fine-tuning.
Access information
Other publicly accessible locations of the Scripts and Models:
