Evaluation results of the xMEN entity linking toolkit for multiple benchmark datasets

Borchert, Florian1; Llorca, Ignacio1; Roller, Roland2; Arnrich, Bert1; Schapranow, Matthieu-P.1

Published Dec 21, 2024 on Dryad. https://doi.org/10.5061/dryad.15dv41p6h

Abstract

This dataset contains the benchmark results of the xMEN toolkit for cross-lingual medical entity linking on the following publicly available benchmark datasets:

Mantra Gold Standard Corpus (multilingual)
Quaero (French)
BRONCO150 (German)
DisTEMIST (Spanish)
MedMentions (English + machine-translated multilingual versions)

For each dataset, we evaluate the default xMEN pipeline at each step of candidate generation, followed by weakly-supervised and fully-supervised re-ranking. Evaluation is performed on the test sets or, for BRONCO150, with 5-fold cross-validation.

Users of xMEN can use these data to compare their own results with the current state-of-the-art performance on these benchmarks, provided the benchmark datasets are loaded through the BigBIO library.


Description of the data and file structure

Evaluation of xMEN candidate generation + re-ranking (weakly and fully supervised) on various benchmark datasets.

Files and variables

Each file refers to a subset of a particular benchmark dataset.

For each subset, we run candidate generation followed by weakly-supervised re-ranking ([file_name]_ws.csv) or fully-supervised re-ranking ([file_name]_fs.csv).

Benchmark	Subset	file_name
Mantra	German	mantra_de
	English	mantra_en
	Spanish	mantra_es
	French	mantra_fr
	Dutch	mantra_nl
Quaero	-	quaero
BRONCO	Diagnoses	bronco_diagnoses
	Medications	bronco_medications
	Treatments	bronco_treatments
DisTEMIST	-	distemist
MedMentions	German	medmentions_de
	English	medmentions_en
	Spanish	medmentions_es
	French	medmentions_fr
	Dutch	medmentions_nl
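
The naming scheme above can be sketched in Python as follows. This is only an illustration: the helper names result_files and load_results are hypothetical, and the loader assumes that each CSV has a header row whose column names match the variables described below, which should be checked against the actual files.

```python
import csv

# Base file names copied from the table above.
BASE_NAMES = [
    "mantra_de", "mantra_en", "mantra_es", "mantra_fr", "mantra_nl",
    "quaero",
    "bronco_diagnoses", "bronco_medications", "bronco_treatments",
    "distemist",
    "medmentions_de", "medmentions_en", "medmentions_es",
    "medmentions_fr", "medmentions_nl",
]

# Suffix -> re-ranking setting, as described in the text.
SUFFIXES = {"ws": "weakly-supervised", "fs": "fully-supervised"}


def result_files():
    """Return the expected CSV file names, two per benchmark subset."""
    return [f"{base}_{suffix}.csv" for base in BASE_NAMES for suffix in SUFFIXES]


def load_results(path):
    """Read one result file into a list of dicts, one per row.

    Assumption: the first row of the file is a header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, result_files() yields names such as mantra_de_ws.csv and mantra_de_fs.csv, which can then be passed to load_results together with the download directory.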

Variables:

key: step of the xMEN pipeline
- ngram: TF-IDF over character n-grams
- sapbert: Cross-lingual SapBERT
- ensemble: Ensemble of ngram and sapbert
- candidates: Final candidates, i.e., including semantic type filtering if applicable
- cross_encoder: Candidates re-ranked with a (ws or fs) cross-encoder
recall_64: Recall@64, the proportion of ground-truth concepts retrieved among the top-64 predictions
precision_1: Precision@1, the proportion of true positives among the top-1 predictions (accounting for NIL)
recall_1: Recall@1, the proportion of ground-truth concepts retrieved among the top-1 predictions
fscore_1: F1-Score@1, harmonic mean of precision@1 and recall@1
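
To make the metric definitions concrete, here is a minimal Python sketch for a single mention with toy concept identifiers. It is not the evaluation code of the xMEN toolkit: the function names are hypothetical, and aggregation over mentions and the NIL handling mentioned for precision_1 are deliberately left out.

```python
def recall_at_k(gold, ranked, k):
    """Fraction of ground-truth concepts found among the top-k predictions."""
    if not gold:
        return 0.0
    top_k = set(ranked[:k])
    return sum(1 for concept in gold if concept in top_k) / len(gold)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up identifiers, not taken from any benchmark.
gold = {"C1", "C2"}
ranked = ["C3", "C1", "C2"]
recall_1 = recall_at_k(gold, ranked, 1)    # top-1 misses both gold concepts
recall_64 = recall_at_k(gold, ranked, 64)  # both gold concepts are in the list
```

The same f1_score function can be used to check that the fscore_1 column of a result file is consistent with its precision_1 and recall_1 columns, up to rounding.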

Code/software

Results are generated using the xMEN toolkit (https://github.com/hpi-dhc/xmen). The output is provided as plain CSV.

Access information

Data was derived from the following sources:

https://github.com/bigscience-workshop/biomedical
