Randomized controlled oncology trials with tumor stage inclusion criteria

Windisch, Paul; Zwahlen, Daniel R.

Published Jun 22, 2024; Updated Dec 04, 2024 on Dryad. https://doi.org/10.5061/dryad.g4f4qrfzn

Data files

Jun 22, 2024 version files 4.56 MB

Dec 04, 2024 version files 4.58 MB

Abstract

Background:

Extracting inclusion and exclusion criteria in a structured, automated fashion remains a challenge to developing better search functionalities or automating systematic reviews of randomized controlled trials in oncology. The question “Did this trial enroll patients with localized disease, metastatic disease, or both?” could be used to narrow down the number of potentially relevant trials when conducting a search.

Dataset collection:

600 randomized controlled trials from high-impact medical journals were classified depending on whether they allowed for the inclusion of patients with localized and/or metastatic disease. The dataset was randomly split into a training/validation and a test set of 500 and 100 trials respectively. However, the sets could be merged to allow for different splits.

Data properties:

Each trial is a row in the CSV file. For each trial there is a DOI, a publication date, a title, an abstract, the abstract sections (introduction, methods, results, conclusion), several tags associated with the annotation process (text, _input_hash, _task_hash, options, _view_id, config, accept, _timestamp, _annotator_id, _session_id), and the assigned labels (answer).


600 randomized controlled oncology trials from high-impact medical journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2005 and 2023 were randomly sampled and classified depending on whether they allowed for the inclusion of patients with localized and/or metastatic disease. The dataset was randomly split into a training/validation and a test set of 500 and 100 trials respectively. However, the sets could be merged to allow for different splits.
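Because the two sets can be merged, a different split can be drawn along the lines of the following minimal sketch. It uses only the Python standard library and toy rows in place of the real files; the function name `resplit`, the seed, and the placeholder DOIs are illustrative assumptions and not part of the dataset.

```python
import random


def resplit(train_rows, test_rows, test_size=100, seed=0):
    """Merge two lists of trial rows and draw a new random train/test split.

    `resplit` is a hypothetical helper, not part of the published code.
    """
    merged = list(train_rows) + list(test_rows)
    if not 0 <= test_size <= len(merged):
        raise ValueError("test_size must be between 0 and the number of rows")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(merged)
    # The first `test_size` shuffled rows become the new test set.
    return merged[test_size:], merged[:test_size]


# Toy stand-ins for the 500 training/validation and 100 test trials.
# The DOIs below are placeholders, not real identifiers.
train = [{"doi": f"10.0000/train{i}"} for i in range(500)]
test = [{"doi": f"10.0000/test{i}"} for i in range(100)]

new_train, new_test = resplit(train, test, test_size=120, seed=42)
print(len(new_train), len(new_test))
```

In practice the toy lists would be replaced by the rows read from the two CSV files, for example with `csv.DictReader`.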

Description of the data and file structure

Each trial is a row in the CSV file. For each trial there are the following columns:

doi: Digital Object Identifier of the trial
date: Publication date according to PubMed
title: Title of the trial according to PubMed
abstract: The abstract of the trial
abstract_introduction: The introduction section of the abstract. Parsed from the PubMed abstract using regular expressions.
abstract_methods: The methods section of the abstract. Parsed from the PubMed abstract using regular expressions. Note that sometimes this section might have a different name in the journal itself (e.g. “design, setting, and participants”).
abstract_results: The results section of the abstract. Parsed from the PubMed abstract using regular expressions.
abstract_conclusions: The conclusion section of the abstract. Parsed from the PubMed abstract using regular expressions.
text: The text that was displayed to the annotator.
accept: Contains “accept” if an annotator annotated the trial.
answer: The labels that the annotator assigned. Trials that allowed for the inclusion of patients with localized disease received the label “LOCAL”. Trials that allowed for the inclusion of patients with metastatic disease received the label “METASTATIC”. Trials that allowed for the inclusion of patients with either localized or metastatic disease received both labels. Screening trials that enrolled patients without known cancer or trials of interventions to prevent cancer were assigned no label.
options: List of labels that the annotator could select
_input_hash, _task_hash, _view_id, config, _timestamp, _annotator_id, _session_id: Additional columns automatically created by the annotation tool during the annotation process. For detailed documentation see https://prodi.gy/docs/api-components.
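For multi-label modelling, the `answer` column can be turned into one boolean per label, as in the following minimal sketch. It assumes that each `answer` cell is a string containing the names of the assigned labels (for example a serialized list) and is empty or an empty list when no label was assigned; the exact serialization in the CSV should be checked before relying on this. The helper `label_flags`, the sample rows, and the placeholder DOIs are hypothetical.

```python
import csv
import io

LABELS = ("LOCAL", "METASTATIC")


def label_flags(answer_cell):
    """Map a raw `answer` cell to one boolean per label.

    Assumption: the cell is a string that contains the label names,
    e.g. "['LOCAL', 'METASTATIC']". A substring check is used so the
    sketch does not depend on the exact serialization.
    """
    cell = answer_cell or ""
    return {label: label in cell for label in LABELS}


# Hypothetical rows that mimic the assumed layout; DOIs are placeholders.
sample = '''doi,answer
10.0000/a,"['LOCAL']"
10.0000/b,"['METASTATIC']"
10.0000/c,"['LOCAL', 'METASTATIC']"
10.0000/d,[]
'''

rows = list(csv.DictReader(io.StringIO(sample)))
flags = [label_flags(row["answer"]) for row in rows]

for row, flag in zip(rows, flags):
    print(row["doi"], flag)
```

Trials without any label (screening or prevention trials) come out with both flags set to false, which keeps them distinguishable from trials that received one or both labels.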

Sharing/Access information

The data is also available from a GitHub repository:

https://github.com/windisch-paul/metastatic_vs_local

Code/Software

The Python code that was used to train the model referenced in the associated publications is available as “analysis.ipynb” from the repository (https://github.com/windisch-paul/metastatic_vs_local).

Version changes

2024-12-03: Corrected annotations on ~1% of rows after additional review had been carried out as part of a follow-up project.