Randomized controlled oncology trials with tumor stage inclusion criteria
Data files
Jun 22, 2024 version files (4.56 MB)
- metastatic_local_test.csv (775.04 KB)
- metastatic_local_train.csv (3.78 MB)
- README.md (3.11 KB)
Dec 04, 2024 version files (4.58 MB)
- metastatic_local_test.csv (778.15 KB)
- metastatic_local_train.csv (3.80 MB)
- README.md (3.26 KB)
Abstract
Background:
Extracting inclusion and exclusion criteria in a structured, automated fashion remains a challenge to developing better search functionalities or automating systematic reviews of randomized controlled trials in oncology. The question “Did this trial enroll patients with localized disease, metastatic disease, or both?” could be used to narrow down the number of potentially relevant trials when conducting a search.
Dataset collection:
600 randomized controlled trials from high-impact medical journals were classified depending on whether they allowed for the inclusion of patients with localized and/or metastatic disease. The dataset was randomly split into a training/validation and a test set of 500 and 100 trials respectively. However, the sets could be merged to allow for different splits.
Data properties:
Each trial is a row in the CSV file. For each trial there is a doi, a publication date, a title, an abstract, the abstract sections (introduction, methods, results, conclusion), several tags associated with the annotation process (text, _input_hash, _task_hash, options, _view_id, config, accept, _timestamp, _annotator_id, _session_id), and the assigned labels (answer).
https://doi.org/10.5061/dryad.g4f4qrfzn
600 randomized controlled oncology trials from high-impact medical journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2005 and 2023 were randomly sampled and classified depending on whether they allowed for the inclusion of patients with localized and/or metastatic disease. The dataset was randomly split into a training/validation and a test set of 500 and 100 trials respectively. However, the sets could be merged to allow for different splits.
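Merging the two files and drawing a different split can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure used to create the published split; the helper names `load_trials` and `resplit`, the default seed, and the UTF-8 encoding are assumptions.

```python
import csv
import random


def load_trials(path):
    """Read one of the CSV files into a list of dicts (one dict per trial)."""
    # Assumption: the files are UTF-8 encoded with a header row.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def resplit(train_rows, test_rows, n_test=100, seed=0):
    """Pool both sets and draw a new random split with n_test held-out trials."""
    pooled = list(train_rows) + list(test_rows)
    if not 0 < n_test < len(pooled):
        raise ValueError("n_test must be between 1 and len(pooled) - 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pooled)
    # The first n_test shuffled trials form the new test set.
    return pooled[n_test:], pooled[:n_test]
```

For example, `resplit(load_trials("metastatic_local_train.csv"), load_trials("metastatic_local_test.csv"), n_test=100, seed=42)` would return a new training/validation list and a new test list. Results obtained on a custom split are not directly comparable with results reported on the original test set.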
Description of the data and file structure
Each trial is a row in the CSV file. For each trial there are the following columns:
- doi: Digital Object Identifier of the trial
- date: Publication date according to PubMed
- title: Title of the trial according to PubMed
- abstract: The abstract of the trial
- abstract_introduction: The introduction section of the abstract. Parsed from the PubMed abstract using regular expressions.
- abstract_methods: The methods section of the abstract. Parsed from the PubMed abstract using regular expressions. Note that sometimes this section might have a different name in the journal itself (e.g. “design, setting, and participants”).
- abstract_results: The results section of the abstract. Parsed from the PubMed abstract using regular expressions.
- abstract_conclusions: The conclusion section of the abstract. Parsed from the PubMed abstract using regular expressions.
- text: The text that was displayed to the annotator.
- accept: Contains “accept” if an annotator annotated the trial.
- answer: The labels that the annotator assigned. Trials that allowed for the inclusion of patients with localized disease received the label “LOCAL”. Trials that allowed for the inclusion of patients with metastatic disease received the label “METASTATIC”. Trials that allowed for the inclusion of patients with either localized or metastatic disease received both labels. Screening trials that enrolled patients without known cancer or trials of interventions to prevent cancer were assigned no label.
- options: List of labels that the annotator could select
- _input_hash, _task_hash, _view_id, config, _timestamp, _annotator_id, _session_id: Additional columns automatically created by the annotation tool during the annotation process. For detailed documentation see https://prodi.gy/docs/api-components.
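For multi-label modeling, the label column can be turned into one boolean per label. The sketch below assumes that the raw cell is a string containing the label names (for example "['LOCAL', 'METASTATIC']") and that an empty cell means no label. The exact serialization, and which column holds the labels, is not specified here and should be checked against the actual CSV; `label_flags` is a hypothetical helper.

```python
LABELS = ("LOCAL", "METASTATIC")


def label_flags(answer_cell):
    """Map a raw label cell to one boolean per label.

    Assumption: the cell is a string that contains the assigned label
    names; None or an empty string is treated as "no label assigned".
    """
    text = answer_cell or ""
    # Substring matching is safe here because neither label name
    # is contained in the other.
    return {label: label in text for label in LABELS}
```

A trial that allowed both localized and metastatic disease yields `True` for both keys, while a screening or prevention trial yields `False` for both.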
Sharing/Access information
The data is also available from a GitHub repository.
Code/Software
The Python code that was used to train the model referenced in the associated publications is available as “analysis.ipynb” from the repository (https://github.com/windisch-paul/metastatic_vs_local).
Version changes
2024-12-03: Corrected annotations on ~1% of rows after additional review had been carried out as part of a follow-up project.
Randomized controlled oncology trials from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2005 and 2023 were randomly sampled and annotated with the labels “LOCAL”, “METASTATIC”, both, or none. Trials that allowed for the inclusion of patients with localized disease received the label “LOCAL”. Trials that allowed for the inclusion of patients with metastatic disease received the label “METASTATIC”. Trials that allowed for the inclusion of patients with either localized or metastatic disease received both labels. Screening trials that enrolled patients without known cancer or trials of interventions to prevent cancer were assigned no label. Trials of tumor entities where the distinction between localized and metastatic disease is usually not made (e.g., hematologic malignancies) were skipped. Annotation was based on the title and abstract. If those were inconclusive, the full text of the publication was evaluated. If this was not conclusive either, the registration or the protocol was evaluated. Annotation was performed by a single author (P.W.) using the annotation tool Prodigy (v. 1.13.1). In total, 600 trials were annotated.