Medical publications with information as to whether a publication reports a randomized controlled trial and/or if it covers an oncology topic

Published Jul 13, 2024 on Dryad. https://doi.org/10.5061/dryad.gb5mkkx00

Data files

Jul 13, 2024 version files 5.21 MB

Abstract

Background:

Most tools that automatically extract information from medical publications are domain agnostic and process publications from any field. However, retrieving only trials from a specific field could have advantages for further processing of the data.

Dataset collection:

A random sample of 900 publications from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2010 and 2022 was annotated. Publications that described randomized controlled trials (RCTs) received the label “RCT”. Publications that covered oncological topics received the label “ONCOLOGY”. Publications that fulfilled both criteria were assigned both labels. Publications that neither described RCTs nor covered oncology topics were assigned no label. One hundred randomly sampled publications from the New England Journal of Medicine were used as the unseen test set because the journal publishes both oncology and non-oncology articles.

Data properties:

Each publication is a row in the CSV file. For each publication, there is a DOI, a publication date, a title, an abstract, the abstract sections (introduction, methods, results, conclusion), several fields associated with the annotation process (text, _input_hash, _task_hash, options, _view_id, config, accept, _timestamp, _annotator_id, _session_id), and the assigned labels (answer).

A random sample of 900 publications from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2010 and 2022 was annotated. Publications that described randomized controlled trials (RCTs) received the label “RCT”. Publications that covered oncological topics received the label “ONCOLOGY”. Publications that fulfilled both criteria were assigned both labels. Publications that neither described RCTs nor covered oncology topics were assigned no label. For the purpose of this prototype, publications on benign tumors such as uterine fibroids were considered oncology publications because of the similar terminology. Annotation was based on the title and abstract, which were retrieved as a txt file from PubMed and parsed using regular expressions.
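
The description does not state which PubMed export format or which regular expressions were used, so the following is only a minimal sketch of what such parsing could look like. It assumes the tagged MEDLINE-style export (fields such as PMID, TI, AB, with indented continuation lines); the function names parse_medline and split_sections and the toy records are invented for illustration and are not taken from the authors' code.

```python
import re

# Assumption: each field starts with a 2-4 letter tag, padding, then "- ".
FIELD_RE = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")
# Assumption: structured abstracts mark sections with upper-case headings
# followed by a colon, e.g. "METHODS: ...".
SECTION_RE = re.compile(r"\b([A-Z][A-Z ]{2,}):\s")


def parse_medline(text):
    """Parse tagged MEDLINE-style text into a list of {tag: value} dicts."""
    records = []
    # Records are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        record, tag = {}, None
        for line in block.splitlines():
            match = FIELD_RE.match(line)
            if match:
                tag, value = match.groups()
                # Repeated tags are joined with "; ".
                record[tag] = f"{record[tag]}; {value}" if tag in record else value
            elif tag and line.startswith(" "):
                # Indented lines continue the previous field.
                record[tag] += " " + line.strip()
        if record:
            records.append(record)
    return records


def split_sections(abstract):
    """Split a structured abstract into {heading: text}; {} if unstructured."""
    parts = SECTION_RE.split(abstract)
    # parts is [preamble, heading1, text1, heading2, text2, ...]
    return {
        heading.strip().lower(): body.strip()
        for heading, body in zip(parts[1::2], parts[2::2])
    }


# Invented toy records, not real publications.
SAMPLE = "\n".join([
    "PMID- 1",
    "TI  - A randomized trial of drug X",
    "      in breast cancer.",
    "AB  - BACKGROUND: Foo. METHODS: Bar.",
    "      RESULTS: Baz. CONCLUSIONS: Qux.",
    "DP  - 2015 Jan",
    "",
    "PMID- 2",
    "TI  - Second title.",
])

records = parse_medline(SAMPLE)
sections = split_sections(records[0]["AB"])
```

The heading pattern is deliberately simple and would misfire on abstracts that contain other upper-case phrases followed by a colon, so real exports would likely need a stricter list of accepted headings.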

One hundred randomly sampled publications from the New England Journal of Medicine were used as the unseen test set because the journal publishes both oncology and non-oncology articles. We decided against using a random sample of all publications as the test set, since the model might learn properties of the oncology-focused journals (JAMA Oncology, Lancet Oncology, and Journal of Clinical Oncology) during training. This would improve performance on the test set but would not generalize to the real-world application, in which the model does not know beforehand whether a publication comes from a dedicated oncology journal. However, the two sets can be merged to allow for different splits.
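
For readers who want a different split, merging and re-splitting could look roughly like the sketch below. It uses only the standard library and assumes each row is a dict with non-empty, unique "doi" and "journal" keys (the column names listed in this description); the helper names merge_and_split and journal_counts are hypothetical. Counting journals per split is one way to check how strongly a new test set is dominated by the oncology-focused journals.

```python
import random
from collections import Counter


def merge_and_split(train_rows, test_rows, test_fraction=0.1, seed=0):
    """Pool both sets, drop duplicate DOIs, and draw a new random split.

    Assumes every row has a non-empty "doi" value.
    Returns (new_train, new_test).
    """
    pooled = {}
    for row in list(train_rows) + list(test_rows):
        pooled.setdefault(row["doi"], row)  # keep the first row seen per DOI
    rows = list(pooled.values())
    random.Random(seed).shuffle(rows)  # seeded for reproducibility
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def journal_counts(rows):
    """Count rows per journal to inspect the journal mix of a split."""
    return Counter(row["journal"] for row in rows)
```

A random split of the pooled data reintroduces the journal-identity leakage described above, so results obtained this way are not directly comparable with results on the original held-out set.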

Description of the data and file structure

The dataset includes two CSV files, the rct_oncology train and test sets. Each file contains the following columns:

doi: Digital Object Identifier of the publication
date: Publication date according to PubMed
journal: Journal the publication was published in according to PubMed
abstract: The abstract of the publication
text: The text that was displayed to the annotator.
accept: Contains "accept" if an annotator annotated the publication.
answer: The labels that the annotator assigned. Publications that described randomized controlled trials (RCTs) received the label “RCT”. Publications that covered oncological topics received the label “ONCOLOGY”. Publications that fulfilled both criteria were assigned both labels. Publications that neither described RCTs nor covered oncology topics were assigned no label.
options: List of labels that the annotator could select
_input_hash, _task_hash, _view_id, config, _timestamp, _annotator_id, _session_id: Additional columns automatically created by the annotation tool during the annotation process. For detailed documentation, see https://prodi.gy/docs/api-components.
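
As a rough illustration of how the columns above might be consumed, the sketch below reads rows with the standard csv module and turns the label cell into two boolean flags. How the label list is serialized inside the CSV is an assumption here, so the decoder matches substrings instead of parsing the cell; the toy CSV content and the helper names read_rows and decode_labels are invented for illustration.

```python
import csv
import io


def read_rows(handle):
    """Read an open CSV text stream into a list of dicts keyed by column."""
    return list(csv.DictReader(handle))


def decode_labels(cell):
    """Map a raw label cell to boolean flags.

    Assumption: the cell contains the label names somewhere in its text
    (for example as a stringified list); an empty cell means no label.
    """
    text = (cell or "").upper()
    return {"rct": "RCT" in text, "oncology": "ONCOL" in text}


# Invented stand-in for one of the CSV files, reduced to two columns.
TOY_CSV = """doi,answer
10.1000/a,"['RCT', 'ONCOLOGY']"
10.1000/b,['RCT']
10.1000/c,[]
"""

rows = read_rows(io.StringIO(TOY_CSV))
# Attach the decoded flags to each row.
labeled = [dict(row, **decode_labels(row["answer"])) for row in rows]
```

For a downloaded file, the same function can be given a handle from open(path, newline="", encoding="utf-8"), where path points to wherever the CSV was saved. If the labels turn out to be stored in a different column, only the column name passed to decode_labels needs to change.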

Sharing/Access information

The data are also available from a GitHub repository:

https://github.com/windisch-paul/oncology_pipeline

Code/Software

The Python notebook that was used to train the model referenced in the associated publications is available as "analysis.ipynb" from the repository (https://github.com/windisch-paul/oncology_pipeline).