Dryad

Randomized controlled clinical trials with tagged information regarding the number of participants

Data files

Jul 21, 2024 version files 40.07 MB
Sep 16, 2024 version files 40.07 MB

Abstract

Background:

Extracting the sample size from randomized controlled trials (RCTs) remains a challenge for developing better search functionalities and automating systematic reviews. Most current approaches rely on the sample size being explicitly mentioned in the abstract.

Data collection:

A random sample of 996 randomized controlled trials (RCTs) from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2010 and 2022 was labeled. To do so, abstracts were retrieved from PubMed as a plain-text (.txt) file and parsed using regular expressions (i.e., expressions that match certain patterns in text). For each trial, the number of people who were randomized was determined from the abstract or, if it could not be determined with certainty from the abstract, from the full publication.

In addition, six different entities were tagged in each abstract, regardless of whether the number was written as a word or as an integer. If the number of people who were randomized was explicitly stated (e.g., using the words “randomly”, “randomized”, etc.), this was tagged as “RANDOMIZED_TOTAL”. If the number of people who were analyzed was presented, this was tagged as “ANALYSIS_TOTAL”. If the number of people who completed the trial or a certain follow-up period was presented, this was tagged as “COMPLETION_TOTAL”. If the number of people who were part of the trial was presented without further specification, this was tagged as “GENERAL_TOTAL”. If the number of people who were assigned to an arm of the trial was presented, this was tagged as “ARM”. Lastly, if the number of patients who were assigned to an arm was presented in the context of how many patients experienced an event, this was tagged as “ARM_EVENT”. If the abstract contained none of these entities, the manuscript was added to the dataset without any tags.
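The retrieval and parsing step described above could look roughly like the following sketch. It is not the authors' code. It assumes the PubMed export uses the MEDLINE-style text layout (a short field tag such as PMID or AB, followed by "- " and the value, with indented continuation lines and a blank line between records). The function name `parse_abstracts` and the sample records are hypothetical.

```python
import re

# Assumed MEDLINE-style field line: a 2-4 letter tag, optional padding, "- ", value.
FIELD_RE = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")


def parse_abstracts(text):
    """Return one {'pmid': ..., 'abstract': ...} dict per record in the export."""
    records = []
    # Assumption: records are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        current_tag = None
        for line in block.splitlines():
            match = FIELD_RE.match(line)
            if match:
                # A new field starts; a tag may occur several times per record.
                current_tag = match.group(1)
                fields.setdefault(current_tag, []).append(match.group(2).strip())
            elif current_tag and line[:1].isspace():
                # Indented lines continue the value of the most recent field.
                fields[current_tag][-1] += " " + line.strip()
        if "PMID" in fields:
            records.append({
                "pmid": fields["PMID"][0],
                "abstract": " ".join(fields.get("AB", [])),
            })
    return records


# Hypothetical two-record export; the second record has no abstract.
sample = """PMID- 11111111
TI  - A hypothetical trial
AB  - Background: text. Methods: We randomly assigned 250
      patients to drug A or placebo.

PMID- 22222222
TI  - Another hypothetical trial
"""

records = parse_abstracts(sample)
```

Records without an AB field yield an empty abstract string, which mirrors the case where a manuscript enters the dataset without any tags.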

Data properties:

Each trial is a row in the CSV file. For a detailed description, please see the enclosed README file.
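Because each trial is one row, the file can be read into a list of per-trial dictionaries with Python's standard `csv` module. This is a minimal sketch: the helper name `load_trials` is hypothetical, UTF-8 encoding is an assumption, and no column names are assumed here because they are documented in the README.

```python
import csv


def load_trials(path):
    """Read the dataset CSV and return one dict per trial, keyed by column header."""
    # Assumption: the file is UTF-8 encoded and its first row holds the headers.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

The length of the returned list gives the number of trials, which should match the 996 labeled RCTs if the file is complete.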