Randomized controlled clinical trials with tagged information regarding the number of participants
Data files

Jul 21, 2024 version (40.07 MB)
- README.md (4.04 KB)
- sample_size_test.csv (6.07 MB)
- sample_size_train.csv (33.99 MB)

Sep 16, 2024 version (40.07 MB)
- README.md (4.08 KB)
- sample_size_test.csv (6.07 MB)
- sample_size_train.csv (33.99 MB)
Abstract
Background:
Extracting the sample size from randomized controlled trials (RCTs) remains a challenge for developing better search functionalities and for automating systematic reviews. Most current approaches rely on the sample size being explicitly mentioned in the abstract.
Data collection:
A random sample of 996 randomized controlled trials (RCTs) from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2010 and 2022 was labeled. To do so, abstracts were retrieved as a txt file from PubMed and parsed using regular expressions (i.e., expressions that match certain patterns in text). For each trial, the number of people who were randomized was retrieved by looking at the abstract, followed by the full publication if the number could not be determined with certainty from the abstract. In addition, six different entities were tagged in each abstract, independent of whether the information was presented using words or integers. If the number of people who were randomized was explicitly stated (e.g., using the words “randomly,” “randomized,” etc.), this was tagged as “RANDOMIZED_TOTAL”. If the number of people who were analyzed was presented, this was tagged as “ANALYSIS_TOTAL”. If the number of people who completed the trial or a certain follow-up period was presented, this was tagged as “COMPLETION_TOTAL”. If the number of people who were part of the trial was presented without further specification, this was tagged as “GENERAL_TOTAL”. If the number of people who were assigned to an arm of the trial was presented, this was tagged as “ARM”. Lastly, if the number of patients who were assigned to an arm was presented in the context of how many patients experienced an event, this was tagged as “ARM_EVENT”. If the abstract did not contain any of the aforementioned entities, the manuscript was added to the dataset without any tags.
Data properties:
Each trial is a row in the csv file. For a detailed description, please have a look at the enclosed Readme file.
https://doi.org/10.5061/dryad.g1jwstr0b
A random sample of 996 randomized controlled trials (RCTs) from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2010 and 2022 was labeled. To do so, abstracts were retrieved as a txt file from PubMed and parsed using regular expressions (i.e., expressions that match certain patterns in text).
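The parsing code itself is not reproduced here. As a minimal, hypothetical sketch (the file name and the reliance on PubMed's numbered “Abstract (text)” export format are assumptions, not taken from the dataset), the export could be split into individual records like this:

```python
import re

# Minimal sketch (not the original parsing script): split a PubMed
# "Abstract (text)" export into individual records. The export starts each
# record with a numbered citation line such as "1. Lancet. 2020 ...".
# "pubmed_abstracts.txt" is a hypothetical file name.
with open("pubmed_abstracts.txt", encoding="utf-8") as fh:
    raw = fh.read()

# Split wherever a new line begins with "<number>. ".
records = [r.strip() for r in re.split(r"\n(?=\d+\. )", raw) if r.strip()]
print(f"Parsed {len(records)} abstracts")
```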
For each trial, the number of people who were randomized was retrieved by looking at the abstract, followed by the full publication if the number could not be determined with certainty from the abstract.
In addition, six different entities were tagged in each abstract, independent of whether the information was presented using words or integers:

- RANDOMIZED_TOTAL: the number of people who were randomized, stated explicitly (e.g., using the words “randomly,” “randomized,” etc.)
- ANALYSIS_TOTAL: the number of people who were analyzed
- COMPLETION_TOTAL: the number of people who completed the trial or a certain follow-up period
- GENERAL_TOTAL: the number of people who were part of the trial, without further specification
- ARM: the number of people who were assigned to an arm of the trial
- ARM_EVENT: the number of patients who were assigned to an arm, presented in the context of how many patients experienced an event

As a hypothetical example, in the sentence “50 of 200 people in the intervention arm and 20 of 203 people in the control arm experienced treatment-related toxicity”, 200 and 203 would be tagged as ARM_EVENT. If the abstract did not contain any of the aforementioned entities, the manuscript was added to the dataset without any tags.
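As a simplified, hypothetical illustration (this record is not taken from the dataset, and only a subset of the span fields described under “spans” below is shown), the two ARM_EVENT annotations in the example sentence could be represented as character-offset spans:

```python
# Hypothetical illustration of the span format for the example sentence above.
text = ("50 of 200 people in the intervention arm and 20 of 203 people "
        "in the control arm experienced treatment-related toxicity")

spans = []
for number in ("200", "203"):
    start = text.index(number)
    spans.append({
        "start": start,              # character offset where the span begins
        "end": start + len(number),  # character offset where the span ends
        "text": number,
        "label": "ARM_EVENT",
    })
print(spans)
```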
A total of 150 annotated examples were randomly assigned to an unseen test set. The remaining 846 examples were used to train and validate a named entity recognition model. However, the two sets could be merged to allow for different splits.
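For example, a new split could be drawn along these lines (a sketch using pandas; the 80/20 ratio and random seed are arbitrary choices, not part of the dataset):

```python
import pandas as pd

# Sketch: merge the provided train and test files and draw a new random split.
train = pd.read_csv("sample_size_train.csv")
test = pd.read_csv("sample_size_test.csv")
full = pd.concat([train, test], ignore_index=True)

new_train = full.sample(frac=0.8, random_state=42)
new_test = full.drop(new_train.index)
print(len(new_train), len(new_test))
```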
Description of the data and file structure
- doi: Digital Object Identifier of the publication
- date: Publication date according to PubMed
- journal: Journal the publication was published in according to PubMed
- title: Title of the publication according to PubMed
- text: The text that was displayed to the annotator (i.e., the abstract)
- tokens: The list of tokens that the “text” column was parsed into. Each token has a text, a start, an end, an id, and a boolean ws property that indicates whether the token is followed by whitespace.
- spans: The spans created by the annotator. Each span has a start, an end, a text, a source (if the span had been suggested by a model in the loop), an input_hash, a token_start, a token_end, and a label (see the loading sketch after this list).
- answer: Contains “accept” if an annotator annotated the publication.
- _input_hash, _task_hash, _is_binary, _timestamp, _annotator_id, _session_id: Additional columns automatically created by the annotation tool during the annotation process. For detailed documentation, see https://prodi.gy/docs/api-components
- Number_randomized: The ground truth (i.e., how many patients were randomized)
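A minimal loading sketch, assuming the “tokens” and “spans” columns are stored as JSON strings in the CSV (this storage format is an assumption; unannotated rows may be empty):

```python
import json
import pandas as pd

# Sketch: load the training file and count how often each span label occurs.
# Assumes the "spans" column holds JSON strings; unannotated rows may be NaN.
df = pd.read_csv("sample_size_train.csv")

label_counts = {}
for value in df["spans"].dropna():
    for span in json.loads(value):
        label_counts[span["label"]] = label_counts.get(span["label"], 0) + 1
print(label_counts)
```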
Sharing/Access information
The data are also available from a GitHub repository.
Code/Software
The Python code that was used to train the model referenced in the associated publications is available as “analysis.ipynb” from the repository (https://github.com/windisch-paul/sample_size_extraction).
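That notebook is the authoritative reference for the training procedure. As a rough, hypothetical sketch only (assuming spaCy, which is not confirmed by this page), the annotated spans could be converted into a binary training corpus along these lines:

```python
import json
import pandas as pd
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

# Hypothetical sketch, not the authors' notebook: convert the annotated spans
# into spaCy's binary format for use with `python -m spacy train`.
nlp = spacy.blank("en")
df = pd.read_csv("sample_size_train.csv")

doc_bin = DocBin()
for _, row in df.iterrows():
    doc = nlp.make_doc(row["text"])
    ents = []
    if isinstance(row["spans"], str):  # unannotated rows may be empty/NaN
        for span in json.loads(row["spans"]):
            ent = doc.char_span(span["start"], span["end"],
                                label=span["label"], alignment_mode="contract")
            if ent is not None:  # skip spans that do not align with token boundaries
                ents.append(ent)
    doc.ents = filter_spans(ents)  # drop overlapping spans, if any
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
```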