Dryad

Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links

Data files

Feb 04, 2025 version files 52.77 MB

Abstract

Linking registered clinical trials with their published results continues to be a challenge. A variety of natural language processing (NLP)-based and machine learning-based models have been developed to assist users in identifying these connections. In this study, articles from the PubMed Central full-text collection were scanned for mentions of ClinicalTrials.gov and international clinical trial registry identifiers. We analyzed the distribution of trial registry numbers across article sections and characterized the articles' publication type indexing and other metrics. Three supporting files are included: a PDF containing supplementary figures on the distribution of registry numbers found within the full text of articles, a CSV dataset listing each registry number discovered along with its XML path location within the document, and an example Python script that locates registry identifiers within an XML article document. Note that the purpose of this study is to summarize clinical trial mentions within publications; specific registries or other nominative information contained in this dataset may contain errors.
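The scanning step described above can be illustrated with a minimal sketch. This is not the script distributed with the dataset. It is a hypothetical illustration that assumes only two identifier formats (ClinicalTrials.gov, "NCT" followed by eight digits, and ISRCTN, "ISRCTN" followed by eight digits) and builds a simple slash-separated path of tag names for each match. The function name `find_registry_ids` and the pattern name `REGISTRY_PATTERN` are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: only two registry formats are covered here.
# ClinicalTrials.gov: "NCT" + 8 digits; ISRCTN: "ISRCTN" + 8 digits.
REGISTRY_PATTERN = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")


def find_registry_ids(xml_text):
    """Return (identifier, path) pairs for each registry ID found in the XML.

    The path is a slash-separated chain of tag names from the root to the
    element whose text contains the match (hypothetical format, chosen for
    illustration only).
    """
    root = ET.fromstring(xml_text)
    hits = []

    def walk(elem, path):
        # Text owned by this element: its leading text plus the tail text
        # that follows each of its children.
        chunks = [elem.text or ""] + [(child.tail or "") for child in elem]
        for chunk in chunks:
            for match in REGISTRY_PATTERN.finditer(chunk):
                hits.append((match.group(1), path))
        # Recurse into children, extending the path with each child's tag.
        for child in elem:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    return hits


# Small made-up article fragment used only to exercise the function.
sample = (
    "<article><front><abstract><p>Registered as NCT01234567.</p></abstract></front>"
    "<body><sec><p>See <italic>ISRCTN12345678</italic> for details and "
    "NCT01234567 again.</p></sec></body></article>"
)

for identifier, location in find_registry_ids(sample):
    print(identifier, location)
```

Recording the path alongside each identifier is what allows mentions to be grouped by article section afterwards, for example separating abstract mentions from body mentions.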