Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links
Data files
Feb 04, 2025 version files (52.77 MB)
- Distribution_of_Trial_Registry_Numbers_ScanOutput.zip (52.77 MB)
- README.md (4.85 KB)
Abstract
Linking registered clinical trials with their published results continues to be a challenge. A variety of natural language processing (NLP)-based and machine learning-based models have been developed to assist users in identifying these connections. Articles from the PubMed Central full-text collection were scanned for mentions of ClinicalTrials.gov and international clinical trial registry identifiers. We analyzed the distribution of trial registry numbers within sections of the articles and characterized their publication type indexing and other metrics. Three supporting files are included herein: a PDF containing supplementary figures pertaining to the distribution of registry numbers found within the full text of articles, a CSV dataset providing the registry numbers discovered and the corresponding XML path location within the document, and an example Python script to locate registry identifiers within an XML article document. Note that the purpose of this study is to summarize clinical trial mentions within publications; specific registries or other nominative information contained in this dataset may contain errors.
README: Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links
https://doi.org/10.5061/dryad.dbrv15fb1
This data set contains a table with every combination of publication ID, registry number, XML path, and section of the publication discovered in the Full-Text scanning of PubMed Central articles.
Description of the data and file structure
Distribution_of_Trial_Registry_Numbers_Additional_File.pdf
This document contains charts and summaries of the trial registry numbers found from the XML document scanning process. The explicit criteria for locating registry identifiers and designating article sections are provided in this document and may be useful for further research and refinement.
Distribution_of_Trial_Registry_Numbers_ScanOutput.zip
This zip archive contains a comma-separated file named "xmlScanOutput.csv" that contains all rows of registry numbers and article identifiers found during the full-text scanning process. This file has a total of 4,039,972 rows, including the header row. The header row contains column names, all of which are enclosed in double quotes ("). Data in the columns indicated as Character below are enclosed in double quotes; Integer and Float columns are not. Lines are terminated with a Linux-style line feed (\n). Columns are described below. While registry identifiers from ClinicalTrials.gov were confirmed to match existing trials at the time of analysis, no attempt was made to validate non-U.S. registry identifiers. This dataset is a snapshot taken on or about December 15, 2024. Missing data, designated with n/a, should be interpreted as data that were not available at the time of our analysis, not as confirmation that a publication lacks a particular attribute. This dataset may enable detailed analysis of specific clinical trials or further investigation of international registries as they relate to biomedical publications by providing the document identifiers, registry identifiers, and document XML path nomenclature discovered in our research.
Columns:
line - Integer - unique identifier for each line. These are not necessarily sequential integers starting at 1.
AccessionID - Character - article identifier from PubMed Central
pmid - Integer - article identifier from PubMed
XMLpath - Character - the full XML path from each section of full-text documents
section - Character - generalized section derived from the full XML path.
RegistryNumber - Character - the trial registry number found in the text of the section
isNCT - Integer - 0/1 indicator: 1 if the registry number is an NCT number from ClinicalTrials.gov, 0 otherwise
NLMPT - Character - list of publication types as indexed from NLM, separated by "|". Missing values are marked with n/a and indicate articles for which NLM did not provide publication-type indexing.
MTPT - Character - list of publication types as predicted from the MultiTagger model, separated by "|". Missing values are marked with n/a and indicate cases where the model did not produce a prediction.
TPscore - Float - Trials-to-Publication model score, n/a if no score is available.
Country - Character - Country or Region associated with the registry identified in the RegistryNumber column
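As an illustration of the column conventions above, the following is a minimal Python sketch that reads rows with the standard csv module, maps n/a to None, and splits the "|"-separated publication-type lists. The function names and the sample row are hypothetical and made up for this example; they are not taken from the dataset or from the authors' code.

```python
import csv
import io

MISSING = "n/a"  # missing-value marker described in this README


def parse_row(row):
    """Convert one csv.DictReader row into typed Python values."""
    def optional(value, cast):
        # n/a (or an empty field) means "not available", so keep it as None
        return None if value in ("", MISSING) else cast(value)

    def tag_list(value):
        # Publication types are stored as a "|"-separated list
        return None if value in ("", MISSING) else value.split("|")

    return {
        "line": int(row["line"]),
        "AccessionID": row["AccessionID"],
        "pmid": optional(row["pmid"], int),
        "XMLpath": row["XMLpath"],
        "section": row["section"],
        "RegistryNumber": row["RegistryNumber"],
        "isNCT": row["isNCT"] == "1",
        "NLMPT": tag_list(row["NLMPT"]),
        "MTPT": tag_list(row["MTPT"]),
        "TPscore": optional(row["TPscore"], float),
        "Country": optional(row["Country"], str),
    }


def read_rows(handle):
    """Yield parsed rows from an open text handle; the first line is the header."""
    for row in csv.DictReader(handle):
        yield parse_row(row)


# Made-up sample in the documented layout (not real data from the file)
SAMPLE = (
    '"line","AccessionID","pmid","XMLpath","section","RegistryNumber",'
    '"isNCT","NLMPT","MTPT","TPscore","Country"\n'
    '7,"PMC0000001",12345,"/article/body/sec/p","Methods","NCT00000000",'
    '1,"Journal Article|Clinical Trial","n/a",0.5,"United States"\n'
)

rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0]["NLMPT"], rows[0]["MTPT"], rows[0]["TPscore"])
```

To read the real file, the same function could be called on a handle such as open("xmlScanOutput.csv", newline="", encoding="utf-8"); the UTF-8 encoding is an assumption and is not stated in this README.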
Distribution_of_Trial_Registry_Numbers_xmlSearchCode.py
This Python script contains the XML search function used to scan the full-text XML files from PubMed Central.
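For readers who want a feel for this kind of scan without opening the script, here is a minimal, hypothetical sketch. It is not the authors' search function: it only looks for ClinicalTrials.gov identifiers (the letters NCT followed by eight digits) in element text using the standard library, and it records a simple slash-separated path of tag names. The function name, the path format, and the sample XML are assumptions made for illustration; the explicit criteria used in the study are described in the Additional File PDF.

```python
import re
import xml.etree.ElementTree as ET

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_nct_mentions(xml_text):
    """Return (xml_path, registry_number) pairs for NCT matches in element text."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(element, path):
        here = f"{path}/{element.tag}"
        # element.text is the text before the first child; each child's tail
        # is the text that follows that child inside this element
        chunks = [element.text or ""] + [child.tail or "" for child in element]
        for chunk in chunks:
            for match in NCT_PATTERN.finditer(chunk):
                hits.append((here, match.group()))
        for child in element:
            walk(child, here)

    walk(root, "")
    return hits


# Made-up article fragment with made-up registry numbers
SAMPLE = """<article>
  <front><article-meta><abstract><p>Registered as NCT01234567.</p></abstract></article-meta></front>
  <body><sec><title>Methods</title><p>See <italic>trial</italic> NCT01234567 and NCT07654321.</p></sec></body>
</article>"""

mentions = find_nct_mentions(SAMPLE)
for xml_path, number in mentions:
    print(xml_path, number)
```

This sketch ignores attribute values (for example, link targets) and non-NCT registries, both of which a fuller scan would likely need to handle.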
Sharing/Access information
Data was derived from the following sources:
Code/Software
Below is an example of a possible way to load the data file into a MySQL 5.7 database. Modifications may be necessary for other database platforms and installations.
Example MySQL table definition:
CREATE TABLE `PMCDATA` (
`line` int(11) NOT NULL,
`AccessionID` varchar(100) DEFAULT NULL,
`pmid` int(11) DEFAULT NULL,
`XMLpath` varchar(255) DEFAULT NULL,
`section` varchar(100) DEFAULT NULL,
`RegistryNumber` text,
`isNCT` tinyint(3) unsigned DEFAULT NULL,
`NLMPT` text,
`MTPT` text,
`TPscore` text DEFAULT NULL,
`Country` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Example MySQL load statement:
LOAD DATA LOCAL INFILE "xmlScanOutput.csv"
INTO TABLE PMCDATA
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
Test Query:
SELECT `section`, COUNT(DISTINCT pmid)
FROM PMCDATA
WHERE isNCT=1 AND pmid > 0
GROUP BY `section`;
/* results: these counts match the # of unique articles depicted in
Fig 1. Count of Unique and Total NCT Number Mentions Discovered by Section
Article-Metadata 66923
Conclusions 3687
Introduction 22800
Methods 96254
Other 134137
Results 6648
Table 25376
*/
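For users without a MySQL installation, the same kind of count (distinct pmid values per section, restricted to NCT rows with a positive pmid) could be approximated directly from the CSV in Python. This is a sketch under stated assumptions: the function name is hypothetical, the sample rows are made up, and rows whose pmid is not a positive integer (for example n/a) are simply skipped to mirror the pmid > 0 condition.

```python
import csv
import io
from collections import defaultdict


def count_articles_by_section(handle):
    """Count distinct PMIDs per section among rows flagged as NCT numbers."""
    pmids_by_section = defaultdict(set)
    for row in csv.DictReader(handle):
        if row["isNCT"] != "1":
            continue  # keep only ClinicalTrials.gov (NCT) rows
        pmid = row["pmid"]
        if not pmid.isdigit() or int(pmid) <= 0:
            continue  # skip n/a or non-positive PMIDs, like "pmid > 0"
        pmids_by_section[row["section"]].add(int(pmid))
    return {
        section: len(pmids)
        for section, pmids in sorted(pmids_by_section.items())
    }


# Made-up rows in the documented layout (not real data from the file)
SAMPLE = (
    '"line","AccessionID","pmid","XMLpath","section","RegistryNumber",'
    '"isNCT","NLMPT","MTPT","TPscore","Country"\n'
    '1,"PMC1",111,"/a","Methods","NCT00000001",1,"n/a","n/a",n/a,"n/a"\n'
    '2,"PMC1",111,"/b","Methods","NCT00000002",1,"n/a","n/a",n/a,"n/a"\n'
    '3,"PMC2",222,"/a","Methods","NCT00000001",1,"n/a","n/a",n/a,"n/a"\n'
    '4,"PMC2",222,"/c","Results","NCT00000001",1,"n/a","n/a",n/a,"n/a"\n'
    '5,"PMC3",n/a,"/a","Results","NCT00000003",1,"n/a","n/a",n/a,"n/a"\n'
    '6,"PMC4",333,"/a","Results","ISRCTN00000000",0,"n/a","n/a",n/a,"n/a"\n'
)

counts = count_articles_by_section(io.StringIO(SAMPLE))
print(counts)
```

Whether this reproduces the MySQL counts exactly on the full file has not been verified here; differences could arise from how each tool handles n/a values in the pmid column.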
Methods
These datasets and files are the results of scanning 6,901,686 XML documents within the PubMed Central Open Access article datasets available at: https://ftp.ncbi.nlm.nih.gov/pub/pmc/
Each registry identifier match is represented by a row in the xmlScanOutput.csv file, along with PubMed identifiers, file information, XML path information, and several computed columns including a validation that an NCT number exists within ClinicalTrials.gov, a generalized article section, and publication types from multiple indexing sources. Summaries within the Distribution_of_Trial_Registry_Numbers_Additional_File.pdf were generated by counting distinct PMID values within the csv file across various groups.