
A web-based tool for automatically linking clinical trials to their publications - example calculation

Cite this dataset

Smalheiser, Neil; Holt, Arthur (2022). A web-based tool for automatically linking clinical trials to their publications - example calculation [Dataset]. Dryad.


Objective: Evidence synthesis teams, physicians, policy makers, and patients and their families all have an interest in following the outcomes of clinical trials and would benefit from being able to evaluate the results both as posted in trial registries and as reported in the publications that arise from them. Manual searching for publications arising from a given trial is a laborious and uncertain process. We sought to create a statistical model to automatically identify PubMed articles likely to report clinical outcome results from each registered trial in ClinicalTrials.gov.

Materials and Methods: A machine learning-based model was trained on pairs (publications linked to specific registered trials). Multiple features were constructed based on the degree of matching between the PubMed article metadata and specific fields of the trial registry, as well as matching with the set of publications already known to be linked to that trial.

Results: Evaluation of the model using NCT-linked articles as gold standard showed that they tend to be top ranked (median best rank = 1.0), and 91% of them are ranked in the top ten.

Discussion: Based on this model, we have created a free, public web-based tool that, given any registered trial in ClinicalTrials.gov, presents a ranked list of PubMed articles in order of estimated probability that they report clinical outcome data from that trial. The tool should greatly facilitate studies of trial outcome results and their relation to the original trial designs.


Two datasets have been included that contain the literal lists of NCT numbers and PubMed PMIDs for clinical trials and articles that were used in our model evaluation samples. Both samples were selected randomly from the full list of ~380,000 trials available on or about May 3, 2021. The two files are:

TrialSample_Exactly1NCTlinked.csv - contains the 300 trials with searched articles that, on or about May 3, 2021, had only one known clean "NCT-linked" article as described in our manuscript. There are 1,190,544 data rows with 1 header row.

TrialSample_2orMoreNCTlinked.csv - contains the 300 trials with searched articles that, on or about May 3, 2021, had 2 or more clean "NCT-linked" articles as described in our manuscript. There are 1,150,544 data rows with 1 header row.

Both datasets are comma-delimited files with headers and contain the following columns:

NCT_num - the trial NCT number as registered with ClinicalTrials.gov
PMID - the article identifier from PubMed
prob - score as calculated by our Trials-to-Publication link model. Values range from 0 to 1, with 1 being most likely linked.
NCTlinkreal - Indicator that the article is "NCT-linked" as described in our manuscript and considered our "gold standard"
Reslinkreal - Indicator that the article is listed in the Trial's results reference record, and considered our "silver standard"
rank - integer rank by descending prob, top (or most likely) rank is 1 per trial
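
As an illustration of how these columns fit together, the following minimal sketch computes two summary figures of the kind reported in the abstract: the median best rank of gold-standard articles per trial, and the share of gold-standard articles ranked in the top ten. It is not the authors' evaluation code. The sample rows are made up, the function name is hypothetical, and the encoding of NCTlinkreal as 1 for linked articles is an assumption that should be checked against the actual files.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Tiny made-up sample in the documented column layout (not real data).
SAMPLE_CSV = """NCT_num,PMID,prob,NCTlinkreal,Reslinkreal,rank
NCT00000001,111,0.95,1,0,1
NCT00000001,112,0.40,0,1,2
NCT00000002,221,0.80,0,0,1
NCT00000002,222,0.60,1,0,2
NCT00000002,223,0.10,1,0,12
"""


def gold_rank_summary(handle):
    """Return (median best gold rank per trial, share of gold articles in top ten)."""
    best_rank = defaultdict(lambda: float("inf"))
    gold_total = 0
    gold_top10 = 0
    for row in csv.DictReader(handle):
        # Assumption: the indicator column holds 1 for NCT-linked articles.
        if float(row["NCTlinkreal"]) != 1:
            continue
        rank = int(row["rank"])
        trial = row["NCT_num"]  # kept as a string
        best_rank[trial] = min(best_rank[trial], rank)
        gold_total += 1
        gold_top10 += rank <= 10
    return median(best_rank.values()), gold_top10 / gold_total


median_best, top10_share = gold_rank_summary(io.StringIO(SAMPLE_CSV))
print(median_best, round(top10_share, 3))
```

To run the same summary on one of the provided files, the in-memory sample can be replaced with a handle from open(path, newline="").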

The software archive contains the necessary clinical trial data, PubMed article data, lookup tables and Python scripts needed to calculate the model described in our manuscript for a single preselected clinical trial. Whereas the web-based tool retrieves trial and article data from live databases that are updated daily from ClinicalTrials.gov and PubMed, this example calculation contains static snapshots of the needed information stored in Python pickle files.

The included files are as follows:

Archive name:
dataset_01.pkl - dataset_24.pkl: static copies of trial and article data needed for model computation
adam.pkl: list of medline abbreviations described by Zhou W, Torvik VI, Smalheiser NR. ADAM: another database of abbreviations in MEDLINE. Bioinformatics 2006; 22(22): 2813-2818.
countries.pkl: list of standardized country names
nicknames.pkl: list of common nicknames for human names
stoplist.pkl and stoplist.txt: list of approx. 300 common stop words
stoplist1000.txt: list of approx. 1000 stop words for use in article abstract searching
stoplistmesh.pkl: list of MeSH terms that are excluded from clinical trial searching
main script that will run the example scoring with the provided data
script that contains functions that are used for model computation
scoreout_NCT03745053_data_original.csv: example output from the computation program(s)
README.txt: basic instructions for running the code

Usage notes


NCT_num should be treated as a string and will match the NCT numbers from ClinicalTrials.gov. PMID may be treated as an integer and will match the PMID from PubMed. As both ClinicalTrials.gov and PubMed are live databases, it is possible that some trials or articles may substantially change or be deleted in the future. These samples were drawn on or about May 3, 2021.
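
The type guidance above can be sketched with the Python standard library as follows. This is a minimal illustration, not part of the provided archive: the sample rows are invented, read_rows is a hypothetical helper name, and parsing prob as a float and rank as an integer follows the column descriptions rather than any stated requirement.

```python
import csv
import io

# Invented rows in the documented column layout (not real data).
SAMPLE_CSV = """NCT_num,PMID,prob,NCTlinkreal,Reslinkreal,rank
NCT00000001,111,0.95,1,0,1
NCT00000001,112,0.40,0,1,2
"""


def read_rows(handle):
    """Yield one dict per data row, converting columns to the suggested types."""
    for row in csv.DictReader(handle):
        yield {
            "NCT_num": row["NCT_num"],   # string, e.g. "NCT00000001"
            "PMID": int(row["PMID"]),    # integer article identifier
            "prob": float(row["prob"]),  # model score between 0 and 1
            "rank": int(row["rank"]),    # 1 is the top rank within a trial
        }


rows = list(read_rows(io.StringIO(SAMPLE_CSV)))
print(rows[0])
```

Because each file holds more than a million rows, iterating over read_rows(open(path, newline="")) row by row avoids loading everything into memory at once.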

Example Code:

The README.txt file describes the requisite computing environment needed to run the example calculation.  The general steps needed to perform the example model application are:

* Extract all files in the zip archive to the same folder in a Linux environment with Python 3.6 available
* cd <to the folder where the files were extracted>
* execute:

* The program will output a data file named "scoreout_NCT03745053_data.csv".  This should match exactly the provided "scoreout_NCT03745053_data_original.csv" file.
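
One way to check that the generated file matches the provided one is a byte-for-byte comparison. The sketch below uses filecmp from the Python standard library; outputs_match is a hypothetical helper name, and the demonstration runs on throwaway files rather than the real outputs.

```python
import filecmp
import os
import tempfile


def outputs_match(new_path, reference_path):
    """Return True when the two files are byte-for-byte identical."""
    # shallow=False forces a content comparison instead of trusting file metadata.
    return filecmp.cmp(new_path, reference_path, shallow=False)


# Demonstration on temporary files with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.csv")
    copy = os.path.join(tmp, "copy.csv")
    changed = os.path.join(tmp, "changed.csv")
    for path, text in ((first, "x,y\n1,2\n"), (copy, "x,y\n1,2\n"), (changed, "x,y\n1,3\n")):
        with open(path, "w", newline="") as fh:
            fh.write(text)
    same = outputs_match(first, copy)
    different = outputs_match(first, changed)

print(same, different)
```

For the example calculation, the call would be outputs_match("scoreout_NCT03745053_data.csv", "scoreout_NCT03745053_data_original.csv"), run from the folder where the files were extracted.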


National Institute on Aging, Award: P01AG039347