To automate the identification and analysis of software use in biomedical research, we developed a SciBERT-based machine learning model that extracts mentions of software from scientific articles. The model takes the full text of a scientific article as input and outputs a list of the software mentioned in it. We applied this model to the CORD-19 full-text articles and stored the output in this dataset, which includes metadata for over 77,000 COVID-19 and coronavirus-related papers and a list of the software tools mentioned in each.

We developed a machine learning model to extract mentions of software from scientific articles, and used the SoftCite dataset to train and evaluate it. We then applied the model to the CORD-19 collection of full-text coronavirus-related research papers. This dataset comprises the model's output together with the relevant metadata for each article.

Data are derived from the CORD-19 dataset provided by AllenAI, release version 2021-02-08 (changelog cord-19_2021-02-08.tar.gz 7.4GB c5446fea 29f69de2) downloaded from AWS on 08-Feb-2021.

Notes:

Not all papers in the CORD-19 dataset mention software. We include here only the subset of articles for which full text was available and in which at least one software mention was detected.
Software names have not been normalized, nor have they been resolved against any external dictionary. For example, the list of software mentions includes “excel”, “microsoft excel”, “ms excel”, and “office excel”.
The dataset contains DOIs for each mentioning paper where available (96% of papers). External identifiers (such as PubMed Central IDs, PubMed PMIDs, and arXiv IDs) for the remaining papers can often be imputed from the paper URLs; e.g. the arXiv ID for the paper with the URL "https://arxiv.org/pdf/2011.09270v1.pdf" is "2011.09270".
In some cases, a software mention is incorrectly split into multiple tokens, e.g. ['scikit', 'learn'].
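
The identifier imputation and name variation described in the notes above could be handled along the following lines. This is a minimal sketch and not part of the dataset's own tooling: the function names (arxiv_id_from_url, group_key, group_mentions), the regular expression, and the list of prefixes to strip are all illustrative assumptions. It does not attempt to rejoin mentions that were split into multiple tokens.

```python
import re
from collections import defaultdict

# Assumed pattern for new-style arXiv identifiers (YYMM.NNNNN) with an
# optional version suffix such as "v1"; older arXiv ID styles are not covered.
_ARXIV_RE = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)

# Hypothetical list of vendor/suite prefixes to strip; extend as needed.
_PREFIXES = ("microsoft ", "ms ", "office ")


def arxiv_id_from_url(url):
    """Return the arXiv ID embedded in a URL, or None if none is found."""
    match = _ARXIV_RE.search(url)
    return match.group(1) if match else None


def group_key(mention):
    """Crude grouping key: lowercase, collapse whitespace, drop known prefixes."""
    key = " ".join(mention.lower().split())
    stripped = True
    while stripped:
        stripped = False
        for prefix in _PREFIXES:
            if key.startswith(prefix):
                key = key[len(prefix):]
                stripped = True
    return key


def group_mentions(mentions):
    """Map each grouping key to the sorted raw mentions that share it."""
    groups = defaultdict(set)
    for mention in mentions:
        groups[group_key(mention)].add(mention)
    return {key: sorted(values) for key, values in groups.items()}
```

Applied to the examples in the notes, arxiv_id_from_url recovers "2011.09270" from the arXiv URL, and group_mentions places the four Excel variants under the single key "excel". Prefix stripping is deliberately simplistic; resolving names against a curated dictionary would be more reliable.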

Schema:

Column	Type	Description
paper_id	string	ID of the paper from the CORD-19 dataset: the 40-character SHA-1 hash of the PDF
doi	string	Digital Object Identifier of the article, from CORD-19
title	string	Title of the article, from CORD-19
source_x	array	Provenance of the article from CORD-19 dataset, e.g. arXiv, bioRxiv, Elsevier, Medline, PMC, WHO, Wiley
license	string	License of the article, from CORD-19
publish_time	date (mm/dd/yyyy)	Publication date of the article, from CORD-19
journal	string	Journal short name, from CORD-19 (e.g. PLoS Comput Biol)
url	array	URL(s) of article, from CORD-19
software	array	Software mentions extracted from article full-text
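
The schema above could be read into typed records roughly as follows. This is a sketch under stated assumptions: the file format is not specified here, so the code assumes one JSON object per line with the column names as keys. The class name SoftwareMentionRecord and the function parse_record are hypothetical, as is the treatment of missing values as None.

```python
import json
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class SoftwareMentionRecord:
    """One row of the dataset, following the schema table."""
    paper_id: str
    doi: Optional[str]            # not every paper has a DOI
    title: str
    source_x: List[str]
    license: str
    publish_time: Optional[date]
    journal: Optional[str]
    url: List[str]
    software: List[str]


def parse_record(line):
    """Parse one JSON line (assumed serialization) into a typed record."""
    raw = json.loads(line)
    published = raw.get("publish_time")
    return SoftwareMentionRecord(
        paper_id=raw["paper_id"],
        doi=raw.get("doi") or None,
        title=raw.get("title", ""),
        source_x=list(raw.get("source_x") or []),
        license=raw.get("license", ""),
        # The schema gives dates as mm/dd/yyyy.
        publish_time=(
            datetime.strptime(published, "%m/%d/%Y").date() if published else None
        ),
        journal=raw.get("journal") or None,
        url=list(raw.get("url") or []),
        software=list(raw.get("software") or []),
    )
```

If the data files are distributed in another format, such as CSV or Parquet, only the json.loads step would need to change; the field types and the date format follow the schema as given.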

CORD-19 Software Mentions

Data files

Abstract

Methods

Usage notes