CORD-19 Software Mentions

Citation

Wade, Alex D.; Williams, Ivana (2021), CORD-19 Software Mentions, Dryad, Dataset, https://doi.org/10.5061/dryad.vmcvdncs0

Abstract

To automate the identification and analysis of software use in biomedical research, we developed a SciBERT-based machine learning model that extracts mentions of software from scientific articles. The input to the model is the full text of a scientific article, and the output is a list of the software mentioned in it. We applied this model to the CORD-19 full-text articles and stored the output in this dataset, which includes metadata for over 77,000 COVID-19 and coronavirus-related papers and a list of the software tools mentioned in each.

Methods

We have developed a machine learning model to extract mentions of software from scientific articles. The SoftCite dataset was used to train and evaluate the model. This model has been applied to the CORD-19 collection of full-text coronavirus-related research papers. This dataset comprises the output of this model and each scientific article's relevant metadata. 

Data are derived from the CORD-19 dataset provided by AllenAI, release version 2021-02-08 (changelog cord-19_2021-02-08.tar.gz 7.4GB c5446fea 29f69de2) downloaded from AWS on 08-Feb-2021. 

Usage Notes

Notes:

  1. Not all papers in the CORD-19 dataset mention software. This dataset includes only the subset of articles for which full text was available and in which at least one software mention was detected.
  2. Software names have not been normalized, nor have they been resolved against any external dictionary. For example, the list of software mentions includes “excel”, “microsoft excel”, “ms excel”, and “office excel”.
  3. The dataset contains a DOI for each mentioning paper where available (96% of papers). External identifiers (such as PubMed Central IDs, PubMed PMIDs, and arXiv IDs) for the remaining papers can often be imputed from the paper URLs, e.g. the arXiv ID for the paper with the URL "https://arxiv.org/pdf/2011.09270v1.pdf" is "2011.09270".
  4. In some cases, a single software mention is incorrectly split into multiple tokens, e.g. ['scikit', 'learn'].
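
Notes 2 and 3 can be illustrated with a minimal Python sketch. It is not part of the dataset or of the authors' pipeline. The function names, the regular expression (which assumes modern-style arXiv identifiers of the form YYMM.NNNNN) and the alias table are hypothetical, and the alias table would have to be built by hand for the names of interest.

```python
import re
from typing import Optional

# Assumption: modern-style arXiv IDs (YYMM.NNNNN) with an optional version suffix.
_ARXIV_URL_RE = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)

def arxiv_id_from_url(url: str) -> Optional[str]:
    """Return the arXiv ID embedded in a paper URL, or None if there is none."""
    match = _ARXIV_URL_RE.search(url)
    return match.group(1) if match else None

# Hypothetical, hand-built alias table; the dataset ships no such mapping.
SOFTWARE_ALIASES = {
    "excel": "microsoft excel",
    "ms excel": "microsoft excel",
    "office excel": "microsoft excel",
    "microsoft excel": "microsoft excel",
}

def normalize_software(mention: str) -> str:
    """Lowercase, collapse whitespace, and map known variants to one name."""
    key = " ".join(mention.lower().split())
    return SOFTWARE_ALIASES.get(key, key)

print(arxiv_id_from_url("https://arxiv.org/pdf/2011.09270v1.pdf"))
print(normalize_software("MS Excel"))
```

Names missing from the alias table pass through unchanged (apart from lowercasing and whitespace cleanup), so the mapping can be extended incrementally.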

Schema:

Column        Type               Description
paper_id      string             ID of paper from CORD-19 dataset. 40-character sha1 of the PDF
doi           string             Digital Object Identifier of the article, from CORD-19
title         string             Title of the article, from CORD-19
source_x      array              Provenance of the article from CORD-19 dataset, e.g. arXiv, bioRxiv, Elsevier, Medline, PMC, WHO, Wiley
license       string             License of the article, from CORD-19
publish_time  date (mm/dd/yyyy)  Publication date of the article, from CORD-19
journal       string             Journal short name, from CORD-19 (e.g. PLoS Compu Biol)
url           array              URL(s) of article, from CORD-19
software      array              Software mentions extracted from article full-text
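
As a rough illustration of how one record could be converted to typed values according to the schema above, here is a minimal Python sketch. It assumes that each record arrives as a dictionary of strings (for example from csv.DictReader) and that array columns are serialized as Python-style list literals, as the ['scikit', 'learn'] example in the notes suggests. Neither assumption is confirmed by this page. The parse_row helper and the sample record are hypothetical placeholders, not data from the dataset.

```python
import ast
from datetime import date, datetime
from typing import Any, Dict

# Columns the schema lists with type "array".
ARRAY_COLUMNS = ("source_x", "url", "software")

def parse_row(row: Dict[str, str]) -> Dict[str, Any]:
    """Convert one record of strings into typed values per the schema."""
    parsed: Dict[str, Any] = dict(row)
    for col in ARRAY_COLUMNS:
        raw = row.get(col, "").strip()
        # Assumption: arrays are stored as list literals such as "['a', 'b']".
        parsed[col] = ast.literal_eval(raw) if raw else []
    raw_date = row.get("publish_time", "").strip()
    # The schema gives the date format as mm/dd/yyyy.
    parsed["publish_time"] = (
        datetime.strptime(raw_date, "%m/%d/%Y").date() if raw_date else None
    )
    return parsed

# Made-up placeholder record, used only to exercise the function.
sample = {
    "paper_id": "0" * 40,
    "doi": "",
    "title": "Placeholder title",
    "source_x": "['arXiv']",
    "license": "placeholder-license",
    "publish_time": "11/18/2020",
    "journal": "",
    "url": "['https://arxiv.org/pdf/2011.09270v1.pdf']",
    "software": "['scikit', 'learn']",
}

record = parse_row(sample)
print(record["publish_time"], record["software"])
```

If the files turn out to serialize arrays differently (for example as JSON or as delimiter-separated strings), only the array-parsing line needs to change.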

Funding

Chan Zuckerberg Initiative