CORD-19 Software Mentions
Data files
CORD19_software_mentions.csv (31.88 MB), Mar 04, 2021 version
Abstract
To automate the identification and analysis of software use in biomedical research, we have developed a SciBERT-based machine learning model that extracts mentions of software from scientific articles. The input to this model is the full text of a scientific article, and the output is a list of the software mentioned in it. We applied this model to the CORD-19 full-text articles and stored the output in this dataset, which includes metadata for over 77,000 COVID-19 and coronavirus-related papers and a list of the software tools mentioned in each.
Methods
We have developed a machine learning model to extract mentions of software from scientific articles. The SoftCite dataset was used to train and evaluate the model. This model has been applied to the CORD-19 collection of full-text coronavirus-related research papers. This dataset comprises the output of this model and each scientific article's relevant metadata.
Data are derived from the CORD-19 dataset provided by AllenAI, release version 2021-02-08 (changelog cord-19_2021-02-08.tar.gz 7.4GB c5446fea 29f69de2) downloaded from AWS on 08-Feb-2021.
Usage notes
Notes:
- Not all papers in the CORD-19 dataset mention software. We include here only the subset of articles for which full text was available and in which at least one software mention was detected.
- Software names have not been normalized, nor have they been resolved to any external dictionary. For example, the list of software mentions includes “excel”, “microsoft excel”, “ms excel”, and “office excel”.
- The dataset contains DOIs for each mentioning paper where available (96% of papers). External identifiers (such as PubMed Central IDs, PubMed PMIDs, and arXiv IDs) for the remaining papers can often be imputed from the paper URLs, e.g. the arXiv ID for the paper with the URL "https://arxiv.org/pdf/2011.09270v1.pdf" is "2011.09270".
- In some cases, a single software mention is incorrectly split into multiple tokens, e.g. ['scikit', 'learn'].
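The notes on identifier imputation and un-normalized names can be illustrated with a short sketch. It is not part of the dataset or of the extraction pipeline: the function names, the regular expression, and the `ALIASES` table are hypothetical. The pattern covers only new-style arXiv IDs such as the one quoted above, and the alias table lists only the four Excel variants mentioned in the notes.

```python
import re

# Assumption: new-style arXiv IDs (YYMM.NNNNN) appearing in /abs/ or /pdf/ URLs.
# Old-style IDs and other URL layouts are not handled by this sketch.
_ARXIV_URL = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


def arxiv_id_from_url(url):
    """Return the arXiv ID embedded in a URL, without its version suffix, or None."""
    match = _ARXIV_URL.search(url)
    return match.group(1) if match else None


# Hypothetical alias table: maps name variants to one canonical spelling.
# A real table would have to be curated by the user of the dataset.
ALIASES = {
    "microsoft excel": "excel",
    "ms excel": "excel",
    "office excel": "excel",
}


def normalize_name(name):
    """Lowercase a mention, collapse whitespace, and apply the alias table."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, key)


print(arxiv_id_from_url("https://arxiv.org/pdf/2011.09270v1.pdf"))
print(normalize_name("Microsoft Excel"))
```

A lookup table like this does not repair mentions that were split into separate tokens, such as ['scikit', 'learn']; those would need a separate rule.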
Schema:
Column | Type | Description |
paper_id | string | ID of paper from CORD-19 dataset. 40-character sha1 of the PDF |
doi | string | Digital Object Identifier of the article, from CORD-19 |
title | string | Title of the article, from CORD-19 |
source_x | array | Provenance of the article from CORD-19 dataset, e.g. arXiv, bioRxiv, Elsevier, Medline, PMC, WHO, Wiley |
license | string | License of the article, from CORD-19 |
publish_time | date (mm/dd/yyyy) | Publication date of the article, from CORD-19 |
journal | string | Journal short name, from CORD-19 (e.g. PLoS Compu Biol) |
url | array | URL(s) of article, from CORD-19 |
software | array | Software mentions extracted from article full-text |
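
As a minimal sketch of reading a file with this schema, the Python below parses the three array-typed columns into lists. The on-disk encoding of array cells is not documented here. The sketch assumes they are written as Python-style list literals, like the ['scikit', 'learn'] example in the notes, and falls back to a one-element list otherwise. The helper names `parse_array` and `read_mentions` are hypothetical, and the sample row is made up for illustration.

```python
import ast
import csv
import io

# Columns the schema declares as arrays.
ARRAY_COLUMNS = ("source_x", "url", "software")


def parse_array(cell):
    """Parse one array cell into a list of strings.

    Assumption: arrays are serialized as Python-style list literals.
    Anything that does not parse is kept as a single-element list.
    """
    cell = cell.strip()
    if not cell:
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


def read_mentions(handle):
    """Read rows from an open CSV handle, converting array columns to lists."""
    rows = []
    for row in csv.DictReader(handle):
        for column in ARRAY_COLUMNS:
            if row.get(column) is not None:
                row[column] = parse_array(row[column])
        rows.append(row)
    return rows


# Made-up sample row standing in for the real file.
sample = io.StringIO(
    "paper_id,doi,software\n"
    "abc123,10.1000/example,\"['scikit', 'learn', 'excel']\"\n"
)
rows = read_mentions(sample)
print(rows[0]["software"])
```

For the real file, the same function could be called on `open("CORD19_software_mentions.csv", newline="", encoding="utf-8")`. The UTF-8 encoding is also an assumption.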