CZ Software Mentions: A large dataset of software mentions in the biomedical literature - Expanded 2024

Istrate, Ana-Maria1 ; Bartolome, James1 ; Castrotorres, Fabrizio1 ; Chou, Ellaine1 ; Li, Donghui1 ; Taraborelli, Dario1 ; Torkar, Michaela1 ; Veytsman, Boris 1 ; Williams, Ivana1

Published Nov 12, 2024 on Dryad. https://doi.org/10.5061/dryad.zgmsbccjk

Data files

Nov 12, 2024 version files, 9.86 GB in total:

mentions.zip (9.86 GB)
README.md (11.41 KB)

Abstract

We release a dataset of software mentions in open access biomedical papers published on bioRxiv or medRxiv, or stored in Europe PMC. The mentions are extracted with a trained BERT model. The dataset provides sources, context, metadata, software, and links.

This is a continuation of our previous dataset, https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c, based on an expanded set of papers.

Authors

Ana-Maria Istrate,
James Bartolome,
Fabrizio Castrotorres,
Ellaine Chou,
Donghui Li,
Dario Taraborelli,
Michaela Torkar,
Boris Veytsman,
Ivana Williams.

Chan Zuckerberg Initiative, https://chanzuckerberg.com

Code of Conduct

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.

Reporting Security Issues

If you believe you have found a security issue, please responsibly disclose by contacting us at security@chanzuckerberg.com.

Introduction

The previous edition of this large software mentions dataset, "CZ Software Mentions: A large dataset of software mentions in the biomedical literature", was published in 2022. It covered 1.12 million unique software mention strings from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021), and 934k unique mentions from 4 million papers in the Publishers’ collection.

This edition adds mentions found in open access papers from the following sources:

Source	Last updated	Number of unique DOIs
bioRxiv	2023/08/23	198247
medRxiv	2023/08/24	43404
Europe PMC	2023/06/12	5289607

External documentation

The full documentation is published as a separate preprint at https://arxiv.org/abs/2209.00693.
Full code can be found at https://github.com/chanzuckerberg/software-mentions.
The code is also archived on Zenodo at https://zenodo.org/record/7041594#.Yxd2guzMI0Q

Extraction, linking and disambiguation

We run a BERT-based NER model on our corpus to extract plain-text software mentions. The model was trained on the SoftCite dataset, uses the SciBERT model architecture, and has an F1 score of 0.922. More details about this model can be found at https://github.com/chanzuckerberg/software-mention-extraction. We performed linking and disambiguation following the procedures discussed in our paper, https://arxiv.org/abs/2209.00693.

Dataset Description

Files in the dataset with the extension tsv are tab separated, files
with the extension csv are comma separated, files with the suffix
pkl are Python serialized (pickle) objects, and files with the suffix
gz are gzipped. Note that tab separated files may contain embedded
quotes, which have no special meaning, while in comma separated files
quotes have their usual CSV meaning.
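The quoting difference matters when parsing. The following is a minimal sketch, using only the Python standard library, of one way to honor these conventions. The helper `read_rows`, the file names, and the sample rows are hypothetical and serve only as illustration; they are not part of the dataset or its code.

```python
import csv
import gzip
import io


def read_rows(raw: bytes, name: str) -> list[dict[str, str]]:
    """Parse an in-memory file according to its suffixes (hypothetical helper)."""
    if name.endswith(".gz"):
        raw = gzip.decompress(raw)
        name = name[: -len(".gz")]
    text = io.StringIO(raw.decode("utf-8"), newline="")
    if name.endswith(".tsv"):
        # Quotes are ordinary characters in the tab separated files.
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
    elif name.endswith(".csv"):
        # Quotes follow the usual CSV conventions in the comma separated files.
        reader = csv.DictReader(text)
    else:
        raise ValueError(f"unsupported suffix: {name}")
    return list(reader)


# Hypothetical sample rows, not taken from the dataset.
tsv_bytes = gzip.compress(b'software\tsentence\nImageJ\t"ImageJ" was used\n')
csv_bytes = b'software,sentence\nImageJ,"We used ImageJ, version 1.53"\n'

tsv_rows = read_rows(tsv_bytes, "example.tsv.gz")
csv_rows = read_rows(csv_bytes, "example.csv")
```

With `csv.QUOTE_NONE` the embedded quotes in the tab separated row are kept as literal characters, while the default CSV reader strips the quotes that protect the embedded comma. The pkl files can be read with `pickle.load`; as with any pickle, they should only be loaded from a source you trust.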

model_output

The files in the model_output directory contain the results of the NER model.

The fields are:

sentence_id: a unique id for the sentence
doi: paper DOI
pmc_id: paper PMCID, if any
sentence: sentence from which the software mention is extracted
source: paper source (europe_pmc, biorxiv, medrxiv)
software: the software extracted
mention_id: the unique id for the mention

Disambiguation files

This directory contains the results of disambiguation. The subdirectory synonyms_files contains the following files:

pypi_synonyms.pkl: Python dictionary mapping a PyPI package to an array of
synonyms generated through the Keywords Synonym Generation process
cran_synonyms.pkl: Python dictionary mapping a CRAN package to an array of
synonyms generated through the Keywords Synonym Generation process
bioconductor_synonyms.pkl: Python dictionary mapping a Bioconductor package to an array
of synonyms generated through the Keywords Synonym Generation
process
scicrunch_synonyms.pkl: Python dictionary mapping a mention found in SciCrunch to its
synonyms retrieved through the SciCrunch API
extra_scicrunch_synonyms.pkl: Python dictionary mapping a mention found in SciCrunch to its
synonyms retrieved by parsing the corresponding URL in SciCrunch
string_similarity_synonyms.pkl: Python dictionary mapping a mention found in the
comm_IDs.tsv.gz corpus to its synonyms retrieved through the
Jaro-Winkler algorithm, together with the corresponding
confidences. Only pairs of synonyms with a similarity confidence
of >=0.9 are kept. The file has the following format:

-   software_mention - the mention for which synonyms are computed

-   ([synonyms], [synonyms_confidences]) - tuple containing two
    arrays:

    -   synonyms: list of synonyms for software_mention

    -   synonyms_confidences: Jaro-Winkler similarity scores between
        synonyms and software_mention, as given by the textdistance
        Python package
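As an illustration of this layout, the sketch below builds a small dictionary in the described shape, round-trips it through pickle to mimic loading the file, and selects synonyms above a stricter threshold. The entries, the confidence values, and the helper `synonyms_above` are hypothetical; real values would come from string_similarity_synonyms.pkl.

```python
import pickle

# Hypothetical toy data in the layout described above:
# software_mention -> ([synonyms], [synonyms_confidences]).
toy = {
    "scikit-learn": (["scikit learn", "scikitlearn"], [0.97, 0.93]),
    "ImageJ": (["Image J"], [0.95]),
}

# Round-trip through pickle to mimic loading the .pkl file from disk.
loaded = pickle.loads(pickle.dumps(toy))


def synonyms_above(mapping, mention, threshold=0.9):
    """Return (synonym, confidence) pairs at or above the threshold, best first."""
    synonyms, confidences = mapping.get(mention, ([], []))
    pairs = [(s, c) for s, c in zip(synonyms, confidences) if c >= threshold]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


best = synonyms_above(loaded, "scikit-learn", threshold=0.95)
```

Because the two arrays are parallel, `zip` pairs each synonym with its own score; a mention that is absent from the dictionary simply yields an empty list.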

synonyms_NN.csv.gz: comma separated gzipped files used as input for the clustering
algorithm, after post-processing. The files are built by combining
information from pypi_synonyms.pkl, cran_synonyms.pkl,
bioconductor_synonyms.pkl, scicrunch_synonyms.pkl,
extra_scicrunch_synonyms.pkl and string_similarity_synonyms.pkl
and performing additional clean-up. Each file has the following
format:

-   **software_mention** software_mention to compute synonyms for

-   **synonyms** a synonym for software_mention

-   **confidences** confidence for this synonym pair 

-   **source** source for this synonym pair
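A minimal sketch of how rows in this four-column format could be grouped by mention is given below. It assumes that the files carry a header row with exactly these column names and that confidences parse as floats; the sample rows and source labels are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows using the four columns described above.
sample = io.StringIO(
    "software_mention,synonyms,confidences,source\n"
    "ggplot2,ggplot,0.95,string_similarity\n"
    "ggplot2,GGPLOT2,1.0,cran\n"
    "ImageJ,Image J,0.95,scicrunch\n",
    newline="",
)

# Collect every (synonym, confidence, source) triple under its mention.
by_mention = defaultdict(list)
for row in csv.DictReader(sample):
    by_mention[row["software_mention"]].append(
        (row["synonyms"], float(row["confidences"]), row["source"])
    )
```

For a real file, the in-memory sample would be replaced by a handle from `gzip.open(path, "rt", newline="")`.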

The subdirectory final_output contains the disambiguated data in comma separated gzipped format. The fields are:

sentence_id: the id of the sentence
doi: Paper DOI
pmc_id: PMC ID, if any
sentence: the sentence from which the software mention is extracted
source: paper source (europe_pmc, biorxiv, medrxiv)
software: the software extracted
mention_id: the unique id for the mention
disambiguated_software: the disambiguated software name

Intermediate Files

The directory intermediate_files contains the following files:
mention2ID.pkl and freq_dict.pkl

The file mention2ID.pkl is a mention to ID mapping
connecting all the plain text software mentions extracted by the NER
algorithm to a unique ID. The file is in the pickle format and the
data is stored as a Python dictionary. The keys are plain-text
software mentions across all three corpora, and the values are IDs
that are unique across all three corpora.

mention: plain-text software mention
ID: unique ID for the plain-text software mention

The file freq_dict.pkl is a mention to frequency mapping
covering all plain text software mentions extracted by the NER
algorithm. We define frequency as the total number of unique papers
in the dataset in which a mention appears. The file is in the pickle
format and the data is stored as a Python dictionary. The keys are
plain-text software mentions, and the values are the corresponding
frequencies.

mention: plain-text software mention
frequency: total number of unique papers in the corpus the
mention appears in
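The two dictionaries share their keys, so they can be combined directly. The sketch below uses hypothetical contents (the integer ID type is an assumption, as are the mentions and counts) to invert the ID mapping and to look up the IDs of the most frequent mentions.

```python
import pickle

# Hypothetical contents; the real dictionaries are far larger,
# and the integer ID type is an assumption.
mention2id = {"ImageJ": 0, "imagej": 1, "ggplot2": 2}
freq_dict = {"ImageJ": 120, "imagej": 15, "ggplot2": 80}

# Round-trip through pickle to mimic loading the .pkl files.
mention2id = pickle.loads(pickle.dumps(mention2id))
freq_dict = pickle.loads(pickle.dumps(freq_dict))

# Invert the mapping so an ID can be resolved back to its mention.
id2mention = {mention_id: mention for mention, mention_id in mention2id.items()}

# IDs of the two most frequent mentions, most frequent first.
ranked = sorted(freq_dict.items(), key=lambda item: item[1], reverse=True)
top_ids = [mention2id[mention] for mention, _ in ranked[:2]]
```

Note that the keys are plain-text mentions before disambiguation, so "ImageJ" and "imagej" are separate entries with separate IDs and frequencies.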

Linking Results

The linked directory contains the directories normalized and raw,
and the file metadata.tsv.gz.

Raw Metadata Files

The raw directory contains raw metadata files obtained by querying
the PyPI, CRAN, Bioconductor, SciCrunch and GitHub APIs on mentions
extracted by the NER algorithm. The directory contains the following
comma separated files:

bioconductor_raw_df.csv.gz: raw metadata file obtained by querying Bioconductor on mentions
extracted from comm.tsv.gz. It has the following fields:

-   Bioconductor Package

-   BioConductor Link

-   Title

-   Maintainer

cran_raw_df.csv.gz: raw metadata file obtained by querying CRAN on mentions extracted
from comm.tsv.gz. It has the following fields:

-   CRAN Package

-   CRAN Link

-   Title

github_raw_df.csv.gz: raw metadata file obtained by querying GitHub on mentions extracted
from comm.tsv.gz. It has the following fields:

-   software_id

-   software_name

-   github__repo_id

-   github_url

-   description

-   created_at

-   stars

-   issues

pypi_raw_df.csv.gz: raw metadata file obtained by querying PyPI on mentions extracted
from comm.tsv.gz. It has the following fields:

-   pypi package

-   pypi_url

scicrunch_raw_df.csv.gz: raw metadata file obtained by querying SciCrunch on mentions
extracted from comm.tsv.gz. It has the following fields:

-   software_name

-   scicrunch_synonyms

-   Resource Name

-   Resource Name Link

-   Description

-   Keywords

-   Resource ID

-   Resource ID Link

-   Proper Citation

-   Parent Organization

-   Parent Organization Link

-   Related Condition

-   Funding Agency

-   Relation

-   Reference

-   Website Status

-   Alternate IDs

-   Alternate URLs

-   Old URLs

-   Reference Link

Normalized Metadata Files

The normalized directory contains normalized versions of the raw
metadata files. Files are normalized to a common schema. The directory
contains the following comma separated files:
bioconductor_df.csv.gz,
cran_df.csv.gz,
github_df.csv.gz,
pypi_df.csv.gz,
scicrunch_df.csv.gz.

Master Metadata File

The metadata.tsv.gz file is a concatenation of all the metadata files
in the normalized directory. Each file in the normalized directory,
as well as the metadata.tsv.gz file, has the following fields:

ID: a unique identifier for each software
software_mention: the canonical string for the given software mention
mapped_to: list of values to which the software was mapped
source: mapping source (PyPI, CRAN, SciCrunch, GitHub, Bioconductor)
platform: list of platforms for the given software
package_url: URL for the given package
description: list of descriptions associated with software in the database
homepage_url: list of homepages for the software
other_urls: list of other URLs for the given software mined from the database
license: list of licenses under which the software is released
github_repo: list of GitHub repositories for the software
github_repo_licenses: list of licenses listed on the GitHub repositories
exact_match: True if an exact string match was found for the given software mention, False if a fuzzy match was used instead
RRID: RRID for the software retrieved from SciCrunch
reference: journal articles linked to the software, identified by DOI, PMID or RRID
scicrunch_synonyms: synonyms for software according to SciCrunch
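As a hedged example of working with this schema, the sketch below keeps only exact matches and groups them by mapping source. It assumes that the file has a header row with these field names and that exact_match is serialized as the strings "True" and "False"; neither is confirmed here, and the sample rows and IDs are hypothetical.

```python
import csv
import io

# Hypothetical rows with a subset of the fields listed above.
sample = io.StringIO(
    "ID\tsoftware_mention\tsource\texact_match\n"
    "SM1\tnumpy\tPyPI\tTrue\n"
    "SM2\tnumpi\tPyPI\tFalse\n"
    "SM3\tggplot2\tCRAN\tTrue\n",
    newline="",
)

# Tab separated files treat quotes as ordinary characters.
reader = csv.DictReader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)

exact_by_source = {}
for row in reader:
    # Assumption: booleans are serialized as the strings "True" and "False".
    if row["exact_match"] == "True":
        exact_by_source.setdefault(row["source"], []).append(row["software_mention"])
```

Restricting to exact matches is one way to trade recall for precision when linking mentions to packages; fuzzy matches can be inspected separately.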
