CZ Software Mentions: A large dataset of software mentions in the biomedical literature - Expanded 2024
Data files
Nov 12, 2024 version files 9.86 GB
- mentions.zip: 9.86 GB
- README.md: 11.41 KB
Abstract
We release a dataset of software mentions in open access biomedical papers published in bioRxiv, medRxiv, or stored in euroPMC. The mentions are extracted with a trained BERT model. The dataset provides sources, context, metadata, software, and links.
This is a continuation of our previous dataset, https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c, based on an expanded set of papers.
Authors
Ana-Maria Istrate,
James Bartolome,
Fabrizio Castrotorres,
Ellaine Chou,
Donghui Li,
Dario Taraborelli,
Michaela Torkar,
Boris Veytsman,
Ivana Williams.
Chan Zuckerberg Initiative, https://chanzuckerberg.com
Summary
We release a dataset of software mentions in open access biomedical
papers published in bioRxiv, medRxiv, or stored in euroPMC. The
mentions are extracted with a trained BERT model. The dataset provides
sources, context, metadata, software, and links.
Code of Conduct
This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.
Reporting Security Issues
If you believe you have found a security issue, please responsibly disclose by contacting us at security@chanzuckerberg.com.
Introduction
The previous edition of this large software mentions dataset, CZ Software Mentions: A large dataset of software mentions in the biomedical literature, was published in 2022. It covered 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 4 million papers in the Publishers’ collection.
This edition adds mentions from open access papers from the following sources:
Source | Last updated | Number of unique DOIs |
---|---|---|
bioRxiv | 2023/08/23 | 198247 |
medRxiv | 2023/08/24 | 43404 |
euroPMC | 2023/06/12 | 5289607 |
External documentation
The full documentation is published as a separate preprint at https://arxiv.org/abs/2209.00693.
Full code can be found at https://github.com/chanzuckerberg/software-mentions.
Zenodo ID for code is at https://zenodo.org/record/7041594#.Yxd2guzMI0Q
Extraction, linking and disambiguation
We run a BERT-based NER model on our corpus to extract plain-text software mentions. The model was trained on the SoftCite dataset, uses the SciBERT model architecture and has an F1 score of 0.922. More details about this model can be found at https://github.com/chanzuckerberg/software-mention-extraction. We performed linking and disambiguation following the procedures discussed in our paper, https://arxiv.org/abs/2209.00693.
Dataset Description
Files in the dataset with the extension tsv are tab separated, files
with the extension csv are comma separated, files with the suffix pkl
are Python serialized (pickle) objects, and files with the suffix gz
are gzipped. Note that tab separated files may contain embedded quotes,
which have no special meaning, while in comma separated files quotes
do have meaning, in line with the usual conventions.
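The quoting difference between the two formats matters when parsing. The following is a minimal sketch using only the Python standard library; the function name `read_rows` and the file names passed to it are illustrative and not part of the dataset.

```python
import csv
import gzip
import io


def read_rows(raw_bytes, file_name):
    """Parse gzipped delimited bytes, choosing the dialect from the file name."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    buffer = io.StringIO(text, newline="")
    if file_name.endswith(".tsv.gz"):
        # Tab separated: quotes are ordinary characters, so quote handling is off.
        reader = csv.reader(buffer, delimiter="\t", quoting=csv.QUOTE_NONE)
    elif file_name.endswith(".csv.gz"):
        # Comma separated: quotes protect embedded commas (the csv module default).
        reader = csv.reader(buffer)
    else:
        raise ValueError(f"unsupported file type: {file_name}")
    return list(reader)


# Tiny in-memory stand-ins for real files (hypothetical content).
tsv_bytes = gzip.compress('s1\t"R" was used\n'.encode("utf-8"))
csv_bytes = gzip.compress('s1,"R, version 4"\n'.encode("utf-8"))

tsv_rows = read_rows(tsv_bytes, "example.tsv.gz")
csv_rows = read_rows(csv_bytes, "example.csv.gz")
```

For a file on disk, `gzip.open(path, "rt", encoding="utf-8", newline="")` can be passed to `csv.reader` directly with the same dialect settings.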
model_output
The files in the model_output directory contain the results of the NER model.
The fields are:
- sentence_id: a unique id for the sentence
- doi: paper DOI
- pmc_id: paper PMCID, if any
- sentence: sentence from which the software mention is extracted
- source: paper source (europe_pmc, biorxiv, medrxiv)
- software: the software extracted
- mention_id: the unique id for the mention
Disambiguation files
This directory contains the results of disambiguation. The subdirectory synonyms_files contains the following files:
- pypi_synonyms.pkl: Python dictionary mapping a PyPI package to an array of synonyms generated through the Keywords Synonym Generation process
- cran_synonyms.pkl: Python dictionary mapping a CRAN package to an array of synonyms generated through the Keywords Synonym Generation process
- bioconductor_synonyms.pkl: Python dictionary mapping a Bioconductor package to an array of synonyms generated through the Keywords Synonym Generation process
- scicrunch_synonyms.pkl: Python dictionary mapping a mention found in SciCrunch to its synonyms retrieved through the SciCrunch API
- extra_scicrunch_synonyms.pkl: Python dictionary mapping a mention found in SciCrunch to its synonyms retrieved by parsing the corresponding URL in SciCrunch
- string_similarity_synonyms.pkl: Python dictionary mapping a mention found in the comm_IDs.tsv.gz corpus to its synonyms retrieved through the Jaro Winkler algorithm, together with the corresponding confidences. Only pairs of synonyms with a similarity confidence of >=0.9 are kept. The file has the following format:
  - software_mention - mention synonyms are computed for
  - (\[synonyms\], \[synonyms_confidences\]) - tuple containing two arrays:
    - synonyms: list of synonyms for software_mention
    - synonyms_confidences: Jaro Winkler similarity scores between synonyms and software_mention, as given by the textdistance python package
- synonyms_NN.csv.gz: comma separated gzipped files used as input for the clustering algorithm, after post-processing; the files are built by combining information from pypi_synonyms.pkl, cran_synonyms.pkl, bioconductor_synonyms.pkl, scicrunch_synonyms.pkl, extra_scicrunch_synonyms.pkl and string_similarity_synonyms.pkl and performing additional clean-up. Each file has the following format:
- **software_mention** software_mention to compute synonyms for
- **synonyms** a synonym for software_mention
- **confidences** confidence for this synonym pair
- **source** source for this synonym pair
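To illustrate how the dictionary layout of string_similarity_synonyms.pkl relates to the row layout of the synonyms_NN.csv.gz files, here is a minimal sketch. It is not the pipeline's actual post-processing: the function name, the `source` label and the sample values are assumptions made for illustration.

```python
def synonym_rows(similarity_synonyms, source="string_similarity", min_confidence=0.9):
    """Flatten {mention: ([synonyms], [confidences])} into one row per synonym pair.

    The `source` label is a hypothetical value; the labels used in the
    published files may differ.
    """
    rows = []
    for mention, (synonyms, confidences) in similarity_synonyms.items():
        for synonym, confidence in zip(synonyms, confidences):
            # Keep only pairs at or above the similarity threshold.
            if confidence >= min_confidence:
                rows.append(
                    {
                        "software_mention": mention,
                        "synonyms": synonym,
                        "confidences": confidence,
                        "source": source,
                    }
                )
    return rows


# Made-up sample in the documented layout; a real file would be loaded with
# pickle.load(open(path, "rb")) after extracting mentions.zip.
sample = {"scikit-learn": (["scikit learn", "scikitlearn"], [0.97, 0.85])}
flat = synonym_rows(sample)
```

Pickle files can execute code when loaded, so they should only be unpickled when obtained from a trusted source.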
The subdirectory final_output contains the disambiguated data in comma separated gzipped format. The fields are:
- sentence_id: the id of the sentence
- doi: paper DOI
- pmc_id: paper PMCID, if any
- sentence: the sentence from which the software mention is extracted
- source: paper source (europe_pmc, biorxiv, medrxiv)
- software: the software extracted
- mention_id: the unique id for the mention
- disambiguated_software: the disambiguated software name
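As an example of working with these fields, the sketch below counts the number of distinct papers (by DOI) in which each disambiguated software name appears. The sample rows are invented for illustration and the function name is an assumption; only the column names come from the list above.

```python
import csv
import io
from collections import defaultdict


def papers_per_software(rows):
    """Count distinct DOIs per disambiguated software name."""
    dois = defaultdict(set)
    for row in rows:
        dois[row["disambiguated_software"]].add(row["doi"])
    return {name: len(seen) for name, seen in dois.items()}


# Invented rows using the documented column names.
SAMPLE_CSV = (
    "sentence_id,doi,pmc_id,sentence,source,software,mention_id,disambiguated_software\n"
    's1,10.1/a,,"We used ImageJ, version 1.5",biorxiv,ImageJ,m1,ImageJ\n'
    "s2,10.1/a,,Images were processed in imagej,biorxiv,imagej,m2,ImageJ\n"
    "s3,10.1/b,PMC1,ImageJ was used,europe_pmc,ImageJ,m3,ImageJ\n"
    "s4,10.1/b,PMC1,Statistics were done in R,europe_pmc,R,m4,R\n"
)

counts = papers_per_software(csv.DictReader(io.StringIO(SAMPLE_CSV, newline="")))
```

Because the function accepts any iterable of dictionaries, the same call should work on a `csv.DictReader` wrapped around a file opened with `gzip.open(path, "rt", newline="")`, without loading the whole file into memory.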
Intermediate Files
The directory intermediate_files contains the following files:
mention2ID.pkl and freq_dict.pkl.
The file mention2ID.pkl is a mention to ID mapping connecting all the
plain-text software mentions extracted by the NER algorithm to a
unique ID. The file is in the pickle format and the data is stored as
a Python dictionary. The keys are plain-text software mentions across
all three datasets, and the values are unique IDs across all three
corpora.
- mention: plain-text software mention
- ID: unique ID for the plain-text software mention

The file freq_dict.pkl is a mention to frequency mapping connecting
all plain-text software mentions extracted by the NER algorithm. We
define frequency as the total number of unique papers in the dataset
in which a mention appears. The file is in the pickle format and the
data is stored as a Python dictionary. The keys are plain-text
software mentions, and the values are the corresponding frequencies.
- mention: plain-text software mention
- frequency: total number of unique papers in the corpus the mention appears in
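The two mappings can be pictured with the following sketch, which rebuilds dictionaries of the same shape from (DOI, mention) pairs and round-trips one through pickle. The function names, the sample pairs and the use of consecutive integers as IDs are assumptions; the ID scheme in the published mention2ID.pkl is not specified here.

```python
import pickle
from collections import defaultdict


def build_frequency_dict(doi_mention_pairs):
    """Map each mention to the number of unique papers (DOIs) it appears in."""
    papers = defaultdict(set)
    for doi, mention in doi_mention_pairs:
        papers[mention].add(doi)
    return {mention: len(dois) for mention, dois in papers.items()}


def build_mention_ids(mentions):
    """Assign one ID per distinct mention (consecutive integers, as an assumption)."""
    return {mention: index for index, mention in enumerate(sorted(set(mentions)))}


# Invented pairs: repeated mentions in the same paper count once.
pairs = [
    ("10.1/a", "BLAST"),
    ("10.1/a", "BLAST"),
    ("10.1/b", "BLAST"),
    ("10.1/b", "blastn"),
]

freq_dict = build_frequency_dict(pairs)
mention2id = build_mention_ids(mention for _, mention in pairs)

# Round trip through the pickle format, as used for the .pkl files.
restored = pickle.loads(pickle.dumps(freq_dict))
```

The real files would be read with `pickle.load` on a file opened in binary mode; as with any pickle, this should only be done for files from a trusted source.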
Linking Results
The linked directory contains the directories normalized and raw, and
the file metadata.tsv.gz.
Raw Metadata Files
The raw directory contains raw metadata files obtained by querying
the PyPI, CRAN, Bioconductor, SciCrunch and GitHub APIs on mentions
extracted by the NER algorithm. The directory contains the following
comma separated files:
- bioconductor_raw_df.csv.gz: raw metadata file obtained by querying Bioconductor on mentions extracted from comm.tsv.gz. Has fields:
  - Bioconductor Package
  - BioConductor Link
  - Title
  - Maintainer
- cran_raw_df.csv.gz: raw metadata file obtained by querying CRAN on mentions extracted from comm.tsv.gz. Has fields:
  - CRAN Package
  - CRAN Link
  - Title
- github_raw_df.csv.gz: raw metadata file obtained by querying GitHub on mentions extracted from comm.tsv.gz. Has fields:
  - software_id
  - software_name
  - github__repo_id
  - github_url
  - description
  - created_at
  - stars
  - issues
- pypi_raw_df.csv.gz: raw metadata file obtained by querying PyPI on mentions extracted from comm.tsv.gz. Has fields:
  - pypi package
  - pypi_url
- scicrunch_raw_df.csv.gz: raw metadata file obtained by querying SciCrunch on mentions extracted from comm.tsv.gz. Has fields:
  - software_name
  - scicrunch_synonyms
  - Resource Name
  - Resource Name Link
  - Description
  - Keywords
  - Resource ID
  - Resource ID Link
  - Proper Citation
  - Parent Organization
  - Parent Organization Link
  - Related Condition
  - Funding Agency
  - Relation
  - Reference
  - Website Status
  - Alternate IDs
  - Alternate URLs
  - Old URLs
  - Reference Link
Normalized Metadata Files
The normalized directory contains normalized versions of the raw
metadata files. Files are normalized to a common schema. The directory
contains the following comma separated files: bioconductor_df.csv.gz,
cran_df.csv.gz, github_df.csv.gz, pypi_df.csv.gz, scicrunch_df.csv.gz.
Master Metadata File
The metadata.tsv.gz file is a concatenation of all the metadata files
in the normalized directory. Each file in the normalized directory,
as well as the metadata.tsv.gz file, has the following fields:
- ID: a unique identifier for each software
- software_mention: the canonical string for the given software mention
- mapped_to: list of values to which the software was mapped
- source: mapping source (PyPI, CRAN, SciCrunch, GitHub, Bioconductor)
- platform: list of platforms for the given software
- package_url: URL for the given package
- description: list of descriptions associated with software in the database
- homepage_url: list of homepages for the software
- other_urls: list of other URLs for the given software mined from the database
- license: list of licenses under which the software is released
- github_repo: list of GitHub repositories for the software
- github_repo_licenses: list of licenses listed on the GitHub repositories
- exact_match: True if an exact string match was found for the given software mention, False if a fuzzy match was used instead
- RRID: RRID for the software retrieved from SciCrunch
- reference: journal articles linked to the software, identified by DOI, PMID or RRID
- scicrunch_synonyms: synonyms for software according to SciCrunch
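As a usage illustration, the sketch below selects metadata rows that were linked through an exact string match, optionally restricted to one mapping source. It assumes that exact_match is serialized as the strings True and False, which should be checked against the actual file; the sample rows, the IDs and the function name are invented, and only a subset of the columns is shown.

```python
import csv
import io


def exact_matches(rows, source=None):
    """Return rows with exact_match == "True", optionally for a single source."""
    selected = []
    for row in rows:
        # Assumption: booleans are stored as the text "True" / "False".
        if row["exact_match"] != "True":
            continue
        if source is not None and row["source"] != source:
            continue
        selected.append(row)
    return selected


# Invented tab separated sample with a subset of the documented columns.
header = ["ID", "software_mention", "source", "package_url", "exact_match"]
records = [
    ["SM1", "numpy", "PyPI", "https://pypi.org/project/numpy", "True"],
    ["SM2", "ggplot2", "CRAN", "https://cran.r-project.org/package=ggplot2", "True"],
    ["SM3", "numpi", "PyPI", "https://pypi.org/project/numpy", "False"],
]
sample_tsv = "\n".join("\t".join(fields) for fields in [header] + records) + "\n"

# Tab separated files carry no quoting semantics, hence QUOTE_NONE.
reader = csv.DictReader(
    io.StringIO(sample_tsv, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
)
rows = list(reader)

all_exact = [row["software_mention"] for row in exact_matches(rows)]
pypi_exact = [row["software_mention"] for row in exact_matches(rows, source="PyPI")]
```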