Mining the health disparities and minority health bibliome: A computational scoping review and gap analysis of 200,000+ articles
Data files
Jan 22, 2024 version files 515.86 MB
- hdmh_abstracts.csv
- README.md
- text.out
Abstract
Without comprehensive examination of available literature on health disparities and minority health (HDMH), the field is left vulnerable to disproportionately focus on specific populations or conditions, curtailing our ability to fully advance health equity. Using scalable open-source methods, we conducted a computational scoping review of more than 200,000 articles to investigate major populations, conditions, and themes in the literature as well as notable gaps. We also compared trends in studied conditions to their relative prevalence in the general population using insurance claims (42 million Americans). HDMH publications represent 1% of articles in MEDLINE. Most studies are observational in nature, though randomized trial reporting has increased five-fold in the last twenty years. Half of all HDMH articles concentrate on only three disease groups (cancer, mental health, endocrine/metabolic disorders), while hearing, vision, and skin-related conditions are among the least well-represented despite substantial prevalence. To support further investigation, we also present HDMH Monitor, an interactive dashboard and repository generated from the HDMH bibliome.
README
This is a repository to accompany the publication, Mining the Health Disparities and Minority Health Bibliome: A Computational Scoping Review and Gap Analysis of 200,000+ Articles, in Science Advances (doi: 10.1126/sciadv.adf9033) and its associated website, HDMH Monitor.
General Information
Dataset:
Mining the health disparities and minority health bibliome: A computational scoping review and gap analysis of 200,000+ articles [Dataset]. Dryad. https://doi.org/10.5061/dryad.vhhmgqp10
Corresponding Author Information
Name: Harry Reyes Nieva
Institution: Columbia University
Address: New York, NY, USA
Email: harry.reyes@columbia.edu
Date of data collection:
25 August 2021
Funding sources:
This work was supported by the National Library of Medicine (NLM) under Award Numbers T15LM007079 and R01LM013043, and by a Computational and Data Science Fellowship to Harry Reyes Nieva from the Association for Computing Machinery Special Interest Group in High Performance Computing (ACM SIGHPC).
Description of the data and file structure
DATA & FILE OVERVIEW
File List:
- hdmh_abstracts.csv
- text.out
DATA-SPECIFIC INFORMATION FOR: hdmh_abstracts.csv
The dataset was collected by querying the MEDLINE application programming interface (API). For each article, we extracted the PubMed reference number (PMID) and associated metadata using the open-source Entrez package of the Biopython Python library. We then used latent Dirichlet allocation (LDA) to model 50 topics from concatenated article title and abstract text.
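As a rough illustration of how the per-article topic fields relate to a topic model's output, the sketch below concatenates a title and abstract into one document and picks the highest-weighted topic from a topic distribution. It is not the authors' pipeline: the function names, the example text, and the four-topic distribution are hypothetical, and whether the dataset numbers topics from 0 or 1, or stores the contribution as a fraction or a percentage, is not stated here and should be checked against the file.

```python
from typing import Sequence, Tuple


def build_document(title: str, abstract: str) -> str:
    """Join title and abstract into one text, skipping empty parts."""
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    return " ".join(parts)


def dominant_topic(weights: Sequence[float]) -> Tuple[int, float]:
    """Return (0-based topic index, share of total weight) of the top topic."""
    if not weights:
        raise ValueError("weights must not be empty")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    best = max(range(len(weights)), key=lambda i: weights[i])
    return best, weights[best] / total


# Hypothetical example: 4 topics for brevity (the dataset uses 50).
doc = build_document("Asthma disparities in urban children",
                     "We examined emergency visits across neighborhoods.")
topic, share = dominant_topic([0.10, 0.60, 0.25, 0.05])
print(doc)
print(topic, round(share, 2))
```

In a real workflow the weights would come from the fitted LDA model's per-document topic distribution rather than a hand-written list.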
Variable List:
- Index
- PMID: PubMed reference number
- Year: Year of publication
- Article_Type: Publication type (e.g., journal article)
- Language: Language of article
- Title: Article title
- Abstract: Article abstract
- MeSH: Medical Subject Headings of article
- Dominant_Topic: Dominant topic number among 50 possible topics generated by the topic model
- Topic_Perc_Contrib: Percent contribution of the dominant topic of the article among 50 possible topics
- Topic_Name_No: Dominant topic name (i.e., assigned label)
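For readers who want a quick look at the topic labels, the following minimal sketch counts articles per dominant topic using only the Python standard library. It assumes the CSV header uses the variable names listed above (in particular `Topic_Name_No`); the function name and the in-memory sample rows are hypothetical stand-ins, not data from the file.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_dominant_topics(handle: TextIO) -> Counter:
    """Count rows per dominant topic label in a CSV with a header row."""
    counts: Counter = Counter()
    for row in csv.DictReader(handle):
        # Column name assumed to match the variable list; blank labels are skipped.
        label = (row.get("Topic_Name_No") or "").strip()
        if label:
            counts[label] += 1
    return counts


# Tiny made-up sample standing in for hdmh_abstracts.csv.
sample = io.StringIO(
    "PMID,Year,Topic_Name_No\n"
    "1,2019,Topic A\n"
    "2,2020,Topic B\n"
    "3,2020,Topic A\n"
)
sample_counts = count_dominant_topics(sample)
print(sample_counts.most_common())
```

To run this against the real file, open it with `open("hdmh_abstracts.csv", newline="", encoding="utf-8")` and pass the handle to `count_dominant_topics`. The UTF-8 encoding is an assumption, and very long abstracts may require raising the limit with `csv.field_size_limit`.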
DATA-SPECIFIC INFORMATION FOR: text.out
The text.out file is the output of Batch MetaMap, an open-source biomedical literature tool designed and maintained by the National Library of Medicine that uses natural language processing to recognize Unified Medical Language System (UMLS) concepts. We ran the tool on the concatenated article title and abstract text of the articles in hdmh_abstracts.csv.
A detailed explanation of the fielded MetaMap Indexing (MMI) output is available at: https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/MMI_Output_2016.pdf. Last accessed: 20 Jan 2024.
Fields:
- PubMed ID (PMID)
- MetaMap Indexing (MMI)
- MMI Score
- UMLS Concept Preferred Name
- UMLS Concept Unique Identifier (CUI)
- Semantic Type List
- Trigger Information
- Location
- Positional Information
- Treecode(s)
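The fields above can be read with a small parser such as the sketch below. It assumes the layout described in the MMI documentation: one record per line, fields separated by `|` in the order listed, semantic types as a bracketed comma-separated list, and treecodes separated by semicolons. The key names, the function name, and the sample line (including its identifier and concept values) are made up for illustration and do not come from text.out.

```python
from typing import Dict, Optional

# Hypothetical key names, in the field order listed above.
MMI_FIELDS = [
    "pmid", "record_type", "score", "preferred_name", "cui",
    "semantic_types", "trigger_info", "location",
    "positional_info", "treecodes",
]


def parse_mmi_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one pipe-delimited MMI line; return None if it does not fit."""
    parts = line.rstrip("\r\n").split("|")
    # Skip lines that are not MMI records or do not split into ten fields
    # (for example, if a field itself contains a pipe character).
    if len(parts) != len(MMI_FIELDS) or parts[1] != "MMI":
        return None
    record: Dict[str, object] = dict(zip(MMI_FIELDS, parts))
    record["score"] = float(parts[2])
    record["semantic_types"] = [s for s in parts[5].strip("[]").split(",") if s]
    record["treecodes"] = [t for t in parts[9].split(";") if t]
    return record


# Made-up line for illustration only.
example = ('12345678|MMI|10.50|Example Concept|C0000000|[dsyn]|'
           '["Example"-ti-1-"example"-noun-0]|TI|0/7|X01.001;X02.002')
parsed = parse_mmi_line(example)
print(parsed)
```

Returning `None` for lines that do not fit keeps the parser conservative; depending on how text.out was produced, such lines may need separate handling rather than being skipped.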
Sharing/Access information
Links to other publicly accessible locations of the data:
Data was derived from the following sources:
- MEDLINE