The governance of health data in the AI era: A scoping review and computational topic model of the global research landscape

Published May 22, 2026 on Dryad. https://doi.org/10.5061/dryad.76hdr7t9s

Data files

May 22, 2026 version files (16.84 KB total):

JAMIA_OPEN_Appendix.xlsx (13.34 KB)
README.md (3.50 KB)

Abstract

Background: The integration of artificial intelligence (AI) into healthcare is critically dependent on vast quantities of patient data, igniting an urgent global debate on data ownership, privacy, and governance. While numerous perspectives exist, the empirical structure and evolution of this scholarly discourse remain uncharacterized. We aimed to systematically map the conceptual landscape of research on AI and health data governance to identify its core themes, temporal trends, and key focus areas.

Methods: We conducted a scoping review according to PRISMA guidelines, searching PubMed, Scopus, and Web of Science for peer-reviewed articles published between Jan 1, 2018, and May 31, 2025. We performed a descriptive analysis of publication trends. Using Latent Dirichlet Allocation (LDA), we applied computational topic modelling to the abstracts, which serve as concise summaries of each article's core contributions, to identify latent thematic structures. Topic trends were analyzed using linear regression.

Findings: 43 articles met the inclusion criteria. The volume of publications has increased substantially since 2018. Our LDA analysis identified five distinct research topics: (1) AI Applications & Ownership, (2) AI Models & Data Privacy, (3) Data Sharing Platforms & Technology, (4) Ethical & Legal Concerns, and (5) AI Development & Implementation. Over the study period, research on "Ethical & Legal Concerns" showed a statistically significant increasing trend in prevalence (slope=0.023, p=0.008), becoming the dominant topic in recent years.

Interpretation: The scholarly discourse on AI and health data has matured, shifting from foundational questions of technical implementation towards a dominant focus on complex ethical and legal challenges. This data-driven evidence signals an urgent need for clinical leaders and policymakers to move beyond theoretical discussions and implement robust, practical governance frameworks. Failure to address this governance gap risks impeding trustworthy AI innovation and eroding public trust, thereby limiting the potential of AI to improve patient outcomes equitably.

This dataset contains the literature repository used for a scoping review and computational topic modeling analysis of health data governance. The collection spans peer-reviewed publications examining the intersection of artificial intelligence, biomedical data stewardship, privacy regulations, and ethical frameworks. The data files record the primary bibliometric details of the analyzed corpus, which served as the source text for generating latent Dirichlet allocation topic distributions, tracking publication trajectories over time, and mapping global research networks.

Description of the data and file structure

The repository consists of the main text manuscript, four high-resolution analytic figures, and an index spreadsheet containing metadata records for the foundational studies identified during the scoping review workflow. This collection serves as the core package for replicating the textual analyses and exploring the included research landscape.

File list

File: JAMIA_OPEN_Appendix.xlsx

Description: Spreadsheet listing the 43 publications included in the review.

Variables and fields:

id: A unique sequential integer assigned to each identified study within the literature repository, serving as the primary index key.
authors: The full list of contributing researchers or institutional groups credited with publishing the corresponding study.
year: The specific calendar year in which the work was officially published, enabling temporal trend mapping.
title: The complete bibliographic title of the research paper, textbook chapter, or position paper selected for the thematic analysis.

Missing data codes and abbreviations

No values are missing from the spreadsheet. The columns use no specialized abbreviations or acronyms, so no separate codebook is needed.
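The field descriptions above can be turned into a quick consistency check. The following is a minimal sketch, assuming the four documented columns sit in the first worksheet under exactly those header names and that publication years fall inside the 2018 to 2025 search window stated in the abstract. The function names are hypothetical and not part of the dataset, and reading .xlsx files with pandas additionally requires openpyxl.

```python
import pandas as pd

# Column names as documented in the "Variables and fields" list.
EXPECTED_COLUMNS = ["id", "authors", "year", "title"]


def load_appendix(path="JAMIA_OPEN_Appendix.xlsx"):
    """Read the first worksheet (assumed to hold the records)."""
    return pd.read_excel(path)


def check_appendix(df, first_year=2018, last_year=2025):
    """Return a list of problems; an empty list means every check passed."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        # Without the documented columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]
    problems = []
    if df[EXPECTED_COLUMNS].isna().any().any():
        problems.append("empty cells found")
    if df["id"].duplicated().any():
        problems.append("duplicate id values")
    if not df["year"].between(first_year, last_year).all():
        problems.append("year outside the search window")
    return problems


def publications_per_year(df):
    """Count records per publication year, in chronological order."""
    return df["year"].value_counts().sort_index()
```

Calling `check_appendix(load_appendix())` in the folder that holds the spreadsheet should return an empty list if the file matches this description, and `publications_per_year` gives the counts behind a publication-trend chart.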

Sharing/Access information

The data was gathered, extracted, and structured during the comprehensive literature search and screening phases of the scoping review.

Links to other publicly accessible locations of the data:

https://doi.org/10.5061/dryad.76hdr7t9s

Data was derived from the following sources:

Academic literature indexes: PubMed, Scopus, and Web of Science.

Code/Software

Text processing, latent Dirichlet allocation topic modeling, and visualization of research trends were implemented in Python and R.

Software environment and packages

Python version 3.10 or higher with packages:
- pandas for data structure management
- numpy for matrix calculations
- gensim for text analysis and latent Dirichlet allocation modeling
- nltk for tokenization and text preprocessing

R version 4.2 or higher for supplementary bibliometric charting and network visualizations.

All spreadsheet files can be viewed and manipulated using standard spreadsheet software such as Microsoft Excel, Google Sheets, or open-source equivalents. Images can be viewed via any standard image viewer tool.