RadCases evaluation results: Evaluating acute image ordering for real-world patient cases via language model alignment with radiological guidelines
Data files (Jul 23, 2025 version; 21.16 MB total):
- radGPT-LLM-Evaluation.zip (21.15 MB)
- README.md (3.15 KB)
Abstract
Background: Diagnostic imaging studies are increasingly important in the management of acutely presenting patients. However, ordering appropriate imaging studies in the emergency department is a challenging task with a high degree of variability between healthcare providers. To address this issue, recent work has investigated whether generative AI and large language models can be leveraged to recommend diagnostic imaging studies in accordance with evidence-based medical guidelines. Nonetheless, it remains difficult to ensure that these tools provide recommendations that correctly align with such guidelines, especially given the limited diagnostic information available in acute care settings.
Methods: In this study, we introduce a framework that intelligently leverages language models to recommend imaging studies for patient cases in alignment with the American College of Radiology’s Appropriateness Criteria, a set of evidence-based guidelines. To power our experiments, we make available RadCases, a novel dataset of over 1,500 annotated case summaries reflecting common patient presentations, and apply our framework to enable state-of-the-art language models to reason about appropriate imaging choices.
Results: Using our framework, state-of-the-art language models achieve image-ordering accuracy on par with clinicians. Furthermore, we demonstrate that our language model-based pipeline can serve as an intelligent assistant for clinicians, supporting image ordering workflows and improving the accuracy of acute image ordering according to the American College of Radiology’s Appropriateness Criteria.
Conclusions: Our work demonstrates and validates a strategy to leverage AI-based software to improve trustworthy clinical decision making in alignment with expert evidence-based guidelines.
Dataset DOI: 10.5061/dryad.p8cz8wb0b
Description of the Data and File Structure
The ZIP dataset contains all of the raw experimental results for our main experiments. Each individual file in the dataset is a JSON Lines (JSONL) file that contains the predictions made by the language model as well as the ground-truth answers. Each line in a JSON Lines file represents one patient case evaluation. One JSON Lines file represents one experimental evaluation using one partition of the RadCases dataset, one language model, and one random seed. All raw outputs were generated using our custom source code.
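As a minimal sketch, one of these files can be inspected as follows. Note that the per-record keys used here ("prediction" and "answer") are illustrative assumptions only; check an actual file to confirm the schema.

```python
import json

# Read a single evaluation file (one JSON object per line, i.e., one
# patient case evaluation per line). The record keys "prediction" and
# "answer" below are assumed for illustration, not guaranteed.
with open("radGPT-LLM-Evaluation/Baseline/ClaudeSonnet/Panels/BIDMC/42.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), "patient case evaluations")
print(records[0].get("prediction"), records[0].get("answer"))
```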
Files and variables

File: radGPT-LLM-Evaluation.zip
The unzipped dataset is organized as follows (not all sub-directories are shown for brevity):
```
radGPT-LLM-Evaluation/
├── Baseline/
│   ├── ClaudeSonnet/
│   │   ├── Panels/
│   │   │   ├── BIDMC/
│   │   │   │   ├── 42.jsonl
│   │   │   │   ├── 43.jsonl
│   │   │   │   ├── 44.jsonl
│   │   │   │   ├── 45.jsonl
│   │   │   │   └── 46.jsonl
│   │   │   ├── JAMA/
│   │   │   ├── NEJM/
│   │   │   ├── Synthetic/
│   │   │   └── USMLE/
│   │   ├── Studies/
│   │   └── Topics/
│   ├── CommandRPlus/
│   ├── DBRXInstruct/
│   ├── GPT4oMini/
│   ├── GPT4Turbo/
│   ├── Llama3Instruct/
│   └── Mistral8x7BInstruct/
├── ChainOfThought/
├── FineTuning/
├── InContextLearning/
└── RetrievalAugmentedGeneration/
```
More explicitly, the first sublevel (i.e., `Baseline`, `ChainOfThought`, `FineTuning`, `InContextLearning`, `RetrievalAugmentedGeneration`) indicates the prompting strategy. The second sublevel (i.e., `ClaudeSonnet`, `CommandRPlus`, etc.) indicates the LLM used. The third sublevel (i.e., `Panels`, `Studies`, `Topics`) indicates the evaluation metric used (i.e., ACR AC Panel prediction accuracy, imaging study prediction accuracy, or ACR AC Topic prediction accuracy, respectively). The fourth sublevel (i.e., `BIDMC`, `JAMA`, etc.) indicates the RadCases dataset subset. The file name indicates the random seed used for the experiment.
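Because each file path encodes the prompting strategy, model, evaluation metric, dataset subset, and random seed, results can be aggregated programmatically. Below is a minimal sketch of such an aggregation; as above, the per-record keys ("prediction" and "answer") and the exact-match scoring rule are illustrative assumptions, not necessarily the scoring used in our paper.

```python
import json
from pathlib import Path

root = Path("radGPT-LLM-Evaluation")

# Walk strategy/model/metric/subset/seed.jsonl and score each file.
for path in sorted(root.glob("*/*/*/*/*.jsonl")):
    strategy, model, metric, subset = path.parts[-5:-1]
    seed = path.stem
    records = [
        json.loads(line) for line in path.read_text().splitlines() if line.strip()
    ]
    # Exact-match accuracy; the record keys below are assumed, not guaranteed.
    correct = sum(r.get("prediction") == r.get("answer") for r in records)
    accuracy = correct / len(records) if records else float("nan")
    print(f"{strategy}/{model}/{metric}/{subset} seed={seed}: {accuracy:.3f}")
```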
Code/Software
Installation
To install and run our code, first clone the `radGPT` repository.

```bash
git clone https://github.com/michael-s-yao/radGPT
cd radGPT
```
Next, create a conda environment and install the relevant dependencies. All software versions and dependencies used in our experiments are documented in the `environment.yml` file in our source code.

```bash
conda env create -f environment.yml
conda activate radgpt
```
After successful setup, you can run our code as follows:

```bash
python main.py --help
```
Contact
Questions and comments are welcome. Contact information is linked below.
This dataset was collected using custom code made publicly available at https://github.com/michael-s-yao/radGPT. No additional post-processing steps were performed.