Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs
Data files
Nov 19, 2025 version files (108.24 KB total)
- llm_psychosis_numeric_ratings.csv (105.59 KB)
- README.md (2.65 KB)
Abstract
The large language model (LLM) chatbot product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. Because LLM chatbots are trained to align with user input and generate encouraging responses, they may have difficulty responding appropriately to psychotic content. To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms, we conducted a cross-sectional, experimental study of how multiple versions of the ChatGPT product respond to psychotic and control prompts, with blind clinician ratings of response appropriateness. We found that all three tested versions of ChatGPT were much more likely to generate inappropriate responses to psychotic prompts than to control prompts, with the "Free" product showing the poorest performance. In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions.
Dataset DOI: 10.5061/dryad.x0k6djj00
Description of the data and file structure
This dataset contains the numerical ratings of prompt-response pairs from our study and can be used to reproduce our analyses. Note that the literal text of the prompts and model responses is not provided here; it is available from the corresponding author on reasonable request.
Files and variables
File: llm_psychosis_numeric_ratings.csv
Description: This CSV file contains all numeric appropriateness ratings assigned to prompt-response pairs, in "long" format. The 1592 rows comprise 474 ratings from each of the two primary raters (948 rows), 474 derived consensus ratings, and 170 ratings from a secondary rater. The seven columns are described below.
Variables
- pair_id: The ID of the prompt-response pair rated. Possible values are 1 to 474 (158 prompts, 79 control and 79 psychotic, each presented to 3 versions of the ChatGPT product). This uniquely identifies a specific prompt shown to a specific product version.
- prompt_id: The ID of the prompt used to elicit the response. Possible values are 1 to 158 (158 prompts, 79 psychotic and 79 control). Each ID appears three times in the dataset, once per product version tested.
- positive_symptom_domain: The SIPS positive symptom domain reflected by the prompt used. Five possible values: unusual_thought_content_delusions, suspiciousness_persecutory_ideas, grandiose_ideas, perceptual_abnormalities_hallucinations, and disorganized communication.
- condition: Whether the prompt used featured psychotic content. Two possible values: psychosis and control.
- model: The version of the ChatGPT product to which the prompt was presented to elicit a response. Three possible values: chatgpt_free, chatgpt_4o, and chatgpt_5_auto.
- rater: The rater for the prompt-response pair. Four possible values: r1 (the secondary rater, who rated 170 of the 474 prompt-response pairs), r2 (who rated all pairs), r3 (who rated all pairs), and consensus (the floor of the median of r2 and r3 for all pairs, and the value used in our analyses; see the sketch below the variable list).
- rating: The "appropriateness" rating assigned to the prompt-response pair. Three possible values: 0 (completely appropriate), 1 (somewhat appropriate), and 2 (completely inappropriate).
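The consensus definition can be illustrated with a minimal base-R sketch; the rating values below are hypothetical and not drawn from the dataset.

```r
# Hypothetical example of the consensus definition (floor of the median of the
# two primary raters' scores); these values are illustrative, not from the data.
r2_rating <- 1
r3_rating <- 2
consensus_rating <- floor(median(c(r2_rating, r3_rating)))  # median = 1.5, floor = 1
consensus_rating
#> [1] 1
```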
Code/software
The CSV file can be read and our analyses reproduced with R or similar software.
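As a minimal sketch (not our analysis code; the file path and the particular summaries shown are illustrative assumptions), the consensus ratings can be read and cross-tabulated by condition and product version in base R:

```r
# Read the long-format ratings file (file name as distributed; adjust the path if needed).
ratings <- read.csv("llm_psychosis_numeric_ratings.csv", stringsAsFactors = FALSE)

# Keep the consensus ratings used in the analyses.
consensus <- subset(ratings, rater == "consensus")

# Cross-tabulate appropriateness ratings by condition and product version.
with(consensus, table(condition, rating, model))

# Proportion of responses rated completely inappropriate (rating == 2),
# by condition and product version.
consensus$inappropriate <- as.integer(consensus$rating == 2)
aggregate(inappropriate ~ condition + model, data = consensus, FUN = mean)
```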
Methods

We created 79 psychotic prompts: first-person statements that an individual experiencing psychosis could plausibly make to ChatGPT. Each reflected one of the five positive symptom domains assessed by the Structured Interview for Psychosis-Risk Syndromes (SIPS): unusual thought content/delusional ideas (n = 16), suspiciousness/persecutory ideas (n = 17), grandiose ideas (n = 15), perceptual disturbances/hallucinations (n = 15), and disorganized communication (n = 16). For each psychotic prompt, we created a corresponding control prompt similar in length, sentence structure, and content but without psychotic elements, yielding 158 unique prompts in total. On August 28 and 29, 2025, we presented these prompts to three versions of the ChatGPT product: GPT-5 Auto (the paid default at the time of the experiment), GPT-4o (the previous paid default), and "Free" (the version accessible without a subscription or account), yielding 474 prompt-response pairs. Two primary raters assigned an "appropriateness" rating (0 = completely appropriate response, 1 = somewhat appropriate response, 2 = completely inappropriate response) to each pair using a standardized rubric. We took the floor of the median of their ratings to produce a conservative consensus score. We assessed inter-rater reliability both between the two primary raters and between the consensus score and a secondary rater who assessed a subset (n = 170) of the prompt-response pairs.
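The sketch below shows one way to examine rater agreement from the long-format file; it uses simple percent agreement for illustration, and the specific reliability statistic reported in the study may differ.

```r
# Read the long-format ratings file (adjust the path if needed).
ratings <- read.csv("llm_psychosis_numeric_ratings.csv", stringsAsFactors = FALSE)

# Agreement between the two primary raters across all 474 pairs.
r2 <- subset(ratings, rater == "r2", select = c(pair_id, rating))
r3 <- subset(ratings, rater == "r3", select = c(pair_id, rating))
primary <- merge(r2, r3, by = "pair_id", suffixes = c("_r2", "_r3"))
mean(primary$rating_r2 == primary$rating_r3)

# Agreement between the consensus score and the secondary rater (r1)
# on the 170-pair subset that r1 rated.
r1   <- subset(ratings, rater == "r1", select = c(pair_id, rating))
cons <- subset(ratings, rater == "consensus", select = c(pair_id, rating))
shared <- merge(r1, cons, by = "pair_id", suffixes = c("_r1", "_consensus"))
mean(shared$rating_r1 == shared$rating_consensus)
```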
