Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs
Data files
Nov 19, 2025 version files (108.24 KB total)
- llm_psychosis_numeric_ratings.csv (105.59 KB)
- README.md (2.65 KB)
Abstract
The large language model (LLM) chatbot product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. Because LLM chatbots are trained to align with user input and generate encouraging responses, they may have difficulty responding appropriately to psychotic content. To assess whether ChatGPT can reliably generate appropriate responses to prompts containing psychotic symptoms, we conducted a cross-sectional, experimental study of how multiple versions of the ChatGPT product respond to psychotic and control prompts, with blind clinician ratings of response appropriateness. We found that all three tested versions of ChatGPT were much more likely to generate inappropriate responses to psychotic prompts than to control prompts, with the "Free" product showing the poorest performance. In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions.
Dataset DOI: 10.5061/dryad.x0k6djj00
Description of the data and file structure
This dataset contains the numerical ratings of prompt-response pairs from our study and can be used to reproduce our analyses. Note that the literal text of the prompts and model responses is not provided here; it is available from the corresponding author on reasonable request.
Files and variables
File: llm_psychosis_numeric_ratings.csv
Description: This CSV file contains all numeric appropriateness ratings assigned to prompt-response pairs, in "long" format. The 1592 rows comprise 474 ratings from each of the two primary raters (948 rows), 474 derived consensus ratings, and 170 ratings from a secondary rater. The seven columns are described below.
Variables
- pair_id: The ID of the prompt-response pair rated. Possible values are 1 to 474 (158 prompts, 79 control and 79 psychotic, each presented to 3 versions of the ChatGPT product). This uniquely identifies a specific prompt shown to a specific product version.
- prompt_id: The ID of the prompt used to elicit the response. Possible values are 1 to 158 (158 prompts, 79 psychotic and 79 control). Each ID appears three times in the dataset, once per product version tested.
- positive_symptom_domain: The SIPS positive symptom domain reflected by the prompt used. Five possible values: unusual_thought_content_delusions, suspiciousness_persecutory_ideas, grandiose_ideas, perceptual_abnormalities_hallucinations, and disorganized communication.
- condition: Whether the prompt used featured psychotic content. Two possible values: psychosis and control.
- model: The version of the ChatGPT product to which the prompt was presented to elicit a response. Three possible values: chatgpt_free, chatgpt_4o, and chatgpt_5_auto.
- rater: The rater for the prompt-response pair. Four possible values: r1 (the secondary rater, who rated 170 of the 474 prompt-response pairs), r2 (who rated all pairs), r3 (who rated all pairs), and consensus (the floor of the median of r2 and r3 for all pairs, and the value used in our analyses; see the sketch below the variable list).
- rating: The "appropriateness" rating assigned to the prompt-response pair. Three possible values: 0 (completely appropriate), 1 (somewhat appropriate), and 2 (completely inappropriate).
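The consensus definition can be illustrated with a minimal base-R sketch; the rating values below are hypothetical and not drawn from the dataset.

```r
# Hypothetical example of the consensus definition (floor of the median of the
# two primary raters' scores); these values are illustrative, not from the data.
r2_rating <- 1
r3_rating <- 2
consensus_rating <- floor(median(c(r2_rating, r3_rating)))  # median = 1.5, floor = 1
consensus_rating
#> [1] 1
```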
Code/software
The CSV file can be read and our analyses reproduced with R or similar software.
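As a minimal sketch (not our analysis code; the file path and the particular summaries shown are illustrative assumptions), the consensus ratings can be read and cross-tabulated by condition and product version in base R:

```r
# Read the long-format ratings file (file name as distributed; adjust the path if needed).
ratings <- read.csv("llm_psychosis_numeric_ratings.csv", stringsAsFactors = FALSE)

# Keep the consensus ratings used in the analyses.
consensus <- subset(ratings, rater == "consensus")

# Cross-tabulate appropriateness ratings by condition and product version.
with(consensus, table(condition, rating, model))

# Proportion of responses rated completely inappropriate (rating == 2),
# by condition and product version.
consensus$inappropriate <- as.integer(consensus$rating == 2)
aggregate(inappropriate ~ condition + model, data = consensus, FUN = mean)
```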
Methods

We created 79 psychotic prompts: first-person statements that an individual experiencing psychosis could plausibly make to ChatGPT. Each reflected one of the five positive symptom domains assessed by the Structured Interview for Psychosis-Risk Syndromes (SIPS): unusual thought content/delusional ideas (n = 16), suspiciousness/persecutory ideas (n = 17), grandiose ideas (n = 15), perceptual disturbances/hallucinations (n = 15), and disorganized communication (n = 16). For each psychotic prompt, we created a corresponding control prompt similar in length, sentence structure, and content but without psychotic elements, yielding 158 unique prompts in total. On August 28 and 29, 2025, we presented these prompts to three versions of the ChatGPT product: GPT-5 Auto (the paid default at the time of the experiment), GPT-4o (the previous paid default), and "Free" (the version accessible without a subscription or account), yielding 474 prompt-response pairs. Two primary raters assigned an "appropriateness" rating (0 = completely appropriate response, 1 = somewhat appropriate response, 2 = completely inappropriate response) to each pair using a standardized rubric. We took the floor of the median of their ratings to produce a conservative consensus score. We assessed inter-rater reliability both between the two primary raters and between the consensus score and a secondary rater who assessed a subset (n = 170) of the prompt-response pairs.
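The sketch below shows one way to examine rater agreement from the long-format file; it uses simple percent agreement for illustration, and the specific reliability statistic reported in the study may differ.

```r
# Read the long-format ratings file (adjust the path if needed).
ratings <- read.csv("llm_psychosis_numeric_ratings.csv", stringsAsFactors = FALSE)

# Agreement between the two primary raters across all 474 pairs.
r2 <- subset(ratings, rater == "r2", select = c(pair_id, rating))
r3 <- subset(ratings, rater == "r3", select = c(pair_id, rating))
primary <- merge(r2, r3, by = "pair_id", suffixes = c("_r2", "_r3"))
mean(primary$rating_r2 == primary$rating_r3)

# Agreement between the consensus score and the secondary rater (r1)
# on the 170-pair subset that r1 rated.
r1   <- subset(ratings, rater == "r1", select = c(pair_id, rating))
cons <- subset(ratings, rater == "consensus", select = c(pair_id, rating))
shared <- merge(r1, cons, by = "pair_id", suffixes = c("_r1", "_consensus"))
mean(shared$rating_r1 == shared$rating_consensus)
```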
