Dryad

Evaluation of large language model chatbot responses to psychotic prompts: numerical ratings of prompt-response pairs

Data files

Nov 19, 2025 version files 108.24 KB


Abstract

The large language model (LLM) "chatbot" product ChatGPT has accumulated 800 million weekly users since its 2022 launch. In 2025, several media outlets reported on individuals in whom apparent psychotic symptoms emerged or worsened in the context of using ChatGPT. Because LLM chatbots are trained to align with user input and generate encouraging responses, they may have difficulty responding appropriately to psychotic content. To assess whether ChatGPT can reliably generate appropriate responses to prompts reflecting psychotic symptoms, we conducted a cross-sectional, experimental study of how multiple versions of the ChatGPT product respond to psychotic and control prompts, with blinded clinician ratings of response appropriateness. We found that all three tested versions of ChatGPT were much more likely to generate inappropriate responses to psychotic prompts than to control prompts, with the "Free" product showing the poorest performance. In an exploratory analysis, prompts reflecting grandiosity or disorganized communication were more likely to elicit inappropriate responses than those reflecting delusions.