A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data
Abstract
Objective:
Our objective is to evaluate the efficacy of ChatGPT 4 in accurately and effectively delivering genetic information, building on previous findings with ChatGPT 3.5. We focus on assessing the utility, limitations, and ethical implications of using ChatGPT in medical settings.
Materials and Methods:
A structured questionnaire, including questions from the Brief User Survey (BUS-15) and custom questions, was developed to assess ChatGPT 4's clinical value. An expert panel of genetic counselors and clinical geneticists independently evaluated ChatGPT 4's responses to these questions. We also conducted a comparative analysis with ChatGPT 3.5, using descriptive statistics computed in R.
Results:
ChatGPT 4 demonstrated improvements over 3.5 in context recognition, relevance, and informativeness. However, performance variability and concerns about the naturalness of its output were noted. No significant difference in accuracy was found between ChatGPT 3.5 and 4. Notably, the efficacy of ChatGPT 4 varied significantly across genetic conditions, with specific differences identified between responses related to BRCA1 and HFE.
Discussion and Conclusion:
This study highlights ChatGPT 4's potential in genomics, noting significant advancements over its predecessor. Despite these improvements, challenges remain, including the risk of outdated information and the necessity of ongoing refinement. The variability in performance across genetic conditions underscores the need for expert oversight and continuous AI training. While ChatGPT 4 shows promise, its use emphasizes the importance of balancing technological innovation with ethical responsibility in healthcare information delivery.
README: A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data
https://doi.org/10.5061/dryad.s4mw6m9cv
This data was captured when evaluating the ability of ChatGPT to address questions patients may ask it about three genetic conditions (associated with BRCA1, HFE, and MLH1). The data is associated with the similarly titled JAMIA article, DOI: 10.1093/jamia/ocae128.
Description of the data and file structure
- Key: This tab documents the data structure, explaining the survey questions and the response options available.
- Prompt Responses: This tab contains the prompts given to ChatGPT and the response provided by each model (3.5 and 4).
- GPT 4 Results: This tab provides the responses collected from the medical experts (genetic counselors and clinical geneticists) via the Qualtrics survey.
- Accuracy (Qx_1): This tab contains the subset of results from both the ChatGPT 3.5 paper (pre-print) and the ChatGPT 4 paper rating the quality of the answer given to the initial prompt.
- Relevancy (Qx_2): This tab contains the subset of results from the ChatGPT 4 evaluation rating answer relevancy.
- BUS-15: This tab contains the results for the Brief User Survey (BUS-15) questions. Seven of the 15 questions were used.
Sharing/Access information
Data was derived from the following sources:
- Survey instrument built on the Qualtrics platform
Code/Software
The data was exported from Qualtrics into Excel.
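To load the workbook for analysis, a minimal R sketch follows (R being the language used for the study's analysis). The file name "full_study_data.xlsx" is an assumption, not the actual file name; the sheet names follow the tabs described above.

    # Minimal sketch (not the authors' script): read each tab of the
    # Qualtrics export into R. The file name is assumed; adjust as needed.
    library(readxl)

    file <- "full_study_data.xlsx"  # assumed name for the exported workbook
    key       <- read_excel(file, sheet = "Key")
    prompts   <- read_excel(file, sheet = "Prompt Responses")
    gpt4      <- read_excel(file, sheet = "GPT 4 Results")
    accuracy  <- read_excel(file, sheet = "Accuracy (Qx_1)")
    relevancy <- read_excel(file, sheet = "Relevancy (Qx_2)")
    bus15     <- read_excel(file, sheet = "BUS-15")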
Methods
Study Design
This study was conducted to evaluate the performance of ChatGPT 4 (March 23, 2023 model) in the context of genetic counseling and education. The evaluation involved a structured questionnaire, which included questions selected from the Brief User Survey (BUS-15) and additional custom questions designed to assess the clinical value of ChatGPT 4's responses.
Questionnaire Development
The questionnaire was built in Qualtrics and comprised twelve questions, including seven selected from the BUS-15 preceded by two additional questions that we designed.
The initial questions focused on quality and answer relevancy:
1. The overall quality of the Chatbot’s response is: (5-point Likert: Very poor to Very good)
2. The Chatbot delivered an answer that provided the relevant information you would include if asked the question. (5-point Likert: Strongly disagree to Strongly agree)
The BUS-15 questions (7-point Likert: Strongly disagree to Strongly agree; see the scoring sketch after this list) focused on:
1. Recognition and facilitation of users’ goal and intent: The chatbot seems able to recognize the user’s intent and guide the user to its goals.
2. Relevance of information: The chatbot provides relevant and appropriate information/answers to people at each stage to move them closer to their goal.
3. Maxim of quantity: The chatbot responds in an informative way without adding too much information.
4. Resilience to failure: The chatbot seems able to find ways to respond appropriately even when it encounters situations or arguments it is not equipped to handle.
5. Understandability and politeness: The chatbot seems able to understand input and convey correct statements and answers without ambiguity and with acceptable manners.
6. Perceived conversational credibility: The chatbot responds in a credible and informative way without adding too much information.
7. Meet the neurodiverse needs: The chatbot seems able to meet needs and be used by users independently from their health conditions, well-being, age, etc.
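Because responses are recorded as Likert labels, computing descriptive statistics requires recoding them to numbers. The sketch below shows one way to do this in R; the intermediate label wording and the column name are assumptions for illustration (only the scale endpoints are specified above), so consult the Key tab for the actual coding.

    # Hedged sketch: map the 7-point Likert labels to integers 1-7 and
    # summarize one item. The intermediate labels and the column name
    # "Q3_1" are assumed for illustration; see the Key tab.
    likert7 <- c("Strongly disagree", "Disagree", "Somewhat disagree",
                 "Neither agree nor disagree", "Somewhat agree",
                 "Agree", "Strongly agree")
    to_score <- function(x) as.integer(factor(x, levels = likert7, ordered = TRUE))

    scores <- to_score(bus15$Q3_1)  # bus15 as loaded in the earlier sketch
    mean(scores, na.rm = TRUE)
    sd(scores, na.rm = TRUE)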
Expert Panel and Data Collection
A panel of experts (two genetic counselors and two clinical geneticists) was provided with a link to the survey containing the questions. They independently evaluated the responses from ChatGPT 4 without discussing the questions or answers among themselves until after submitting the survey, an approach intended to minimize bias.