Performance of GPT-4o mini and GPT-4o for medical text mining tasks at different temperature settings
Data files
Jan 13, 2025 version files 6.12 MB
-
rct_oncology_gpt_responses_all.csv
2.99 MB
-
README.md
12.15 KB
-
sample_size_gpt_responses_all.csv
3.12 MB
Abstract
The application of natural language processing (NLP) for extracting data from biomedical research has gained momentum with the advent of large language models (LLMs). However, the effect of different LLM parameters, such as temperature settings, on biomedical text mining remains underexplored and a consensus on what settings can be considered “safe” is missing. This study evaluates the impact of temperature settings on LLM performance for a named entity recognition and a classification task in clinical trial publications. Two datasets that had been annotated as part of previous projects by the author group were used to create tasks for the evaluation of two LLMs, namely Generative Pretrained Transformer 4 Omni (GPT-4o, OpenAI, San Francisco, United States) and GPT-4o mini at nine different temperature settings. The LLMs were first asked to extract the number of people who underwent randomization from the abstract of a publication reporting on a randomized clinical trial (RCT). The second task was to classify an abstract regarding whether or not it was reported on an RCT and/or an oncology topic. The answers of the LLM as well as the ground truth are provided in the dataset.
README: Performance of GPT-4o mini and GPT-4o for medical text mining tasks at different temperature settings
https://doi.org/10.5061/dryad.crjdfn3dt
Description of the data and file structure
Files and variables
Note: Empty cells in the formatted responses mean that the raw response could not be formatted correctly (due to a hallucination of the model when it produced the raw response). As an example, the raw response might contain "746:@"-日本 中 鮬تيييسر" which causes the formatted response to be empty.
File: sample_size_gpt_responses_all.csv
Description: The dataset with the ground truth and the answers by the LLM for extracting the number of people who underwent randomization in each publication (task 1).
Variables
- doi: Digital Object Identifier of the publication
- date: Publication data according to PubMed
- journal: Journal the publication was published in according to PubMed
- title: Title of the publication according to PubMed
- abstract: Abstract of the publication according to PubMed
- Number_randomized: The ground truth (i.e. how many patients were randomized) according to the two human annotators
- GPT-4o-mini_temp000_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 0.00
- GPT-4o-mini_temp025_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 0.25
- GPT-4o-mini_temp050_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 0.50
- GPT-4o-mini_temp075_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 0.75
- GPT-4o-mini_temp100_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 1.00
- GPT-4o-mini_temp125_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 1.25
- GPT-4o-mini_temp150_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 1.50
- GPT-4o-mini_temp175_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 1.75
- GPT-4o-mini_temp200_response_raw: Raw response from GPT-4o mini regarding sample size at temperature 2.00
- GPT-4o-mini_temp000_response: Formatted response from GPT-4o mini regarding sample size at temperature 0.00
- GPT-4o-mini_temp025_response: Formatted response from GPT-4o mini regarding sample size at temperature 0.25
- GPT-4o-mini_temp050_response: Formatted response from GPT-4o mini regarding sample size at temperature 0.50
- GPT-4o-mini_temp075_response: Formatted response from GPT-4o mini regarding sample size at temperature 0.75
- GPT-4o-mini_temp100_response: Formatted response from GPT-4o mini regarding sample size at temperature 1.00
- GPT-4o-mini_temp125_response: Formatted response from GPT-4o mini regarding sample size at temperature 1.25
- GPT-4o-mini_temp150_response: Formatted response from GPT-4o mini regarding sample size at temperature 1.50
- GPT-4o-mini_temp175_response: Formatted response from GPT-4o mini regarding sample size at temperature 1.75
- GPT-4o-mini_temp200_response: Formatted response from GPT-4o mini regarding sample size at temperature 2.00
- GPT-4o_temp000_response_raw: Raw response from GPT-4o regarding sample size at temperature 0.00
- GPT-4o_temp025_response_raw: Raw response from GPT-4o regarding sample size at temperature 0.25
- GPT-4o_temp050_response_raw: Raw response from GPT-4o regarding sample size at temperature 0.50
- GPT-4o_temp075_response_raw: Raw response from GPT-4o regarding sample size at temperature 0.75
- GPT-4o_temp100_response_raw: Raw response from GPT-4o regarding sample size at temperature 1.00
- GPT-4o_temp125_response_raw: Raw response from GPT-4o regarding sample size at temperature 1.25
- GPT-4o_temp150_response_raw: Raw response from GPT-4o regarding sample size at temperature 1.50
- GPT-4o_temp175_response_raw: Raw response from GPT-4o regarding sample size at temperature 1.75
- GPT-4o_temp200_response_raw: Raw response from GPT-4o regarding sample size at temperature 2.00
- GPT-4o_temp000_response: Formatted response from GPT-4o regarding sample size at temperature 0.00
- GPT-4o_temp025_response: Formatted response from GPT-4o regarding sample size at temperature 0.25
- GPT-4o_temp050_response: Formatted response from GPT-4o regarding sample size at temperature 0.50
- GPT-4o_temp075_response: Formatted response from GPT-4o regarding sample size at temperature 0.75
- GPT-4o_temp100_response: Formatted response from GPT-4o regarding sample size at temperature 1.00
- GPT-4o_temp125_response: Formatted response from GPT-4o regarding sample size at temperature 1.25
- GPT-4o_temp150_response: Formatted response from GPT-4o regarding sample size at temperature 1.50
- GPT-4o_temp175_response: Formatted response from GPT-4o regarding sample size at temperature 1.75
- GPT-4o_temp200_response: Formatted response from GPT-4o regarding sample size at temperature 2.00
File: rct_oncology_gpt_responses_all.csv
Description: The dataset with the ground truth and the answers by the LLM for classifying whether each publication reported on a randomized controlled trial and/or an oncology topic (task 2).
Variables
- doi: Digital Object Identifier of the publication
- date: Publication data according to PubMed
- journal: Journal the publication was published in according to PubMed
- title: Title of the publication according to PubMed
- abstract: Abstract of the publication according to PubMed
- RCT_actual: The ground truth (i.e. if the abstract reports on a randomized controlled trial) according to the human annotator
- Oncology_actual: The ground truth (i.e. if the abstract reports on an oncology topic) according to the human annotator
- GPT-4o-mini_rct_onc_temp000_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 0.00
- GPT-4o-mini_rct_onc_temp025_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 0.25
- GPT-4o-mini_rct_onc_temp050_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 0.50
- GPT-4o-mini_rct_onc_temp075_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 0.75
- GPT-4o-mini_rct_onc_temp100_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 1.00
- GPT-4o-mini_rct_onc_temp125_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 1.25
- GPT-4o-mini_rct_onc_temp150_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 1.50
- GPT-4o-mini_rct_onc_temp175_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 1.75
- GPT-4o-mini_rct_onc_temp200_response_raw: Raw response from GPT-4o mini regarding rct/oncology at temperature 2.00
- GPT-4o_rct_onc_temp000_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 0.00
- GPT-4o_rct_onc_temp025_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 0.25
- GPT-4o_rct_onc_temp050_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 0.50
- GPT-4o_rct_onc_temp075_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 0.75
- GPT-4o_rct_onc_temp100_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 1.00
- GPT-4o_rct_onc_temp125_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 1.25
- GPT-4o_rct_onc_temp150_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 1.50
- GPT-4o_rct_onc_temp175_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 1.75
- GPT-4o_rct_onc_temp200_response_raw: Raw response from GPT-4o regarding rct/oncology at temperature 2.00
- GPT-4o_temp000_rct_response: Formatted response from GPT-4o regarding rct at temperature 0.00
- GPT-4o_temp000_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 0.00
- GPT-4o_temp025_rct_response: Formatted response from GPT-4o regarding rct at temperature 0.25
- GPT-4o_temp025_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 0.25
- GPT-4o_temp050_rct_response: Formatted response from GPT-4o regarding rct at temperature 0.50
- GPT-4o_temp050_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 0.50
- GPT-4o_temp075_rct_response: Formatted response from GPT-4o regarding rct at temperature 0.75
- GPT-4o_temp075_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 0.75
- GPT-4o_temp100_rct_response: Formatted response from GPT-4o regarding rct at temperature 1.00
- GPT-4o_temp100_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 1.00
- GPT-4o_temp125_rct_response: Formatted response from GPT-4o regarding rct at temperature 1.25
- GPT-4o_temp125_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 1.25
- GPT-4o_temp150_rct_response: Formatted response from GPT-4o regarding rct at temperature 1.50
- GPT-4o_temp150_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 1.50
- GPT-4o_temp175_rct_response: Formatted response from GPT-4o regarding rct at temperature 1.75
- GPT-4o_temp175_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 1.75
- GPT-4o_temp200_rct_response: Formatted response from GPT-4o regarding rct at temperature 2.00
- GPT-4o_temp200_oncology_response: Formatted response from GPT-4o regarding oncology at temperature 2.00
- GPT-4o-mini_temp000_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 0.00
- GPT-4o-mini_temp000_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 0.00
- GPT-4o-mini_temp025_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 0.25
- GPT-4o-mini_temp025_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 0.25
- GPT-4o-mini_temp050_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 0.50
- GPT-4o-mini_temp050_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 0.50
- GPT-4o-mini_temp075_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 0.75
- GPT-4o-mini_temp075_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 0.75
- GPT-4o-mini_temp100_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 1.00
- GPT-4o-mini_temp100_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 1.00
- GPT-4o-mini_temp125_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 1.25
- GPT-4o-mini_temp125_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 1.25
- GPT-4o-mini_temp150_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 1.50
- GPT-4o-mini_temp150_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 1.50
- GPT-4o-mini_temp175_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 1.75
- GPT-4o-mini_temp175_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 1.75
- GPT-4o-mini_temp200_rct_response: Formatted response from GPT-4o mini regarding rct at temperature 2.00
- GPT-4o-mini_temp200_oncology_response: Formatted response from GPT-4o mini regarding oncology at temperature 2.00
Code/software
The code used to compare the performances of the different temperature settings is provided in the analysis.ipynb file at https://github.com/windisch-paul/temperature/blob/main/analysis.ipynb.
Access information
Other publicly accessible locations of the data:
Methods
Two datasets that had been annotated as part of previous projects by the author group were used to create tasks for the evaluation of two LLMs, namely Generative Pretrained Transformer 4 Omni (GPT-4o, OpenAI, San Francisco, United States) and GPT-4o mini at nine different temperature settings (0.00, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00). The respective versions that were used were gpt-4o-2024-05-13 and gpt-4o-mini-2024-07-18.
The first task was to extract the number of people who underwent randomization from the abstract of a publication reporting on a randomized clinical trial (RCT). To this end, a random sample of 996 randomized controlled trials (RCTs) from seven major journals (British Medical Journal, JAMA, JAMA Oncology, Journal of Clinical Oncology, Lancet, Lancet Oncology, New England Journal of Medicine) published between 2010 and 2022 were labeled. The abstracts were retrieved as a txt file from PubMed and parsed using regular expressions (i.e., expressions that match certain patterns in text). For each trial, the number of randomized trial participants was retrieved by looking at the abstract, followed by the full publication if the number could not be determined with certainty from the abstract. Two physician annotators carried out the annotation independently and conflicts were resolved by discussing the differences afterwards. This dataset is also available on Dryad at https://doi.org/10.5061/dryad.g1jwstr0b.
The LLMs were then called via the application programming interface (API) with the aforementioned temperatures and max_tokens set to 10 to stop the LLM in case of hallucinations. All other API parameters were left at their default. The system prompt was the following: “You will be provided with the abstract of a randomized controlled clinical trial. Your task will be to extract the number of people who underwent randomization. If this number is not explicitly mentioned, you may use other numerical information (e.g. the number of total participants or adding up the number of patients in each arm). Please return only the number as a single integer. If no information is available, please return null."
The user prompt was the respective abstract. The raw responses were stored and afterward, each raw response was converted into an integer unless the conversion failed, e.g. due to the raw response being equal to “null” or due to non-numerical hallucinations.
The second task was to classify an abstract regarding whether or not it was reported on an RCT and/or an oncology topic. To this end, a random sample of 900 publications from the aforementioned seven major journals published between 2010 and 2022 were annotated. Publications that described RCTs received the label “RCT”. Publications that covered oncological topics received the label “ONCOLOGY”. Trials that fulfilled both criteria were assigned both labels. Trials that were neither RCTs nor covered oncology topics were assigned no label. The two labels were chosen as each label poses different requirements to the LLM: For the oncology label, the model does not need a deep contextual understanding but can rather make a prediction based on the presence of certain words that are associated with oncology publications, such as “cancer” or words related to staging and antineoplastic therapies. In order to assign the RCT label, the model can not simply rely on the presence of words and phrases like “randomized” or “primary endpoint” as these might also be present in other articles such as meta-analyses of randomized controlled trials.
Annotation was based on the title and abstract, which were also retrieved as a txt file from PubMed and parsed using regular expressions. Due to the relatively simple annotation process, annotation was carried out by a single physician annotator. This dataset is also available on Dryad at https://doi.org/10.5061/dryad.gb5mkkx00.
The API call to the LLMs used the same settings as for the first task. The user prompt was again the abstract. The system prompt was the following: "You will be provided with the abstract of a medical publication. Your task will be to determine if the abstract reports on a randomized controlled trial. If the abstract reports on a systematic review or meta-analysis of randomized controlled trials or a commentary/editorial, return false. In addition, you will be asked to determine if the abstract focusses on an oncology topic which includes all papers dealing with the prevention, diagnosis or treatment of solid or hematologic cancers. Your response should be a list of two boolean values (True or False), the first indicating if the paper is an RCT and the second indicating if the paper is oncology-related. The list should be enclosed in brackets and separated by a comma, e.g. [True, False]."
The raw responses were stored and afterwards, the two boolean values were extracted unless the extraction failed due to incorrect formatting.