Dryad

Performance of GPT-4o mini and GPT-4o for medical text mining tasks at different temperature settings

Data files

Jan 13, 2025 version files 6.12 MB

Abstract

The application of natural language processing (NLP) for extracting data from biomedical research has gained momentum with the advent of large language models (LLMs). However, the effect of LLM parameters such as temperature on biomedical text mining remains underexplored, and there is no consensus on which settings can be considered “safe”. This study evaluates the impact of temperature settings on LLM performance for a named entity recognition task and a classification task in clinical trial publications. Two datasets that had been annotated by the author group as part of previous projects were used to create tasks for evaluating two LLMs, namely Generative Pretrained Transformer 4 Omni (GPT-4o, OpenAI, San Francisco, United States) and GPT-4o mini, at nine different temperature settings. In the first task, the LLMs were asked to extract the number of people who underwent randomization from the abstract of a publication reporting on a randomized clinical trial (RCT). In the second task, they were asked to classify an abstract according to whether it reported on an RCT and/or an oncology topic. The answers of the LLMs as well as the ground truth are provided in the dataset.
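
Because the dataset pairs each LLM answer with its ground truth, one natural way to use it is to compute accuracy per model and temperature setting. The following is a minimal sketch of that idea, not the authors' analysis code. The field names (model, temperature, answer, ground_truth) and the example rows are assumptions made for illustration; the actual file layout of the dataset may differ and should be checked before adapting this.

```python
from collections import defaultdict


def accuracy_by_setting(records):
    """Return {(model, temperature): share of answers equal to the ground truth}.

    Each record is a dict with the hypothetical keys "model", "temperature",
    "answer" and "ground_truth". Values are compared as stripped strings.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        key = (rec["model"], rec["temperature"])
        totals[key] += 1
        if str(rec["answer"]).strip() == str(rec["ground_truth"]).strip():
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Made-up rows for illustration only; they are not taken from the dataset.
example_records = [
    {"model": "gpt-4o", "temperature": 0.0, "answer": "120", "ground_truth": "120"},
    {"model": "gpt-4o", "temperature": 0.0, "answer": "98", "ground_truth": "100"},
    {"model": "gpt-4o-mini", "temperature": 1.0, "answer": "120", "ground_truth": "120"},
]

scores = accuracy_by_setting(example_records)
for (model, temperature), accuracy in sorted(scores.items()):
    print(f"{model} @ temperature {temperature}: accuracy {accuracy:.2f}")
```

Exact string matching suits the extraction task, where the answer is a single number. For the classification task, the same function could be applied separately to the RCT label and the oncology label, assuming the two labels are stored in separate fields.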