Human review for post-training improvement of low-resource language performance in large language models
Data files (Apr 25, 2024 version, 23.27 KB)
- README.md (888 B)
- VaccineBotTranscriptReviewDataset.xlsx (22.38 KB)
Abstract
Large language models (LLMs) have significantly improved natural language processing, holding the potential to support health workers and their clients directly. Unfortunately, there is a substantial and variable drop in performance for low-resource languages. Here we present results from an exploratory case study in Malawi, aiming to enhance the performance of LLMs in Chichewa through innovative prompt engineering techniques. By focusing on practical evaluations over traditional metrics, we assess the subjective utility of LLM outputs, prioritizing end-user satisfaction. Our findings suggest that tailored prompt engineering may improve LLM utility in underserved linguistic contexts, offering a promising avenue to bridge the language inclusivity gap in digital health interventions.
https://doi.org/10.5061/dryad.4xgxd25jb
This dataset comprises a single Excel file containing transcript review survey results reported by a cohort of community health volunteers (CHVs) in Malawi. Each CHV was asked to review and rate four pre-assigned transcripts, each generated by one of five Chichewa-speaking chatbot variations that differed in the temperature parameter and/or the system prompt. The survey was designed to collect CHV feedback on the quality of the Chichewa spoken by the chatbot, given the performance gap between higher- and lower-resource languages. Results suggest that specific prompt engineering techniques may improve foundation model utility when conversing in low-resource languages.
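For reference, the survey results can be read directly from the Excel file with standard tooling. The sketch below uses pandas and simply inspects the file's structure; it makes no assumptions about specific column names.

```python
# Minimal sketch for loading the transcript review survey results.
# Requires pandas and openpyxl (pip install pandas openpyxl).
import pandas as pd

reviews = pd.read_excel("VaccineBotTranscriptReviewDataset.xlsx")

# Inspect the structure before any analysis.
print(reviews.shape)
print(reviews.columns.tolist())
print(reviews.head())
```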
We compared the reported performance of five variations of an LLM-based chatbot prototype through a two-step process. First, a cohort of 24 target end users, community health volunteers (CHVs) in Malawi, was recruited to generate transcripts by interacting with the prototypes. Second, an additional cohort of 22 CHVs was recruited to evaluate the transcripts and provide subjective feedback on the quality and utility of the language in the model’s responses.
CHVs in the first cohort were preassigned to one of the five chatbot variations and instructed to generate a single transcript by interacting with their assigned variation. A second group of CHVs was then recruited, also through convenience sampling, to conduct the transcript review. The intended cohort was 25 participants; however, three were unable to attend, so a total of 22 CHVs participated. Each was instructed to review and rate four of the transcripts generated by the first cohort. Prior to the review, duplicate transcripts and those of insufficient length were excluded from the evaluation pool. We used a stratified allocation method to assign transcripts to participants, ensuring that no participant received the same transcript twice or more than one transcript from the same bot variation. The order in which each CHV reviewed their assigned transcripts was then randomized, and transcripts were printed and labeled with a unique 4-digit identification code for CHVs to reference when providing their ratings. Participants were blinded to the authors of the original transcripts as well as to the chatbot variation used.
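As an illustration of the allocation constraints described above, the following sketch shows one way such a stratified assignment could be implemented. It is not the procedure actually used in the study: the transcript identifiers and variation labels are hypothetical, and the sketch does not balance how often each transcript is reviewed across the cohort.

```python
import random
from collections import defaultdict

def assign_transcripts(transcripts, reviewers, per_reviewer=4, seed=0):
    """Assign transcripts to reviewers so that no reviewer receives the
    same transcript twice or more than one transcript from the same bot
    variation, then randomize the review order.

    transcripts: list of (transcript_id, bot_variation) tuples
    reviewers:   list of reviewer identifiers
    """
    rng = random.Random(seed)

    # Group transcript IDs by bot variation (the strata).
    by_variation = defaultdict(list)
    for transcript_id, variation in transcripts:
        by_variation[variation].append(transcript_id)

    assignments = {}
    for reviewer in reviewers:
        # Pick distinct variations, then one transcript from each, which
        # guarantees no duplicate transcripts or repeated variations.
        variations = rng.sample(sorted(by_variation), per_reviewer)
        chosen = [rng.choice(by_variation[v]) for v in variations]
        rng.shuffle(chosen)  # randomize the order of review
        assignments[reviewer] = chosen
    return assignments

# Hypothetical example: 20 transcripts across 5 bot variations, 22 reviewers.
transcripts = [(f"{1000 + i}", f"bot_{i % 5 + 1}") for i in range(20)]
reviewers = [f"CHV_{j + 1}" for j in range(22)]
print(assign_transcripts(transcripts, reviewers))
```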
CHVs were asked to review each transcript in their assigned order and to complete both a brief demographic survey and a language survey. The language survey dataset is shared here. All survey data were de-identified prior to analysis. Ratings were removed if the 4-digit transcript identification code entered into the survey by a CHV did not match the predetermined assignments.
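The exclusion rule for mismatched codes can be expressed as a simple join against the assignment records. The sketch below is illustrative only: the column names ("reviewer_id", "transcript_id") and the separate assignments table are assumptions and are not part of the released dataset.

```python
import pandas as pd

# Hypothetical inputs: the released survey file plus an assignment table
# reconstructed from the study records (not included in this dataset).
reviews = pd.read_excel("VaccineBotTranscriptReviewDataset.xlsx")
assignments = pd.read_csv("assignments.csv")  # columns: reviewer_id, transcript_id

# Keep only ratings whose entered 4-digit code matches a predetermined
# reviewer-transcript assignment; all other ratings are dropped.
valid = reviews.merge(assignments, on=["reviewer_id", "transcript_id"], how="inner")
print(f"Removed {len(reviews) - len(valid)} ratings with mismatched codes.")
```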