Can large language models address unmet patient information needs and reduce provider burnout in the management of thyroid disease?
Impact factor: 2.7
Category: Medicine Q2 / Surgery Q2
Published: Jan 2025
Authors:
Rajam Raghunathan, Anna R Jacobs, Vivek R Sant, Lizabeth J King, Gary Rothberger, Jason Prescott, John Allendorf, Carolyn D Seib, Kepal N Patel, Insoo Suh
Abstract
Patient electronic messaging has increased clinician workload, contributing to burnout. Large language models can respond to these patient queries, but no studies exist on large language model responses in thyroid disease.

This cross-sectional study randomly selected 33 of 52 patient questions found on Reddit/askdocs. Questions were found through a "thyroid + cancer" or "thyroid + disease" search and had verified-physician responses. Additional responses were generated using ChatGPT-3.5 and GPT-4. Questions and responses were anonymized and graded for accuracy, quality, and empathy on a 4-point Likert scale by blinded providers (n = 7): 4 surgeons, 1 endocrinologist, and 2 physician assistants. Results were analyzed using a single-factor analysis of variance.

For accuracy, the mean scores were 2.71/4 (standard deviation 1.04), 3.49/4 (0.391), and 3.66/4 (0.286) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = completely true information, 3 = greater than 50% true information, and 2 = less than 50% true information. For quality, the mean scores were 2.37/4 (standard deviation 0.661), 2.98/4 (0.352), and 3.81/4 (0.36) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = provided information beyond what was asked, 3 = completely answers the question, and 2 = partially answers the question. For empathy, the mean scores were 2.37/4 (standard deviation 0.661), 2.80/4 (0.582), and 3.14/4 (0.578) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = anticipates and infers patient feelings from the expressed question, 3 = mirrors the patient's feelings, and 2 = contains no dismissive comments. Responses by GPT were ranked first 95% of the time.

Large language model responses to patient queries about thyroid disease have the potential to be more accurate, complete, empathetic, and consistent than physician responses.
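The statistical method named above, a single-factor (one-way) analysis of variance across the three responder groups, can be sketched as follows. This is a minimal illustration only: the score lists are invented for demonstration and are not the study's data, and the study itself does not publish its analysis code.

```python
# Illustrative one-way (single-factor) ANOVA, mirroring the study's comparison
# of Likert ratings for physician, GPT-3.5, and GPT-4 responses.
# NOTE: the score lists below are hypothetical, not the study's actual data.

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA across the given groups."""
    k = len(groups)                              # number of groups
    n = sum(len(g) for g in groups)              # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread of observations around each group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)            # between-group mean square
    ms_within = ss_within / (n - k)              # within-group mean square
    return ms_between / ms_within

physician = [2, 3, 2, 3, 2, 4]  # hypothetical 4-point accuracy ratings
gpt35     = [3, 4, 3, 4, 3, 4]
gpt4      = [4, 4, 3, 4, 4, 4]

f_stat = one_way_anova_f([physician, gpt35, gpt4])
print(round(f_stat, 2))  # → 5.74
```

A large F statistic (relative to the F distribution with k−1 and n−k degrees of freedom) corresponds to a small P value, which is how the study supports its reported P < .01 group differences. In practice a library routine such as SciPy's `f_oneway` would return the P value directly.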