Can large language models address unmet patient information needs and reduce provider burnout in the management of thyroid disease?
Impact factor: 2.7
Category quartile: Medicine Q2 / Surgery Q2
Publication date: January 2025
Authors:
Rajam Raghunathan, Anna R Jacobs, Vivek R Sant, Lizabeth J King, Gary Rothberger, Jason Prescott, John Allendorf, Carolyn D Seib, Kepal N Patel, Insoo Suh
DOI:
10.1016/j.surg.2024.06.075
Abstract
Patient electronic messaging has increased clinician workload, contributing to burnout. Large language models can respond to these patient queries, but no studies exist on large language model responses in thyroid disease. This cross-sectional study randomly selected 33 of 52 patient questions found on Reddit/askdocs. Questions were found through a "thyroid + cancer" or "thyroid + disease" search and had verified-physician responses. Additional responses were generated using ChatGPT-3.5 and GPT-4. Questions and responses were anonymized and graded for accuracy, quality, and empathy on a 4-point Likert scale by blinded providers, including 4 surgeons, 1 endocrinologist, and 2 physician assistants (n = 7). Results were analyzed using a single-factor analysis of variance.

For accuracy, scores averaged 2.71/4 (standard deviation 1.04), 3.49/4 (0.391), and 3.66/4 (0.286) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = completely true information, 3 = greater than 50% true information, and 2 = less than 50% true information. For quality, the scores were 2.37/4 (standard deviation 0.661), 2.98/4 (0.352), and 3.81/4 (0.36) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = provided information beyond what was asked, 3 = completely answers the question, and 2 = partially answers the question. For empathy, the mean scores were 2.37/4 (standard deviation 0.661), 2.80/4 (0.582), and 3.14/4 (0.578) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = anticipates and infers patient feelings from the expressed question, 3 = mirrors the patient's feelings, and 2 = contains no dismissive comments. Responses by GPT were ranked first 95% of the time.

Large language model responses to patient queries about thyroid disease have the potential to be more accurate, complete, empathetic, and consistent than physician responses.
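The statistical comparison above is a single-factor (one-way) ANOVA across the three rater-score groups. As a minimal sketch of how that F statistic is computed, the snippet below implements the textbook one-way ANOVA formula in plain Python; the Likert scores it uses are hypothetical stand-ins, not the study's data.

```python
def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA across the given groups of scores."""
    k = len(groups)                                 # number of groups
    n_total = sum(len(g) for g in groups)           # total observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: scatter of scores around their own group mean
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical 4-point Likert accuracy ratings for the three response sources
physician = [2, 3, 2, 4, 2, 3]
gpt35 = [3, 4, 3, 4, 3, 4]
gpt4 = [4, 4, 3, 4, 4, 4]
f_stat = one_way_anova_f(physician, gpt35, gpt4)
```

The resulting F value would then be compared against the F distribution with (k - 1, N - k) degrees of freedom to obtain the P value reported in the abstract; in practice one would use a statistics package (e.g. `scipy.stats.f_oneway`) rather than hand-rolling the formula.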