Can large language models address unmet patient information needs and reduce provider burnout in the management of thyroid disease?
Impact factor: 2.7
Category quartile: Medicine Q2 / Surgery Q2
Publication date: January 2025
Authors:
Rajam Raghunathan, Anna R Jacobs, Vivek R Sant, Lizabeth J King, Gary Rothberger, Jason Prescott, John Allendorf, Carolyn D Seib, Kepal N Patel, Insoo Suh
DOI:
10.1016/j.surg.2024.06.075
Abstract
Patient electronic messaging has increased clinician workload, contributing to burnout. Large language models can respond to these patient queries, but no studies exist on large language model responses in thyroid disease. This cross-sectional study randomly selected 33 of 52 patient questions found on Reddit/askdocs. Questions were found through a "thyroid + cancer" or "thyroid + disease" search and had verified-physician responses. Additional responses were generated using ChatGPT-3.5 and GPT-4. Questions and responses were anonymized and graded for accuracy, quality, and empathy on a 4-point Likert scale by blinded providers, including 4 surgeons, 1 endocrinologist, and 2 physician assistants (n = 7). Results were analyzed using a single-factor analysis of variance.

For accuracy, scores averaged 2.71/4 (standard deviation 1.04), 3.49/4 (0.391), and 3.66/4 (0.286) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = completely true information, 3 = greater than 50% true information, and 2 = less than 50% true information. For quality, the scores were 2.37/4 (standard deviation 0.661), 2.98/4 (0.352), and 3.81/4 (0.36) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = provided information beyond what was asked, 3 = completely answers the question, and 2 = partially answers the question. For empathy, the mean scores were 2.37/4 (standard deviation 0.661), 2.80/4 (0.582), and 3.14/4 (0.578) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = anticipates and infers patient feelings from the expressed question, 3 = mirrors the patient's feelings, and 2 = contains no dismissive comments. Responses by GPT were ranked first 95% of the time.

Large language model responses to patient queries about thyroid disease have the potential to be more accurate, complete, empathetic, and consistent than physician responses.
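The statistical comparison above is a single-factor (one-way) ANOVA across the three rater-score groups. As a minimal sketch of how that F statistic is computed, the snippet below implements the textbook one-way ANOVA formula in plain Python; the Likert scores it uses are hypothetical stand-ins, not the study's data.

```python
def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA across the given groups of scores."""
    k = len(groups)                                 # number of groups
    n_total = sum(len(g) for g in groups)           # total observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: scatter of scores around their own group mean
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical 4-point Likert accuracy ratings for the three response sources
physician = [2, 3, 2, 4, 2, 3]
gpt35 = [3, 4, 3, 4, 3, 4]
gpt4 = [4, 4, 3, 4, 4, 4]
f_stat = one_way_anova_f(physician, gpt35, gpt4)
```

The resulting F value would then be compared against the F distribution with (k - 1, N - k) degrees of freedom to obtain the P value reported in the abstract; in practice one would use a statistics package (e.g. `scipy.stats.f_oneway`) rather than hand-rolling the formula.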