Integrating human expertise
Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making.
Publication date: 2024 May 26
Authors:
Elena Sblendorio, Vincenzo Dentamaro, Alessio Lo Cascio, Francesco Germini, Michela Piredda, Giancarlo Cicolini
Source:
BIOMEDICINE & PHARMACOTHERAPY
Abstract:
Recent enhancements in Large Language Models (LLMs) such as ChatGPT have exponentially increased user adoption. These models are accessible on mobile devices and support multimodal interactions, including conversations, code generation, and patient image uploads, broadening their utility in providing healthcare professionals with real-time support for clinical decision-making. Nevertheless, many authors have highlighted serious risks that may arise from the adoption of LLMs, principally related to safety and alignment with ethical guidelines.

To address these challenges, we introduce a novel methodological approach designed to assess the specific feasibility of adopting LLMs within a healthcare area, with a focus on clinical nursing, evaluating their performance and thereby guiding their selection. Emphasizing LLMs' adherence to scientific advancements, this approach prioritizes safety and care personalization, in line with the "Organization for Economic Co-operation and Development" (OECD) frameworks for responsible AI. Moreover, its dynamic nature is designed to adapt to future evolutions of LLMs.

Through the integration of advanced multidisciplinary knowledge, including Nursing Informatics, and aided by a prospective literature review, seven key domains and specific evaluation items were identified. A peer review by experts in Nursing and AI was performed, ensuring scientific rigor and breadth of insight for an essential, reproducible, and coherent methodological approach. Using a 7-point Likert scale, thresholds were defined to classify LLMs into three categories: "unusable", "usable with high caution", and "recommended". Nine state-of-the-art LLMs were evaluated with this methodology in clinical oncology nursing decision-making, producing preliminary results. Gemini Advanced, Anthropic Claude 3, and ChatGPT 4 achieved the minimum State of the Art Alignment & Safety domain score required for classification as "recommended" and were also endorsed across all domains. LLAMA 3 70B and ChatGPT 3.5 were classified as "usable with high caution"; the others were classified as unusable in this domain.

The identification of a recommended LLM for a specific healthcare area, combined with its critical, prudent, and integrative use, can support healthcare professionals in decision-making processes.

Copyright © 2024. Published by Elsevier B.V.