Research Updates
The articles below are published ahead of final publication in an issue. Please cite articles in the following format: authors, (year), title, journal, DOI.


Evaluating the Efficacy of ChatGPT as a Patient Education Tool in Prostate Cancer: Multimetric Assessment.

Publication date: 2024 Aug 14
Authors: Damien Gibson, Stuart Jackson, Ramesh Shanmugasundaram, Ishith Seth, Adrian Siu, Nariman Ahmadi, Jonathan Kam, Nicholas Mehan, Ruban Thanigasalam, Nicola Jeffery, Manish I Patel, Scott Leslie
Source: JOURNAL OF MEDICAL INTERNET RESEARCH

Abstract:

Artificial intelligence (AI) chatbots, such as ChatGPT, have made significant progress. These chatbots, particularly popular among health care professionals and patients, are transforming patient education and the disease experience with personalized information. Accurate, timely patient education is crucial for informed decision-making, especially regarding prostate-specific antigen screening and treatment options. However, the accuracy and reliability of AI chatbots' medical information must be rigorously evaluated. Studies testing ChatGPT's knowledge of prostate cancer are emerging, but there is a need for ongoing evaluation to ensure the quality and safety of information provided to patients.

This study aims to evaluate the quality, accuracy, and readability of ChatGPT-4's responses to common prostate cancer questions posed by patients.

Overall, 8 questions were formulated with an inductive approach based on information topics in peer-reviewed literature and Google Trends data. Adapted versions of the Patient Education Materials Assessment Tool for AI (PEMAT-AI), Global Quality Score, and DISCERN-AI tools were used by 4 independent reviewers to assess the quality of the AI responses. The 8 AI outputs were judged by 7 expert urologists, using an assessment framework developed to assess accuracy, safety, appropriateness, actionability, and effectiveness. The AI responses' readability was assessed using established algorithms (Flesch Reading Ease score, Gunning Fog Index, Flesch-Kincaid Grade Level, Coleman-Liau Index, and Simple Measure of Gobbledygook [SMOG] Index). A brief tool (Reference Assessment AI [REF-AI]) was developed to analyze the references provided by AI outputs, assessing for reference hallucination, relevance, and quality of references.

The PEMAT-AI understandability score was very good (mean 79.44%, SD 10.44%), the DISCERN-AI rating was scored as "good" quality (mean 13.88, SD 0.93), and the Global Quality Score was high (mean 4.46/5, SD 0.50). The Natural Language Assessment Tool for AI had a pooled mean accuracy of 3.96 (SD 0.91), safety of 4.32 (SD 0.86), appropriateness of 4.45 (SD 0.81), actionability of 4.05 (SD 1.15), and effectiveness of 4.09 (SD 0.98). The readability algorithm consensus was "difficult to read" (Flesch Reading Ease score mean 45.97, SD 8.69; Gunning Fog Index mean 14.55, SD 4.79), averaging an 11th-grade reading level, equivalent to 15- to 17-year-olds (Flesch-Kincaid Grade Level mean 12.12, SD 4.34; Coleman-Liau Index mean 12.75, SD 1.98; SMOG Index mean 11.06, SD 3.20). REF-AI identified 2 reference hallucinations, while the majority (28/30, 93%) of references appropriately supplemented the text. Most references (26/30, 86%) were from reputable government organizations, while a handful were direct citations from scientific literature.

Our analysis found that ChatGPT-4 provides generally good responses to common prostate cancer queries, making it a potentially valuable tool for patient education in prostate cancer care. Objective quality assessment tools indicated that the natural language processing outputs were generally reliable and appropriate, but there is room for improvement.

©Damien Gibson, Stuart Jackson, Ramesh Shanmugasundaram, Ishith Seth, Adrian Siu, Nariman Ahmadi, Jonathan Kam, Nicholas Mehan, Ruban Thanigasalam, Nicola Jeffery, Manish I Patel, Scott Leslie. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 14.08.2024.
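The readability metrics cited above are computed from surface statistics of the text, chiefly average words per sentence and syllables per word. As a rough illustration only, and not the tooling used by the authors, the following minimal Python sketch applies the published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas with a crude vowel-group syllable heuristic; the sample sentence is hypothetical, and dedicated readability tools use more careful tokenization and syllable counting, so their scores will differ.

    import re

    def count_syllables(word):
        # Crude syllable estimate: count runs of consecutive vowels, at least 1 per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_scores(text):
        # Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a block of text.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z]+", text)
        syllables = sum(count_syllables(w) for w in words)
        wps = len(words) / len(sentences)   # average words per sentence
        spw = syllables / len(words)        # average syllables per word
        reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
        grade_level = 0.39 * wps + 11.8 * spw - 15.59
        return reading_ease, grade_level

    # Hypothetical sample sentence (not taken from the study's AI outputs).
    ease, grade = flesch_scores("Prostate-specific antigen testing measures a protein made by the prostate gland.")
    print(f"Flesch Reading Ease: {ease:.1f}, Flesch-Kincaid Grade Level: {grade:.1f}")

Lower Flesch Reading Ease values and higher grade-level values both indicate harder text, which is how the study maps its scores to an approximate 11th-grade reading level.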