Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations.
Publication date: 2023 Aug 15
Authors:
Rohaid Ali, Oliver Y Tang, Ian D Connolly, Patricia L Zadnik Sullivan, John H Shin, Jared S Fridley, Wael F Asaad, Deus Cielo, Adetokunbo A Oyelese, Curtis E Doberstein, Ziya L Gokaslan, Albert E Telfeian
Source:
NEUROSURGERY
Abstract:
Interest surrounding generative large language models (LLMs) has rapidly grown. Although ChatGPT (GPT-3.5), a general-purpose LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT and its successor GPT-4 on specialized examinations, and the factors affecting their accuracy, remain unclear. This study assessed the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination. The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were presented in single-best-answer, multiple-choice format. χ2, Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics. ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although ChatGPT's score was equivalent to that of question bank users (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, as well as 37.6% (50/133) of the questions ChatGPT answered incorrectly. Across the 12 question categories, GPT-4 significantly outperformed users in every category, performed comparably with ChatGPT in 3 (functional, other general, and spine), and outperformed both users and ChatGPT on tumor questions. Increased word count (odds ratio = 0.89 of answering a question correctly per +10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; on questions containing images, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly from contextual clues alone. LLMs achieved passing scores on a 500-question mock neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT. Copyright © Congress of Neurological Surgeons 2023. All rights reserved.
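
As a rough illustration of the kind of item-level analysis the abstract describes (a χ2 comparison of model accuracies and univariable logistic regression of correctness against question characteristics), the sketch below uses simulated data in Python. The data frame, column names, and numbers are assumptions for illustration only, not the authors' dataset or analysis code.

```python
# Minimal sketch, not the authors' code: simulated per-question data and
# illustrative versions of the chi-square and logistic-regression analyses.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_questions = 500

# Hypothetical per-question results: 1 = answered correctly, 0 = incorrect.
df = pd.DataFrame({
    "gpt35_correct": rng.integers(0, 2, n_questions),
    "gpt4_correct": rng.integers(0, 2, n_questions),
    "word_count": rng.integers(20, 200, n_questions),
})

# Chi-square test on the 2x2 table of model (GPT-3.5 vs GPT-4) by correctness.
counts = np.array([
    [df["gpt35_correct"].sum(), n_questions - df["gpt35_correct"].sum()],
    [df["gpt4_correct"].sum(), n_questions - df["gpt4_correct"].sum()],
])
chi2, p, _, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")

# Univariable logistic regression of GPT-3.5 correctness on question length.
# An odds ratio reported per +10 words corresponds to exp(10 * beta).
X = sm.add_constant(df["word_count"].astype(float))
fit = sm.Logit(df["gpt35_correct"], X).fit(disp=False)
print(f"OR per +10 words: {np.exp(10 * fit.params['word_count']):.2f}")
```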