Research Updates
Articles below are published ahead of final publication in an issue. Please cite articles in the following format: authors, (year), title, journal, DOI.


Effects of Expert-Determined Reference Standards in Evaluating the Diagnostic Performance of a Deep Learning Model: A Malignant Lung Nodule Detection Task on Chest Radiographs.

Publication date: 2023 Feb
Authors: Jung Eun Huh, Jong Hyuk Lee, Eui Jin Hwang, Chang Min Park
Source: KOREAN JOURNAL OF RADIOLOGY

Abstract:

Little is known about the effects of using different expert-determined reference standards when evaluating the performance of deep learning-based automatic detection (DLAD) models and their added value to radiologists. We assessed the concordance of expert-determined standards with a clinical gold standard (here, pathological confirmation), as well as the effects of different expert-determined reference standards on estimates of radiologists' diagnostic performance in detecting malignant pulmonary nodules on chest radiographs, with and without the assistance of a DLAD model.

This study included chest radiographs from 50 patients with pathologically proven lung cancer and 50 controls. Five expert-determined standards were constructed from the interpretations of 10 experts: individual judgment by the most experienced expert, majority vote, consensus judgments of two and three experts, and a latent class analysis (LCA) model. In separate reader tests, an additional 10 radiologists interpreted the radiographs independently and then with the assistance of the DLAD model. Their diagnostic performance was estimated using the clinical gold standard and the various expert-determined standards as reference standards, and the results were compared using the t test with Bonferroni correction.

The LCA model (sensitivity, 72.6%; specificity, 100%) was most similar to the clinical gold standard. When expert-determined standards were used, the sensitivities of the radiologists and of the DLAD model alone were overestimated, and their specificities were underestimated (all p-values < 0.05). DLAD assistance diminished the overestimation of sensitivity but exaggerated the underestimation of specificity (all p-values < 0.001).
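To make the evaluation setup concrete, the sketch below shows one of the expert-determined reference standards described above (majority vote over several experts' binary reads) and how sensitivity and specificity are then computed against that reference. All data and names here are illustrative placeholders, not values from the study.

```python
# Hypothetical sketch: build a majority-vote reference standard from
# several experts' binary reads (1 = malignant nodule present, 0 = absent),
# then score a reader's interpretations against it.
# The reads below are made up for illustration.

def majority_vote(reads):
    """Return 1 if a strict majority of experts called the case positive."""
    return 1 if sum(reads) * 2 > len(reads) else 0

def sensitivity_specificity(predictions, reference):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for p, r in zip(predictions, reference) if p == 1 and r == 1)
    fn = sum(1 for p, r in zip(predictions, reference) if p == 0 and r == 1)
    tn = sum(1 for p, r in zip(predictions, reference) if p == 0 and r == 0)
    fp = sum(1 for p, r in zip(predictions, reference) if p == 1 and r == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Three experts' reads on five radiographs (rows = experts, columns = cases).
expert_reads = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]

# Column-wise majority vote yields the expert-determined reference standard.
reference = [majority_vote(col) for col in zip(*expert_reads)]

# A hypothetical reader's independent interpretations of the same cases.
reader = [1, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(reader, reference)
```

As the study's results suggest, swapping `reference` for a pathology-confirmed gold standard on the same cases would generally change `sens` and `spec`, which is exactly the bias the authors quantify.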
The DLAD model improved sensitivity and specificity to a greater extent when the clinical gold standard was used than when the expert-determined standards were used (all p-values < 0.001), except for sensitivity with the LCA model (p = 0.094). The LCA model was most similar to the clinical gold standard for malignant pulmonary nodule detection on chest radiographs. Expert-determined standards caused bias in measuring the diagnostic performance of the artificial intelligence model.

Copyright © 2023 The Korean Society of Radiology.
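The multiple comparisons above are reported with Bonferroni correction. A minimal sketch of that adjustment, with made-up p-values (not the study's), is:

```python
# Minimal sketch of a Bonferroni correction for m simultaneous tests:
# each raw p-value is multiplied by m (capped at 1), or equivalently each
# test is judged against alpha / m. The p-values below are illustrative only.

def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values capped at 1.0, per-test significance flags)."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    significant = [p < alpha / m for p in p_values]
    return adjusted, significant

raw = [0.001, 0.012, 0.094, 0.04]   # hypothetical raw p-values
adj, sig = bonferroni(raw)
```

With four tests the per-test threshold drops to 0.05 / 4 = 0.0125, so a raw p-value of 0.04 that looks significant on its own no longer is after correction.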