Research Updates
Articles below are published ahead of final publication in an issue. Please cite articles in the following format: authors, (year), title, journal, DOI.

De Novo Natural Language Processing Algorithm Accurately Identifies Myxofibrosarcoma From Pathology Reports.

Published: 2024 Oct 02
Authors: Sarah E Lindsay, Cecelia J Madison, Duncan C Ramsey, Yee-Cheen Doung, Kenneth R Gundle
Source: Clinical Orthopaedics and Related Research

Abstract:

Available codes in the ICD-10 do not accurately reflect soft tissue sarcoma diagnoses, and this can result in an underrepresentation of soft tissue sarcoma in databases. The National VA Database provides a unique opportunity for soft tissue sarcoma investigation because of the availability of all clinical results and pathology reports. In the setting of soft tissue sarcoma, natural language processing (NLP) has the potential to be applied to clinical documents such as pathology reports to identify soft tissue sarcoma independent of ICD codes, allowing sarcoma researchers to build more comprehensive databases capable of answering a myriad of research questions.

(1) What proportion of patients with myxofibrosarcoma within the National VA Database would be missed by searching only by soft tissue sarcoma ICD codes? (2) Is a de novo NLP algorithm capable of analyzing pathology reports to accurately identify patients with myxofibrosarcoma?

All pathology reports (10.7 million) in the national VA corporate data warehouse were identified from 2003 to 2022. Using the word-search functionality, reports from 403 veterans were found to contain the term "myxofibrosarcoma." The resulting pathology reports were manually reviewed to develop a gold-standard cohort that contained only those veterans with pathologist-confirmed myxofibrosarcoma diagnoses. The cohort had a mean ± SD age of 70 ± 12 years, and 96% (287 of 300) were men. Diagnosis codes were abstracted, and differences in appropriate ICD coding were compared. An NLP algorithm was iteratively refined and tested using confounders, negation, and emphasis terms for myxofibrosarcoma. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated for the NLP-generated cohorts through comparison with the manually reviewed gold-standard cohorts.

The records of 27% (81 of 300) of myxofibrosarcoma patients within the VA database were missing a sarcoma ICD code. A de novo NLP algorithm more accurately (92% [276 of 300]) identified patients with myxofibrosarcoma compared with ICD codes (73% [219 of 300]) or basic word searches (74% [300 of 403]) (p < 0.001). Three final algorithm models were generated with accuracies ranging from 92% to 100%.

An NLP algorithm can identify patients with myxofibrosarcoma from pathology reports with high accuracy, which is an improvement over ICD-based cohort creation and simple word search. This algorithm is freely available on GitHub (https://github.com/sarcoma-shark/myxofibrosarcoma-shark) and is available to facilitate external validation and improvement through testing in other cohorts.

Level II, diagnostic study.

Copyright © 2024 by the Association of Bone and Joint Surgeons.
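
The abstract describes flagging reports that mention myxofibrosarcoma, handling negation, and scoring the result against a manually reviewed gold standard. The following is a minimal Python sketch of that general idea. It is not the published algorithm: the negation phrases, the function names `flags_myxofibrosarcoma` and `diagnostic_metrics`, and the sample report strings are all assumptions made for illustration, and the sketch does not handle confounders or emphasis terms.

```python
import re

# Hypothetical term lists for illustration only; NOT the published algorithm's terms.
TARGET = re.compile(r"myxofibrosarcoma", re.IGNORECASE)
NEGATED = re.compile(
    r"\b(no evidence of|negative for|rule out|not consistent with)\b[^.]*myxofibrosarcoma",
    re.IGNORECASE,
)


def flags_myxofibrosarcoma(report: str) -> bool:
    """Return True if any sentence mentions the target term without a negation cue before it."""
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        if TARGET.search(sentence) and not NEGATED.search(sentence):
            return True
    return False


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(predicted: list[bool], gold: list[bool]) -> dict[str, float]:
    """Compute sensitivity, specificity, PPV, NPV, and accuracy against gold-standard labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    tn = sum(1 for p, g in zip(predicted, gold) if not p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Made-up example reports and labels; not data from the study.
    reports = [
        "Findings are consistent with myxofibrosarcoma, high grade.",
        "No evidence of myxofibrosarcoma.",
        "Benign lipoma.",
        "Differential includes myxofibrosarcoma. Final diagnosis: myxoid liposarcoma.",
    ]
    gold = [True, False, False, False]
    predicted = [flags_myxofibrosarcoma(r) for r in reports]
    print(diagnostic_metrics(predicted, gold))
```

The fourth example report is deliberately a false positive: a differential-diagnosis mention is not negated, so a negation-only rule still flags it. This is the kind of confounder the abstract says the algorithm was iteratively refined to handle.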