Evolution of Breast Cancer Recurrence Risk Prediction: A Systematic Review of Statistical and Machine Learning-Based Models.
Publication date: Aug 2023
Authors:
Hasna El Haji, Amine Souadka, Bhavik N Patel, Nada Sbihi, Gokul Ramasamy, Bhavika K Patel, Mounir Ghogho, Imon Banerjee
Source:
MOLECULAR & CELLULAR PROTEOMICS
Abstract:
Selection of appropriate adjuvant therapy to ultimately reduce the risk of breast cancer (BC) recurrence is a challenge for medical oncologists. Several automated risk prediction models have been developed using retrospective clinical data and have evolved significantly over the years in terms of predictors of recurrence, data usage, and predictive techniques (statistical/machine learning [ML]).

Following PRISMA guidelines, we performed a systematic literature review of the aforementioned statistical and ML models published between January 2008 and December 2022 by searching five digital databases: PubMed, ScienceDirect, Scopus, Cochrane, and Web of Science. The comprehensive search yielded a total of 163 papers, and after a screening process focusing on papers that dealt exclusively with statistical/ML methods, only 23 papers were deemed appropriate for further analysis. We benchmarked the studies on the basis of development, evaluation metrics, and validation strategy, with an added emphasis on the racial diversity of patients included in the studies.

In total, 30.4% of the included studies use statistical techniques, while 69.6% are ML-based. Among these, traditional ML models (support vector machines, decision tree, logistic regression, and naïve Bayes) are the most frequently used (26.1%) along with deep learning (26.1%). Deep learning and ensemble learning provide the most accurate predictions (AUC = 0.94 each).

ML-based prediction models exhibit outstanding performance, yet their practical applicability might be hindered by limited interpretability and reduced generalization. Moreover, predictive models for BC recurrence often focus on limited variables related to tumor, treatment, molecular, and clinical features. Imbalanced classes and the lack of open-source data sets impede model development and validation. Furthermore, existing models predominantly overlook African and Middle Eastern populations, as they are trained and validated mainly on Caucasian and Asian patients.
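
The abstract reports performance as AUC and names imbalanced classes as an obstacle. The following is a minimal illustrative sketch, not taken from the reviewed studies, of why a ranking metric such as AUC is usually preferred over plain accuracy when recurrences are rare. The cohort, the scores, and the helper names `roc_auc` and `accuracy` are all hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases classified correctly at a fixed risk threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical imbalanced cohort: 10 patients, 2 recurrences (label 1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Model A (hypothetical): assigns every patient the same low risk.
constant_scores = [0.1] * 10

# Model B (hypothetical): ranks most recurrences above non-recurrences.
informative_scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.25, 0.6, 0.35, 0.45, 0.9]

for name, scores in [("constant", constant_scores),
                     ("informative", informative_scores)]:
    print(name, "accuracy:", accuracy(labels, scores),
          "AUC:", roc_auc(labels, scores))
```

In this toy cohort both models reach the same accuracy at a 0.5 threshold, yet only the second one separates recurrences from non-recurrences, which is what the AUC reflects.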