Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis.
Published: 2024 Oct 08
Authors:
Fatih Gurcan, Ahmet Soylu
Source:
Cancers
Abstract:
This study evaluates the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenge of class imbalance. Five datasets were analyzed: three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (SEER Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were applied, and ten classifiers from four distinct categories were compared. The results showed that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). Among classifiers, Random Forest performed best with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes. This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques into cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
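To make the hybrid resampling idea concrete, the following is a minimal sketch of a SMOTEENN-style procedure: SMOTE oversampling of the minority class followed by Edited Nearest Neighbours (ENN) cleaning. It is not the authors' implementation. The function names (`smote`, `edited_nearest_neighbours`, `smote_enn`), the toy two-dimensional data, the choice of `k=3`, and the majority-vote removal rule are all assumptions made for illustration; library implementations may use stricter agreement rules and other defaults.

```python
import math
import random
from collections import Counter


def k_nearest(point, candidates, k):
    """Return indices of the k candidates closest to `point` (Euclidean)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Add synthetic minority samples until the minority matches the largest class."""
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    if len(minority_pts) < 2:  # nothing to interpolate between
        return list(X), list(y)
    counts = Counter(y)
    n_new = max(counts.values()) - counts[minority]
    X_new, y_new = list(X), list(y)
    for _ in range(n_new):
        base = rng.choice(minority_pts)
        # The closest candidate is the base point itself, so skip it.
        neighbours = k_nearest(base, minority_pts, k + 1)[1:]
        mate = minority_pts[rng.choice(neighbours)]
        gap = rng.random()
        # The synthetic point lies on the segment between base and mate.
        X_new.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Drop samples whose label disagrees with the majority of their k neighbours."""
    keep = []
    for i, point in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(point, X[j]))[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if vote == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """Hybrid resampling: oversample with SMOTE, then clean both classes with ENN."""
    X_over, y_over = smote(X, y, minority, k=k, rng=random.Random(seed))
    return edited_nearest_neighbours(X_over, y_over, k=k)


# Toy data (hypothetical): a 6x6 grid of majority points near the origin,
# a small minority cluster near (10, 10), and one majority point placed
# inside the minority cluster to act as label noise.
majority = [(i * 0.5, j * 0.5) for i in range(6) for j in range(6)]
noise = [(10.25, 10.25)]
minority = [(10.0, 10.0), (10.5, 10.0), (10.0, 10.5),
            (10.5, 10.5), (11.0, 10.0), (10.0, 11.0)]
X = majority + noise + minority
y = [0] * (len(majority) + len(noise)) + [1] * len(minority)

X_res, y_res = smote_enn(X, y, minority=1)
print("before:", dict(Counter(y)))
print("after: ", dict(Counter(y_res)))
```

In this toy setup the minority class grows from 6 to 37 samples, and the ENN step removes the single noisy majority point, leaving 36 majority samples. On real clinical data, resampling should be applied to the training split only, so that evaluation reflects the original class distribution.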