Microarray Gene Expression Data Classification via Wilcoxon Sign Rank Sum and Novel Grey Wolf Optimized Ensemble Learning Models
Published: 2023 Aug 15
Authors:
Yakub K Saheed, Bukola F Balogun, Braimah J Odunayo, Abdulsalam Mustapha
Source:
Disease Models & Mechanisms
Abstract:
Cancer is a deadly disease that affects the lives of people all over the world. Identifying the few genes relevant to a specific cancer can lead to effective treatments. The difficulty with microarray datasets is their high dimensionality: they have a very large number of features relative to the small number of samples. Additionally, microarray data typically exhibit significant asymmetry in dimensionality as well as high levels of redundancy and noise. It is widely held that the majority of genes carry no informative value about the classes under study. Recent research has attempted to reduce this high dimensionality by employing various feature selection techniques. This paper presents new ensemble feature selection techniques based on the Wilcoxon Sign Rank Sum test (WCSRS) and Fisher's test (F-test). In the first phase of the experiment, data preprocessing was performed; feature selection was then carried out with the WCSRS and the F-test, using the p-values (probability values) of each test to identify cancerous genes. The extracted gene set was used to classify cancer patients with ensemble learning models (ELMs): random forest (RF), extreme gradient boosting (XGBoost), CatBoost, and AdaBoost. To boost the performance of the ELMs, we optimized the parameters of all of them using the Grey Wolf Optimizer (GWO). The experimental analysis was performed on a colon cancer dataset comprising 2000 genes from 62 patients (40 malignant and 22 benign). With the WCSRS test for feature selection, the optimized XGBoost achieved 100% accuracy. With the F-test for feature selection, the optimized CatBoost likewise achieved 100% accuracy. This represents a 15% improvement over previously reported values in the literature.
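The abstract does not give the GWO configuration used to tune the ensemble models, so the following is only a minimal sketch of a basic Grey Wolf Optimizer in pure Python, not the authors' implementation. The function names, the pack size, the iteration count, and the search bounds are illustrative assumptions. The toy objective is a hypothetical stand-in for a classifier's cross-validated error as a function of two hyperparameters (a learning rate and a tree depth); in practice it would be replaced by a function that trains and scores a model.

```python
import random


def grey_wolf_optimize(objective, bounds, n_wolves=12, n_iters=60, seed=0):
    """Minimize `objective` over the box `bounds` with a basic Grey Wolf Optimizer.

    `bounds` is a list of (low, high) pairs, one per parameter.
    `n_wolves` must be at least 3 (alpha, beta, and delta leaders).
    """
    rng = random.Random(seed)
    # Random initial pack inside the search box.
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best_pos, best_val = None, float("inf")

    for it in range(n_iters):
        # Rank the pack; the three best wolves act as alpha, beta, delta.
        scored = sorted(((objective(w), w) for w in wolves), key=lambda t: t[0])
        if scored[0][0] < best_val:
            best_val, best_pos = scored[0][0], list(scored[0][1])
        leaders = [list(w) for _, w in scored[:3]]
        a = 2.0 * (1.0 - it / n_iters)  # decays linearly from 2 toward 0

        new_wolves = []
        for w in wolves:
            pos = []
            for j, (lo, hi) in enumerate(bounds):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    dist = abs(C * leader[j] - w[j])
                    estimate += leader[j] - A * dist
                # Average the three leader-guided moves, then clip to the box.
                pos.append(min(max(estimate / 3.0, lo), hi))
            new_wolves.append(pos)
        wolves = new_wolves

    # Score the final pack so the last update is not ignored.
    for w in wolves:
        val = objective(w)
        if val < best_val:
            best_val, best_pos = val, list(w)
    return best_pos, best_val


def toy_validation_error(params):
    """Hypothetical stand-in for cross-validated error; minimum at (0.1, 6)."""
    learning_rate, max_depth = params
    return (learning_rate - 0.1) ** 2 + ((max_depth - 6.0) / 10.0) ** 2


# Assumed search ranges: learning rate in [0.01, 0.5], tree depth in [2, 12].
EXAMPLE_BOUNDS = [(0.01, 0.5), (2.0, 12.0)]

if __name__ == "__main__":
    position, error = grey_wolf_optimize(toy_validation_error, EXAMPLE_BOUNDS)
    print("best parameters:", position, "error:", error)
```

For integer-valued hyperparameters such as tree depth, one common choice is to round the continuous position inside the objective before training the model. With only 62 samples, the objective would normally be a cross-validated score rather than a single train/test split.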