A Bayesian latent class extension of naive Bayesian classifier and its application to the classification of gastric cancer patients.
Published: 2023 Aug 21
Authors:
Kimiya Gohari, Anoshirvan Kazemnejad, Marjan Mohammadi, Farzad Eskandari, Samaneh Saberi, Maryam Esmaieli, Ali Sheidaei
Source:
BMC Medical Research Methodology
Abstract:
The Naive Bayes (NB) classifier is a powerful supervised algorithm widely used in Machine Learning (ML). However, its effectiveness relies on a strict assumption of conditional independence, which is often violated in real-world data. To address this limitation, various studies have explored extensions of NB that handle violations of conditional independence. These approaches fall into two main categories: feature selection and structure extension. In this study, we propose a novel approach that enhances NB by introducing a latent variable as the parent of the attributes. We define this latent variable using a flexible technique called Bayesian Latent Class Analysis (BLCA). The resulting model combines the strengths of NB and BLCA, and we refer to it as NB-BLCA. By incorporating the latent variable, we aim to capture complex dependencies among the attributes and improve the overall performance of the classifier. Both the Expectation-Maximization (EM) algorithm and Gibbs sampling were offered for parameter learning. A simulation study was conducted to compare the classification performance of the model with that of the ordinary NB model. In addition, real-world data on 976 Gastric Cancer (GC) and 1189 Non-ulcer dyspepsia (NUD) patients were used to show the model's performance in an actual application. The validity of the models was evaluated using 10-fold cross-validation. The presented model was superior to ordinary NB in all simulation scenarios, with higher classification sensitivity and specificity in the test data. The accuracy of the NB-BLCA model using Gibbs sampling was 87.77 (95% CI: 84.87-90.29), compared with 77.22 (95% CI: 73.64-80.53) for the NB-BLCA model using the EM algorithm and 74.71 (95% CI: 71.02-78.15) for the ordinary NB classifier. Among possible modifications of the NB classifier, incorporating a latent component into the model offers numerous advantages, particularly in medical and health-related contexts. It lets researchers bypass the extensive search algorithms and structure learning required by the local learning and structure extension approaches. The inclusion of latent class variables allows all attributes to be integrated during model construction. Consequently, the NB-BLCA model is a suitable alternative to conventional NB classifiers when the assumption of independence is violated, especially in domains pertaining to health and medicine.
© 2023. BioMed Central Ltd., part of Springer Nature.
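The idea of adding a latent parent to the attributes can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: attributes are binary, a separate latent class model is fitted within each observed class, parameters are learned with EM only (no Gibbs sampling and no priors beyond light smoothing), and the data are synthetic. All function names (`log_joint`, `fit_latent_class`, `fit_nb_latent`, `predict`, `simulate`) are hypothetical.

```python
import numpy as np

def log_joint(X, weights, probs):
    """log P(x_i, L = l) for every row i and latent class l."""
    return np.log(weights) + X @ np.log(probs).T + (1 - X) @ np.log(1 - probs).T

def fit_latent_class(X, n_latent, n_iter=100, seed=0):
    """EM for a latent class model: a mixture of independent Bernoullis."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(n_latent, 1.0 / n_latent)          # P(L = l)
    probs = rng.uniform(0.25, 0.75, size=(n_latent, d))  # P(X_j = 1 | L = l)
    for _ in range(n_iter):
        # E-step: responsibilities P(L = l | x_i), computed in log space.
        lj = log_joint(X, weights, probs)
        lj -= lj.max(axis=1, keepdims=True)
        resp = np.exp(lj)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: smoothed updates keep every probability strictly inside (0, 1).
        counts = resp.sum(axis=0)
        weights = (counts + 1.0) / (n + n_latent)
        probs = (resp.T @ X + 1.0) / (counts[:, None] + 2.0)
    return weights, probs

def fit_nb_latent(X, y, n_latent):
    """Class priors plus one latent class model per observed class (assumption)."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    params = [fit_latent_class(X[y == c], n_latent) for c in classes]
    return classes, priors, params

def predict(model, X):
    """Pick the class maximizing log P(c) + log sum_l P(l | c) prod_j P(x_j | c, l)."""
    classes, priors, params = model
    scores = []
    for prior, (weights, probs) in zip(priors, params):
        lj = log_joint(X, weights, probs)
        m = lj.max(axis=1, keepdims=True)
        log_marginal = (m + np.log(np.exp(lj - m).sum(axis=1, keepdims=True))).ravel()
        scores.append(np.log(prior) + log_marginal)
    return classes[np.column_stack(scores).argmax(axis=1)]

def simulate(n, rng):
    """Synthetic data whose attributes are dependent within each class."""
    protos = np.array([[[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]],   # class 0 subgroups
                       [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]]])  # class 1 subgroups
    y = rng.integers(0, 2, size=n)
    sub = rng.integers(0, 2, size=n)
    base = protos[y, sub]
    flips = rng.random(base.shape) < 0.1  # each attribute is flipped with probability 0.1
    return np.where(flips, 1 - base, base).astype(float), y

rng = np.random.default_rng(42)
X_train, y_train = simulate(2000, rng)
X_test, y_test = simulate(1000, rng)

# One latent class is the same as ordinary Bernoulli NB with Laplace smoothing.
acc_nb = (predict(fit_nb_latent(X_train, y_train, 1), X_test) == y_test).mean()
acc_latent = (predict(fit_nb_latent(X_train, y_train, 2), X_test) == y_test).mean()
print(f"ordinary NB accuracy:      {acc_nb:.3f}")
print(f"NB + latent class accuracy: {acc_latent:.3f}")
```

In this synthetic setup every attribute has the same marginal distribution in both classes, so the class signal lies entirely in the dependencies between attributes. Ordinary NB cannot use that signal, while summing over the latent variable can. The sketch says nothing about how the model behaves on the gastric cancer data described in the abstract.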