Gene regulatory network inference based on novel ensemble method
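Impact factor: 2.5
Journal partition: Biology, Zone 4 / Biotechnology and Applied Microbiology, Zone 4; Genetics, Zone 4
Publication date: 2024 Dec 06
Authors: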
Bin Yang, Jing Li, Xiang Li, Sanrong Liu
DOI:
10.1093/bfgp/elae036
Abstract
Gene regulatory networks (GRNs) help in understanding gene function, the development of cancer, and the impact of key genes on disease. This study therefore proposes an ensemble method that combines 13 base classification methods with a flexible neural tree (FNT) to improve the accuracy of GRN identification. The base classifiers are ridge classification, stochastic gradient descent, Gaussian process classification, Bernoulli naive Bayes, adaptive boosting, gradient boosting decision tree, histogram-based gradient boosting classification, eXtreme gradient boosting (XGBoost), multilayer perceptron, light gradient boosting machine, random forest, support vector machine, and k-nearest neighbors; together they form the input variable set of the FNT model. In addition, a hybrid evolutionary algorithm based on a gene programming variant and particle swarm optimization is developed to search for the optimal FNT model. Experiments on three simulated datasets and three real single-cell RNA-seq datasets show that the proposed ensemble feature outperforms 13 supervised algorithms, seven unsupervised algorithms (ARACNE, CLR, GENIE3, MRNET, PCACMI, GENECI, and EPCACMI), and four single-cell-specific methods (SCODE, BiRGRN, LEAP, and BiGBoost) in terms of the area under the receiver operating characteristic curve, the area under the precision-recall curve, and the F1 score.
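The evaluation setup described in the abstract (scores from several base classifiers are combined, then compared by AUROC, AUPR, and F1) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the six candidate regulator-target edges, their labels, the three base classifiers `clf_a`, `clf_b`, `clf_c`, and their scores are invented, and the plain mean used as the combiner is only a stand-in for the FNT model, which the abstract does not specify in detail. AUPR is approximated by step-wise average precision.

```python
# Toy sketch: all labels and scores below are made up for illustration.
# The mean combiner is a simple stand-in for the paper's FNT model.

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Assumes at least one positive label; tied scores keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits

def f1(labels, scores, threshold=0.5):
    """F1 score after calling an edge 'present' when score >= threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# 1 = true regulatory edge, 0 = no edge (hypothetical ground truth).
labels = [1, 1, 0, 1, 0, 0]

# Hypothetical per-edge scores from three base classifiers.
base_scores = {
    "clf_a": [0.9, 0.4, 0.6, 0.7, 0.2, 0.1],
    "clf_b": [0.8, 0.7, 0.3, 0.3, 0.5, 0.2],
    "clf_c": [0.6, 0.8, 0.4, 0.8, 0.1, 0.6],
}

# Stand-in combiner: average the base scores edge by edge.
ensemble = [sum(col) / len(col) for col in zip(*base_scores.values())]

for name, scores in {**base_scores, "ensemble_mean": ensemble}.items():
    print(f"{name}: AUROC={auroc(labels, scores):.3f} "
          f"AUPR={average_precision(labels, scores):.3f} "
          f"F1={f1(labels, scores):.3f}")
```

The toy numbers say nothing about the paper's actual results; they only show how the three metrics are computed from per-edge scores. In practice a library implementation (for example, scikit-learn's metric functions) would normally replace the hand-written versions.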