Hybrid Causal Feature Selection for Cancer Biomarker Identification from RNA-seq Data.

Published: 2024 May 29

Authors: Wenwei Xu, Hao Zhang, Yewei Xia, Yixin Ren, Jihong Guan, Shuigeng Zhou

Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics

Abstract:

The discovery of cancer biomarkers helps advance medical diagnosis and plays an important role in biomedical applications. Most existing data-driven methods identify biomarkers with ranking-based strategies, which generally return a subset or superset of the actual biomarkers. Other, causal feature selection methods are based on Markov blanket (MB) learning and struggle with high-dimensional, low-sample data. In this work, we propose a novel hybrid causal feature selection method, called CAFES, to support large-scale cancer biomarker discovery from real RNA-seq data. Concretely, CAFES first applies a minimal-redundancy and maximal-relevance strategy for dimensionality reduction, which returns a set of candidate features. CAFES then learns the causal skeleton over those features with conditional independence (CI) tests and derives an appropriate superset of the MB of the target variable. Finally, CAFES learns the causal structure of this superset with the DAG-GNN algorithm and obtains the MB of the target variable, whose members can be treated as the cancer biomarkers. We evaluate the proposed method on two well-known real RNA-seq datasets that cover both binary and multi-class cases, and compare CAFES with seven recent methods: Semi-HITON-MB, STMB, BAMB, FBED, LCS-FS, EEMB, and EAMB. The results show that CAFES can identify dozens of cancer biomarkers, and that between 1/6 and 1/2 of the discovered biomarkers are confirmed by existing studies to be directly related to the corresponding disease. An advantage of CAFES is that its recall is significantly higher than that of all its counterparts. This indicates that continuous optimization (DAG-GNN), constrained by the causal skeleton returned after feature selection (which can be treated as a conditional-independence-based constraint on the optimization problem), is effective for cancer biomarker identification from high-dimensional, low-sample RNA-seq data.
The source code of CAFES is available at https://github.com/Milkteaww/CFS.
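
To illustrate the final step described in the abstract, the sketch below shows how a Markov blanket can be read off a learned DAG: it is the union of the target's parents, its children, and the other parents of those children (spouses). This is a minimal illustration under stated assumptions, not the CAFES implementation. The function name `markov_blanket`, the adjacency-matrix convention, and the toy graph with genes `g0` to `g3` are all hypothetical.

```python
from typing import List, Set


def markov_blanket(adj: List[List[int]], target: int) -> Set[int]:
    """Return the Markov blanket of `target` in a DAG.

    Assumed convention (not taken from the paper): adj[i][j] == 1
    means there is a directed edge i -> j.
    """
    n = len(adj)
    # Parents: nodes with an edge into the target.
    parents = {i for i in range(n) if adj[i][target]}
    # Children: nodes the target has an edge into.
    children = {j for j in range(n) if adj[target][j]}
    # Spouses: other parents of the target's children.
    spouses = {
        i for c in children for i in range(n)
        if adj[i][c] and i != target
    }
    return parents | children | spouses


# Hypothetical toy DAG: g0 -> y, y -> g1, g2 -> g1, g3 -> g2.
names = ["g0", "g1", "g2", "g3", "y"]
adj = [[0] * 5 for _ in range(5)]
adj[0][4] = 1  # g0 -> y  (parent of y)
adj[4][1] = 1  # y  -> g1 (child of y)
adj[2][1] = 1  # g2 -> g1 (spouse of y via g1)
adj[3][2] = 1  # g3 -> g2 (outside the blanket)

mb = markov_blanket(adj, target=4)
print(sorted(names[i] for i in mb))  # ['g0', 'g1', 'g2']
```

In this toy graph `g3` is excluded because it influences `y` only through `g2`, which is already in the blanket. This is why an MB-based selector can return a compact feature set rather than a ranked list.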