Research Updates

Comparative analysis of integrative classification methods for multi-omics data.

Published: May 23, 2024
Authors: Alexei Novoloaca, Camilo Broc, Laurent Beloeil, Wen-Han Yu, Jérémie Becker
Source: Briefings in Bioinformatics

Abstract:

Recent advances in sequencing, mass spectrometry, and cytometry technologies have enabled researchers to collect multiple 'omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil the mechanisms underlying disease etiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature, and no gold standard has yet emerged. In this work, we present a thorough comparison of six methods representative of the main families of intermediate integrative approaches (matrix factorization, multiple kernel methods, ensemble learning, and graph-based methods). As a non-integrative control, random forest was applied to concatenated and separate data types. Methods were evaluated for classification performance on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities. A total of 15 simulation scenarios were designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size). On real data, the method comparison showed that integrative approaches performed better than or as well as their non-integrative counterparts. By contrast, DIABLO and the four random forest alternatives outperformed the others in the majority of simulation scenarios. The strengths and limitations of these methods are discussed in detail, along with guidelines for future applications. © The Author(s) 2024. Published by Oxford University Press.
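
The non-integrative control mentioned in the abstract (random forest on concatenated versus separate data types) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes scikit-learn and NumPy, and the helper names (`simulate_blocks`, `rf_concatenated`, `rf_separate`), the two-block Gaussian simulation, and all parameter values are hypothetical choices made only for illustration. The simulation exposes the kinds of parameters the abstract lists (sample size, dimensionality, class imbalance, effect size).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def simulate_blocks(n_samples=200, dims=(50, 30), effect_size=1.0,
                    minority_frac=0.3, n_informative=5, seed=0):
    """Simulate hypothetical omics blocks that share one binary label."""
    rng = np.random.default_rng(seed)
    # Class imbalance: roughly minority_frac of the samples get label 1.
    y = (rng.random(n_samples) < minority_frac).astype(int)
    blocks = []
    for p in dims:
        X = rng.normal(size=(n_samples, p))
        # Shift the first few features of class 1 by the effect size.
        X[y == 1, :n_informative] += effect_size
        blocks.append(X)
    return blocks, y


def rf_concatenated(blocks_train, y_train, blocks_test, seed=0):
    """One forest fitted on the column-wise concatenation of all blocks."""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(np.hstack(blocks_train), y_train)
    return rf.predict(np.hstack(blocks_test))


def rf_separate(blocks_train, y_train, blocks_test, seed=0):
    """One forest per block; returns a list of per-block predictions."""
    preds = []
    for X_tr, X_te in zip(blocks_train, blocks_test):
        rf = RandomForestClassifier(n_estimators=200, random_state=seed)
        rf.fit(X_tr, y_train)
        preds.append(rf.predict(X_te))
    return preds


blocks, y = simulate_blocks()
# Split sample indices once so every block uses the same train/test samples.
idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.3, stratify=y, random_state=0)
train = [X[idx_train] for X in blocks]
test = [X[idx_test] for X in blocks]

pred_concat = rf_concatenated(train, y[idx_train], test)
pred_sep = rf_separate(train, y[idx_train], test)
acc_concat = float(np.mean(pred_concat == y[idx_test]))
acc_sep = [float(np.mean(p == y[idx_test])) for p in pred_sep]
print("concatenated accuracy:", acc_concat)
print("per-block accuracy:", acc_sep)
```

Changing `n_samples`, `dims`, `minority_frac`, or `effect_size` in `simulate_blocks` gives a rough feel for how such baselines respond to the simulation parameters. Plain accuracy is used here only for brevity and can be misleading under class imbalance.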