
Does combining numerous data types in multi-omics data improve or hinder performance in survival prediction? Insights from a large-scale benchmark study.

Published: 2024 Sep 02
Authors: Yingxia Li, Tobias Herold, Ulrich Mansmann, Roman Hornung
Source: MEDICINE & SCIENCE IN SPORTS & EXERCISE

Abstract:

Predictive modeling based on multi-omics data, which incorporates several types of omics data for the same patients, has shown potential to outperform single-omics predictive modeling. Most research in this domain focuses on incorporating numerous data types, despite the complexity and cost of acquiring them. The prevailing assumption is that increasing the number of data types necessarily improves predictive performance. However, the integration of less informative or redundant data types could potentially hinder this performance. Therefore, identifying the most effective combinations of omics data types that enhance predictive performance is critical for cost-effective and accurate predictions.

In this study, we systematically evaluated the predictive performance of all 31 possible combinations including at least one of five genomic data types (mRNA, miRNA, methylation, DNAseq, and copy number variation) using 14 cancer datasets with right-censored survival outcomes, publicly available from the TCGA database. We employed various prediction methods and up-weighted clinical data in every model to leverage their predictive importance. Harrell's C-index and the integrated Brier Score were used as performance measures. To assess the robustness of our findings, we performed a bootstrap analysis at the level of the included datasets. Statistical testing was conducted for key results, limiting the number of tests to ensure a low risk of false positives.

Contrary to expectations, we found that using only mRNA data or a combination of mRNA and miRNA data was sufficient for most cancer types. For some cancer types, the additional inclusion of methylation data led to improved prediction results. Far from enhancing performance, the introduction of more data types most often resulted in a decline in performance, which varied between the two performance measures.

Our findings challenge the prevailing notion that combining multiple omics data types in multi-omics survival prediction improves predictive performance. Thus, the widespread approach in multi-omics prediction of incorporating as many data types as possible should be reconsidered to avoid suboptimal prediction results and unnecessary expenditure.

© 2024. The Author(s).
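
As an illustration of two quantities named in the abstract, the following minimal Python sketch enumerates the non-empty subsets of the five data types (which gives the 31 combinations) and computes Harrell's C-index for right-censored outcomes. It is not the authors' code: the names `DATA_TYPES`, `all_combinations`, and `harrell_c_index` are hypothetical, the toy inputs are made up, and the C-index shown is a plain pairwise version that ignores tied survival times.

```python
from itertools import combinations

# Data-type labels as named in the abstract ("CNV" = copy number variation).
DATA_TYPES = ["mRNA", "miRNA", "methylation", "DNAseq", "CNV"]


def all_combinations(types):
    """Return every non-empty subset of `types`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(types) + 1)
        for combo in combinations(types, size)
    ]


def harrell_c_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (simple O(n^2) version).

    A pair (i, j) is comparable when subject i has the shorter time and
    an observed event. The pair is concordant when subject i also has the
    higher risk score; ties in risk score count as 0.5.
    Assumption: pairs with tied survival times are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


combos = all_combinations(DATA_TYPES)
print(len(combos))  # 2**5 - 1 = 31 non-empty subsets

# Toy data (made up): the third subject is censored.
print(harrell_c_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0]))
```

A C-index of 1.0 means every comparable pair is ranked correctly and 0.5 corresponds to random ranking. In a benchmark of the kind described, a score like this would be computed on held-out data for each of the 31 combinations and each dataset.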