Comprehensive assessment of machine learning methods for diagnosing gastrointestinal diseases through whole metagenome sequencing data.
Publication date: 2024
Authors:
Sungho Lee, Insuk Lee
Source:
Disease Models & Mechanisms
Abstract:
The gut microbiome, linked significantly to host diseases, offers potential for disease diagnosis through machine learning (ML) pipelines. These pipelines, crucial in modeling diseases using high-dimensional microbiome data, involve selecting profile modalities, data preprocessing techniques, and classification algorithms, each impacting the model accuracy and generalizability. Despite whole metagenome shotgun sequencing (WMS) gaining popularity for human gut microbiome profiling, a consensus on the optimal methods for ML pipelines in disease diagnosis using WMS data remains elusive. Addressing this gap, we comprehensively evaluated ML methods for diagnosing Crohn's disease and colorectal cancer, using 2,553 fecal WMS samples from 21 case-control studies. Our study uncovered crucial insights: gut-specific, species-level taxonomic features proved to be the most effective for profiling; batch correction was not consistently beneficial for model performance; compositional data transformations markedly improved the models; and while nonlinear ensemble classification algorithms typically offered superior performance, linear models with proper regularization were found to be more effective for diseases that are linearly separable based on microbiome data. An optimal ML pipeline, integrating the most effective methods, was validated for generalizability using holdout data. This research offers practical guidelines for constructing reliable disease diagnostic ML models with fecal WMS data.
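The abstract reports that compositional data transformations improved the models but does not name a specific transform here. As an illustration only, the sketch below applies the centered log-ratio (CLR) transform, one commonly used compositional transformation, to a single relative-abundance profile. The function name `clr_transform`, the pseudocount value, and the toy abundances are assumptions made for this example and are not taken from the study.

```python
import math


def clr_transform(sample, pseudocount=1e-6):
    """Centered log-ratio transform of one abundance profile.

    `sample` is a list of non-negative abundances for one stool sample.
    The pseudocount (an assumed value) avoids log(0) for absent species.
    """
    # Shift by the pseudocount, then re-close the composition to sum to 1.
    shifted = [value + pseudocount for value in sample]
    total = sum(shifted)
    logs = [math.log(value / total) for value in shifted]
    # Subtracting the mean log is equivalent to dividing each part
    # by the geometric mean of the composition.
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical species-level relative abundances for one sample.
profile = [0.5, 0.3, 0.2, 0.0]
clr = clr_transform(profile)
print(clr)
```

CLR values for a sample sum to zero and preserve the ordering of the original abundances, so features end up on an unconstrained scale that classifiers such as regularized linear models can use directly.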