Fast, accurate, and racially unbiased pan-cancer tumor-only variant calling with tabular machine learning.
Published: 2023 Jan 07
Authors:
R Tyler McLaughlin, Maansi Asthana, Marc Di Meo, Michele Ceccarelli, Howard J Jacob, David L Masica
Source:
npj Precision Oncology
Abstract:
Accurately identifying somatic mutations is essential for precision oncology and crucial for calculating tumor-mutational burden (TMB), an important predictor of response to immunotherapy. For tumor-only variant calling (i.e., when the cancer biopsy but not the patient's normal tissue sample is sequenced), accurately distinguishing somatic mutations from germline variants is a challenging problem that, when unaddressed, results in unreliable, biased, and inflated TMB estimates. Here, we apply machine learning to the task of somatic vs. germline classification in tumor-only solid tumor samples using TabNet, XGBoost, and LightGBM, three machine-learning models for tabular data. We constructed a training set for supervised classification using features derived exclusively from tumor-only variant calling, drawing somatic and germline truth labels from an independent pipeline that uses the patient-matched normal samples. All three trained models achieved state-of-the-art performance on two holdout test datasets: a TCGA dataset including sarcoma, breast adenocarcinoma, and endometrial carcinoma samples (AUC > 94%), and a metastatic melanoma dataset (AUC > 85%). Concordance between matched-normal and tumor-only TMB improves from R² = 0.006 to 0.71-0.76 with the addition of a machine-learning classifier, with LightGBM performing best. Notably, these machine-learning models generalize across cancer subtypes and capture kits with a call rate of 100%. We reproduce the recent finding that tumor-only TMB estimates for Black patients are extremely inflated relative to those for white patients due to the racial biases of germline databases. We show that our approach with XGBoost and LightGBM eliminates this significant racial bias in tumor-only variant calling. © 2023. The Author(s).
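The TMB concordance reported in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that a classifier has already produced a somatic probability for each tumor-only variant, that TMB is the count of variants called somatic divided by the size of the sequenced region in megabases, and that R² is the squared Pearson correlation between per-sample TMB values. The function names, the 0.5 threshold, and the `panel_size_mb` parameter are hypothetical choices made for illustration.

```python
def tumor_mutational_burden(somatic_probs, panel_size_mb, threshold=0.5):
    """Return mutations per megabase for one sample.

    somatic_probs: per-variant somatic probabilities from some classifier
                   (assumed input; the abstract does not specify this format).
    panel_size_mb: size of the sequenced region in megabases (hypothetical name).
    threshold:     probability cut-off for calling a variant somatic (assumed 0.5).
    """
    if panel_size_mb <= 0:
        raise ValueError("panel_size_mb must be positive")
    n_somatic = sum(1 for p in somatic_probs if p >= threshold)
    return n_somatic / panel_size_mb


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Made-up toy values, not data from the study.
    tumor_only_probs = [
        [0.9, 0.2, 0.7, 0.4],   # sample A
        [0.8, 0.95, 0.6, 0.7],  # sample B
        [0.1, 0.3, 0.55, 0.2],  # sample C
    ]
    matched_normal_tmb = [1.0, 2.0, 0.5]  # assumed reference TMB per sample
    tumor_only_tmb = [
        tumor_mutational_burden(p, panel_size_mb=2.0) for p in tumor_only_probs
    ]
    print("tumor-only TMB:", tumor_only_tmb)
    print("R^2 vs matched-normal:", r_squared(matched_normal_tmb, tumor_only_tmb))
```

Under these assumptions, germline variants that slip through the classifier inflate `n_somatic` and therefore the TMB estimate, which is the failure mode the abstract describes for unfiltered tumor-only calling.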