Transformer-based representation learning and multiple-instance learning for cancer diagnosis exclusively from raw sequencing fragments of bisulfite-treated plasma cell-free DNA

Molecular Oncology

Impact factor: 4.5

Journal ranking: Medicine, Zone 2 / Oncology, Zone 3

Publication date: Nov 2024

Authors: Jilei Liu, Hongru Shen, Yichen Yang, Meng Yang, Qiang Zhang, Kexin Chen, Xiangchun Li

DOI: 10.1002/1878-0261.13745

Abstract

Early cancer diagnosis from bisulfite-treated cell-free DNA (cfDNA) fragments requires tedious data analytical procedures. Here, we present a deep-learning-based approach for early cancer interception and diagnosis (DECIDIA) that achieves accurate cancer diagnosis exclusively from bisulfite-treated cfDNA sequencing fragments. DECIDIA relies on transformer-based representation learning of DNA fragments and weakly supervised multiple-instance learning for classification. We systematically evaluate the performance of DECIDIA for cancer diagnosis and cancer type prediction on a curated dataset of 5389 samples comprising colorectal cancer (CRC; n = 1574), hepatocellular carcinoma (HCC; n = 1181), lung cancer (n = 654), and non-cancer controls (n = 1980). In differentiating cancer patients from cancer-free controls, DECIDIA achieved an area under the receiver operating characteristic curve (AUROC) of 0.980 (95% CI, 0.976-0.984) in 10-fold cross-validation on the CRC dataset, outperforming benchmarked methods that are based on methylation intensities. Notably, DECIDIA achieved an AUROC of 0.910 (95% CI, 0.896-0.924) on the external, independent HCC test set in distinguishing HCC patients from cancer-free controls, even though no HCC data were used in model development. For cancer-type classification, DECIDIA achieved a micro-average AUROC of 0.963 (95% CI, 0.960-0.966) and an overall accuracy of 82.8% (95% CI, 81.8-83.9). In addition, we distilled four sequence signatures from the raw sequencing reads that exhibited differential patterns between cancer and control samples and among different cancer types. Our approach represents a new paradigm toward eliminating the tedious data analytical procedures for liquid biopsy based on the bisulfite-treated cfDNA methylome.
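The abstract names weakly supervised multiple-instance learning but does not say how per-fragment representations are combined into one sample-level prediction. The sketch below illustrates one common choice, attention-based pooling, and is not claimed to be the authors' implementation. Each plasma sample is treated as a "bag" of fragment embeddings, and only the bag carries a label. The function names, the embedding size, and all weights are hypothetical; in a real system the embeddings would come from the transformer encoder and the weights would be learned.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_mil_pool(fragment_embeddings, attn_vector):
    """Pool a bag of fragment embeddings into one sample-level embedding.

    Each fragment gets a scalar score (attn_vector . embedding); the softmax
    of those scores weights the fragments in the pooled average.
    """
    scores = [dot(attn_vector, h) for h in fragment_embeddings]
    weights = softmax(scores)
    dim = len(fragment_embeddings[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, fragment_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


def predict_cancer_probability(fragment_embeddings, attn_vector,
                               clf_weights, clf_bias):
    """Bag-level logistic classifier on top of the pooled embedding."""
    pooled, weights = attention_mil_pool(fragment_embeddings, attn_vector)
    logit = dot(clf_weights, pooled) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit)), weights


if __name__ == "__main__":
    # Hypothetical data: 6 fragments, each with a 4-dimensional embedding.
    rng = random.Random(0)
    bag = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
    # Hypothetical (untrained) parameters, chosen only for illustration.
    attn = [0.5, -0.2, 0.1, 0.3]
    clf_w = [0.4, 0.1, -0.3, 0.2]
    prob, weights = predict_cancer_probability(bag, attn, clf_w, 0.0)
    print("sample-level probability:", prob)
    print("per-fragment attention weights:", weights)
```

Because only the sample-level label enters the loss during training, the attention weights are learned without any fragment-level annotation, which is what makes this kind of setup weakly supervised.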