Large language model produces high accurate diagnosis of cancer from end-motif profiles of cell-free DNA
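Impact factor: 7.7
Journal tier: Biology, tier 2 / Mathematical and Computational Biology, tier 1; Biochemical Research Methods, tier 2
Publication date: 2024 Jul 25
Authors: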
Jilei Liu, Hongru Shen, Kexin Chen, Xiangchun Li
DOI:
10.1093/bib/bbae430
Abstract
Instruction-tuned large language models (LLMs) demonstrate exceptional ability to align with human intentions. We present an LLM-based model, instruction-tuned LLM for assessment of cancer (iLLMAC), that can detect cancer using cell-free deoxyribonucleic acid (cfDNA) end-motif profiles. Developed on plasma cfDNA sequencing data from 1135 cancer patients and 1106 controls across three datasets, iLLMAC achieved an area under the receiver operating characteristic curve (AUROC) of 0.866 [95% confidence interval (CI), 0.773-0.959] for cancer diagnosis and 0.924 (95% CI, 0.841-1.0) for hepatocellular carcinoma (HCC) detection using 16 end-motifs. Performance increased with more motifs, reaching 0.886 (95% CI, 0.794-0.977) for cancer diagnosis and 0.956 (95% CI, 0.89-1.0) for HCC detection with 64 end-motifs. On an external-testing set, iLLMAC achieved an AUROC of 0.912 (95% CI, 0.849-0.976) for cancer diagnosis and 0.938 (95% CI, 0.885-0.992) for HCC detection with 64 end-motifs, significantly outperforming benchmarked methods. Furthermore, iLLMAC achieved high classification performance on datasets with bisulfite and 5-hydroxymethylcytosine sequencing. Our study highlights the effectiveness of LLM-based instruction-tuning for cfDNA-based cancer detection.
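To make the notion of an end-motif profile concrete, the sketch below counts the k-mers found at the 5' ends of a set of fragments and normalizes them to relative frequencies. It is a minimal illustration, not the authors' pipeline: the function name `end_motif_profile`, the default `k=4`, and the toy sequences are all assumptions, and the abstract does not state how the 16 or 64 end-motifs used by iLLMAC were defined or selected. The sketch also assumes each input string is already oriented 5' to 3' from the fragment end.

```python
from collections import Counter
from itertools import product


def end_motif_profile(fragments, k=4):
    """Return the relative frequency of each k-mer found at fragment 5' ends.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Enumerate all 4**k possible motifs so every profile has the same keys.
    motifs = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for seq in fragments:
        motif = seq[:k].upper()
        # Skip fragments that are too short or contain ambiguous bases (e.g. N).
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        # No usable fragment ends: return an all-zero profile.
        return {m: 0.0 for m in motifs}
    return {m: counts[m] / total for m in motifs}


# Toy input: made-up fragment sequences, not real cfDNA reads.
toy_fragments = ["CCCAGGT", "CCCATTA", "CCTGAAC", "AAAATTT", "NNCAGGT", "CC"]
profile = end_motif_profile(toy_fragments, k=4)

# Show the most frequent motifs in the toy profile.
top = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

A profile of this kind is a fixed-length vector of frequencies per sample. How such a vector is turned into an instruction prompt for the LLM is not described in the abstract and is not sketched here.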