A deep learning approach for overall survival prediction in lung cancer with missing values.

Publication date: 2024 Jun 28

Authors: Camillo Maria Caruso, Valerio Guarrasi, Sara Ramella, Paolo Soda

Source: Comput Meth Prog Bio

Abstract:

In the field of lung cancer research, particularly in the analysis of overall survival (OS), artificial intelligence (AI) serves crucial roles with specific aims. Given the prevalent issue of missing data in the medical domain, our primary objective is to develop an AI model capable of dynamically handling this missing data. Additionally, we aim to leverage all accessible data, effectively analyzing both uncensored patients who have experienced the event of interest and censored patients who have not, by embedding a specialized technique within our AI model, not commonly utilized in other AI tasks. Through the realization of these objectives, our model aims to provide precise OS predictions for non-small cell lung cancer (NSCLC) patients, thus overcoming these significant challenges.

We present a novel approach to survival analysis with missing values in the context of NSCLC, which exploits the strengths of the transformer architecture to account only for available features without requiring any imputation strategy. More specifically, this model tailors the transformer architecture to tabular data by adapting its feature embedding and masked self-attention to mask missing data and fully exploit the available values. By making use of ad-hoc designed losses for OS, it is able to account for both censored and uncensored patients, as well as changes in risks over time. We compared our method with state-of-the-art models for survival analysis coupled with different imputation strategies.
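
The idea of masking missing features inside self-attention, rather than imputing them, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-feature linear embedding, the single attention head without learned query/key/value projections, the mean pooling, and all names (`embed_features`, `masked_self_attention`, `patient_repr`) are assumptions made for illustration only.

```python
import numpy as np

def embed_features(x, weight, bias):
    """Map each scalar feature to a d-dimensional token: token_j = x_j * w_j + b_j.
    Missing entries (NaN) get a zero placeholder; they are masked out later,
    so the placeholder value never influences the available features."""
    x_filled = np.nan_to_num(x, nan=0.0)            # shape (n_features,)
    return x_filled[:, None] * weight + bias        # shape (n_features, d)

def masked_self_attention(tokens, available):
    """Single-head scaled dot-product self-attention in which missing
    features are excluded as keys. Assumes at least one available feature.
    tokens: (n_features, d); available: boolean array of shape (n_features,)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)                  # (n, n)
    scores = np.where(available[None, :], scores, -np.inf)   # hide missing keys
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)                                 # exp(-inf) = 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights

# Toy patient with four features, two of which are missing (hypothetical data).
rng = np.random.default_rng(0)
n_features, d = 4, 8
weight = rng.normal(size=(n_features, d))
bias = rng.normal(size=(n_features, d))
x = np.array([0.5, np.nan, -1.2, np.nan])
available = ~np.isnan(x)

tokens = embed_features(x, weight, bias)
out, attn = masked_self_attention(tokens, available)
# Pool only over the tokens of observed features to get a patient representation.
patient_repr = out[available].mean(axis=0)
```

In this sketch the attention weights on missing features are exactly zero, so the representation of a patient depends only on the observed values and no imputation step is needed.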
We evaluated the results obtained over a period of 6 years using different time granularities, obtaining a Ct-index, a time-dependent variant of the C-index, of 71.97, 77.58 and 80.72 for time units of 1 month, 1 year and 2 years, respectively, outperforming all state-of-the-art methods regardless of the imputation method used.

The results show that our model not only outperforms the state of the art but also simplifies the analysis in the presence of missing data, by effectively eliminating the need to identify the most appropriate imputation strategy for predicting OS in NSCLC patients.

Copyright © 2024 The Author(s). Published by Elsevier B.V. All rights reserved.
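
For readers unfamiliar with the Ct-index reported in the results, the sketch below shows one common formulation of a time-dependent concordance index on discretized time. It is an assumption that this matches the exact variant used in the paper; the function name `ct_index`, the half-credit rule for ties, and the toy cohort are illustrative only. A pair (i, j) is comparable when patient i had the event at time bin t_i and patient j was still under observation after t_i; it is concordant when the model assigns patient i a lower survival probability at t_i.

```python
def ct_index(times, events, surv):
    """Time-dependent concordance on discrete time bins.
    times[i]:  observed time bin index of patient i (event or censoring)
    events[i]: 1 if the event was observed, 0 if the patient was censored
    surv[i][t]: predicted survival probability of patient i at bin t"""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        t_i = times[i]
        for j in range(n):
            if times[j] > t_i:  # patient j outlived (or was followed past) t_i
                comparable += 1
                if surv[i][t_i] < surv[j][t_i]:
                    concordant += 1.0
                elif surv[i][t_i] == surv[j][t_i]:
                    concordant += 0.5  # ties get half credit (a convention)
    return concordant / comparable if comparable else float("nan")

# Hypothetical cohort: four patients, five time bins, patient 1 is censored.
times = [1, 3, 2, 4]
events = [1, 0, 1, 1]
surv = [
    [0.90, 0.40, 0.20, 0.10, 0.05],
    [1.00, 0.90, 0.80, 0.70, 0.60],
    [0.95, 0.70, 0.30, 0.20, 0.10],
    [1.00, 0.95, 0.90, 0.60, 0.30],
]
score = ct_index(times, events, surv)  # 1.0: every comparable pair is ordered correctly
```

The censored patient still contributes as the longer-surviving member of comparable pairs, which is how this kind of metric uses information from both censored and uncensored patients. Changing the time unit changes the number of bins and therefore which pairs are comparable.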