Multi-Modal Federated Learning for Cancer Staging over Non-IID Datasets with Unbalanced Modalities.
Publication date: 2024 Aug 28
Authors:
Kasra Borazjani, Naji Khosravan, Leslie Ying, Seyyedali Hosseinalipour
Source:
IEEE TRANSACTIONS ON MEDICAL IMAGING
Abstract:
The use of machine learning (ML) for cancer staging through medical image analysis has gained substantial interest across medical disciplines. When accompanied by the innovative federated learning (FL) framework, ML techniques can further overcome privacy concerns related to patient data exposure. Given the frequent presence of diverse data modalities within patient records, leveraging FL in a multi-modal learning framework holds considerable promise for cancer staging. However, existing works on multi-modal FL often presume that all data-collecting institutions have access to all data modalities. This oversimplified approach neglects institutions that have access to only a portion of data modalities within the system. In this work, we introduce a novel FL architecture designed to accommodate not only the heterogeneity of data samples, but also the inherent heterogeneity/non-uniformity of data modalities across institutions. We shed light on the challenges associated with varying convergence speeds observed across different data modalities within our FL system. Subsequently, we propose a solution to tackle these challenges by devising a distributed gradient blending and proximity-aware client weighting strategy tailored for multi-modal FL. To show the superiority of our method, we conduct experiments using The Cancer Genome Atlas program (TCGA) datalake considering different cancer types and three modalities of data: mRNA sequences, histopathological image data, and clinical information. Our results further unveil the impact and severity of class-based vs type-based heterogeneity across institutions on the model performance, which widens the perspective to the notion of data heterogeneity in multi-modal FL literature.
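The setting described in the abstract, where each institution holds only some of the modalities, can be illustrated with a small sketch. It is not the authors' distributed gradient blending or proximity-aware client weighting, which the abstract does not specify. It only shows a plain sample-weighted average applied per modality, so that an encoder is averaged over just the clients that hold that modality. The function name `aggregate_by_modality`, the `ClientUpdate` type, the modality keys, and all numbers are hypothetical.

```python
from typing import Dict, List

# Hypothetical client update: maps a modality name to that modality's
# encoder weights, flattened to a list of floats. A client simply omits
# the modalities it does not hold.
ClientUpdate = Dict[str, List[float]]


def aggregate_by_modality(
    updates: List[ClientUpdate], num_samples: List[int]
) -> Dict[str, List[float]]:
    """Sample-weighted average of each modality's encoder, taken only over
    the clients that hold that modality (an assumed baseline, not the
    paper's method)."""
    if len(updates) != len(num_samples):
        raise ValueError("one sample count is required per client update")

    totals: Dict[str, List[float]] = {}
    weight_sums: Dict[str, float] = {}
    for update, n in zip(updates, num_samples):
        if n <= 0:
            continue  # a client with no samples contributes nothing
        for modality, weights in update.items():
            if modality not in totals:
                totals[modality] = [0.0] * len(weights)
                weight_sums[modality] = 0.0
            for i, w in enumerate(weights):
                totals[modality][i] += n * w
            weight_sums[modality] += n

    return {
        modality: [v / weight_sums[modality] for v in vec]
        for modality, vec in totals.items()
    }


# Three hypothetical institutions with unbalanced modalities.
clients = [
    {"mrna": [1.0, 2.0], "image": [0.0, 4.0]},  # holds two modalities
    {"mrna": [3.0, 6.0]},                       # mRNA only
    {"clinical": [5.0]},                        # clinical data only
]
sample_counts = [1, 3, 2]

global_encoders = aggregate_by_modality(clients, sample_counts)
print(global_encoders)
```

In this toy run the mRNA encoder is averaged over the first two institutions only, while the image and clinical encoders each come from a single institution. A baseline of this kind does nothing about the differing per-modality convergence speeds that the abstract identifies as the main challenge.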