A novel deep machine learning algorithm with dimensionality and size reduction approaches for feature elimination: thyroid cancer diagnoses with randomly missing data
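Impact factor: 7.7
Journal ranking: Biology, Zone 2 / Mathematical and Computational Biology, Zone 1; Biochemical Research Methods, Zone 2
Publication date: 2024 May 23
Authors: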
Onder Tutsoy, Hilmi Erdem Sumbul
DOI:
10.1093/bib/bbae344
keywords:
big data; deep learning; dimension reduction; feature selection; missing data; size reduction; thyroid cancer
Abstract
Thyroid cancer incidence continues to rise even though a large number of inspection tools have been developed recently. Because there is no standard, definitive procedure for diagnosing thyroid cancer, clinicians must conduct various tests. This scrutiny process yields multi-dimensional big data, and the lack of a common approach leads to randomly distributed missing (sparse) data; both are formidable challenges for machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose thyroid cancer. To this end, the singularity in the learning problem that stems from randomly distributed missing data is treated, and dimensionality reduction with inner and target similarity approaches is developed to select the most informative input datasets. In addition, size reduction with a hierarchical clustering algorithm is performed to eliminate considerably similar data samples. Four machine learning algorithms are trained and then tested with unseen data to validate their generalization and robustness. The results yield 100% training and 83% testing preciseness for the unseen data. The computational time efficiency of the algorithms is also examined under equal conditions.
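The two reduction steps named in the abstract can be illustrated with a minimal sketch. The abstract does not define its similarity measures, linkage rule, or thresholds, so everything below is an assumption made for illustration and not the authors' method: similarity is taken to be the absolute Pearson correlation computed over co-observed entries, clustering is single-linkage with a distance cutoff, missing values are encoded as None, and the names pearson, select_features, distance, and reduce_samples are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation using only positions where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / sqrt(sxx * syy)


def select_features(columns, target, target_min=0.1, inner_max=0.9):
    """Dimensionality reduction (assumed form): visit features from most to
    least target-similar, skip those below target_min, and skip any feature
    whose similarity to an already kept feature reaches inner_max."""
    target_sim = {name: abs(pearson(col, target)) for name, col in columns.items()}
    kept = []
    for name in sorted(columns, key=lambda n: -target_sim[n]):
        if target_sim[name] < target_min:
            continue
        if all(abs(pearson(columns[name], columns[k])) < inner_max for k in kept):
            kept.append(name)
    return kept


def distance(a, b):
    """Euclidean distance over co-observed features, rescaled to full length."""
    diffs = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    if not diffs:
        return float("inf")  # no shared observations: treat as maximally far
    return sqrt(sum(diffs) * len(a) / len(diffs))


def reduce_samples(rows, threshold):
    """Size reduction (assumed form): single-linkage agglomerative clustering
    that stops once the closest clusters are farther apart than threshold;
    returns the lowest row index of each cluster as its representative."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(rows[p], rows[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return sorted(min(c) for c in clusters)
```

Computing both the correlation and the distance over co-observed entries avoids imputing the missing values, which is one simple way to cope with randomly missing data; whether the paper does this is not stated in the abstract. The pairwise search in reduce_samples is written for readability and would be too slow for large datasets.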