USNAP: 快速独特密集区域检测及其在肺癌中的应用。
USNAP: Fast unique dense region detection and its application to lung cancer.
发表日期:2023 Aug 01
作者:
Serene W H Wong, Chiara Pastrello, Max Kotlyar, Christos Faloutsos, Igor Jurisica
来源:
BIOINFORMATICS
摘要:
许多现实世界的问题可以建模为带有注释的图形。由于这些图形规模大,拓扑变化多样,并且节点/边缘注释各不相同,因此需要可扩展的图形算法来从这些数据中提取可操作的信息。当这些图形随着时间的变化而改变时,它们形成了动态图形,并且有可能在不同时间点找到模式。在本文中,我们介绍了一种可扩展的算法,用于在动态图形中找到不同时间点上的独特密集区域。这样的算法在许多不同领域都有应用,包括生物学、金融学和社会学领域。
本文的三个重要贡献是:首先,我们设计了一种可扩展的算法 USNAP,能够在给定动态图形的时间戳下有效地识别独特的密集子图。重要的是,USNAP 在贪婪算法的每一步中提供了密度测量的下界。其次,通过在真实数据上验证 USNAP 获得的洞察和理解显示了其有效性。虽然 USNAP 是与领域无关的,但我们将其应用于四个非小细胞肺癌 (NSCLC) 基因表达数据集。NSCLC 的阶段被建模为动态图形,并作为 USNAP 的输入。通过通路富集分析和来自文献的全面解释,显示出 USNAP 识别了癌症进程不同阶段的生物相关机制。第三,USNAP 是可扩展的,其时间复杂度为 O(m + mclognc + nclognc),其中 m 是边的数量,n 是动态图形中顶点的数量;mc 是折叠图形中边的数量,nc 是折叠图形中顶点的数量。
USNAP 的代码可在 https://www.cs.utoronto.ca/∼juris/data/USNAP22 获取。补充数据可在 Bioinformatics online 上获取。© The Author(s) 2023. Published by Oxford University Press.
Many real-world problems can be modeled as annotated graphs. Scalable graph algorithms that extract actionable information from such data are in demand since these graphs are large, varying in topology, and have diverse node/edge annotations. When these graphs change over time they create dynamic graphs, and open the possibility to find patterns across different time points. In this paper, we introduce a scalable algorithm that finds unique dense regions across time points in dynamic graphs. Such algorithms have applications in many different areas, including the biological, financial, and social domains.There are three important contributions to this manuscript. First, we designed a scalable algorithm, USNAP, to effectively identify dense subgraphs that are unique to a time stamp given a dynamic graph. Importantly, USNAP provides a lower bound of the density measure in each step of the greedy algorithm. Second, insights and understanding obtained from validating USNAP on real data show its effectiveness. While USNAP is domain independent, we applied it to four non-small cell lung cancer (NSCLC) gene expression datasets. Stages in NSCLC were modeled as dynamic graphs, and input to USNAP. Pathway enrichment analyses and comprehensive interpretations from literature show that USNAP identified biologically relevant mechanisms for different stages of cancer progression. Third, USNAP is scalable, and has a time complexity of O(m + mclognc + nclognc), where m is the number of edges, and n is the number of vertices in the dynamic graph; mc is the number of edges, and nc is the number of vertices in the collapsed graph.The code of USNAP is available at https://www.cs.utoronto.ca/∼juris/data/USNAP22.Supplementary data are available at Bioinformatics online.© The Author(s) 2023. Published by Oxford University Press.