Research Updates
Articles below are published ahead of final publication in an issue. Please cite articles in the following format: authors, (year), title, journal, DOI.

Estimating the mean in the space of ranked phylogenetic trees.

Published: 2024 Aug 23
Authors: Lars Berling, Lena Collienne, Alex Gavryushkin
Source: BIOINFORMATICS

Abstract:

Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions, typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace). Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is two-fold: a treespace has non-trivial, often counter-intuitive geometry, implying that much of classical Euclidean statistics does not immediately apply; and many parametrisations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice.
Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of the attempts resulted in a practical method for summarising tree samples. In this paper, we present a tree summary method along with useful properties of our chosen treespace while focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms the currently most popular methods with respect to a number of important "quality" statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses.
An implementation is available at https://github.com/bioDS/Centroid-Code. Supplementary data are available at Bioinformatics online.
© The Author(s) 2024. Published by Oxford University Press.