Paired evaluation of machine-learning models characterizes effects of confounders and outliers.

Published: 2023 Aug 11

Authors: Maulik K Nariya, Caitlin E Mills, Peter K Sorger, Artem Sokolov

Source: Alzheimers & Dementia

Abstract:

The true accuracy of a machine-learning model is a population-level statistic that cannot be observed directly. In practice, predictor performance is estimated against one or more test datasets, and the accuracy of this estimate strongly depends on how well the test sets represent all possible unseen datasets. Here we describe paired evaluation as a simple, robust approach for evaluating performance of machine-learning models in small-sample biological and clinical studies. We use the method to evaluate predictors of drug response in breast cancer cell lines and of disease severity in patients with Alzheimer's disease, demonstrating that the choice of test data can cause estimates of performance to vary by as much as 20%. We show that paired evaluation makes it possible to identify outliers, improve the accuracy of performance estimates in the presence of known confounders, and assign statistical significance when comparing machine-learning models.

© 2023 The Author(s).
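
The abstract does not spell out the procedure, so the following is only an illustrative sketch of one plausible reading: hold out one positive and one negative sample at a time, train on the remainder, and record whether the held-out pair is ranked correctly. The function names (`centroid_score`, `paired_evaluation`, `per_sample_accuracy`), the toy one-dimensional scorer, and the data are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import product
from statistics import mean


def centroid_score(train_x, train_y, x):
    """Toy 1-D scorer (an assumption, not the authors' model).

    Returns a larger value when x is closer to the positive-class mean
    than to the negative-class mean, using squared distances.
    """
    pos = mean(v for v, y in zip(train_x, train_y) if y == 1)
    neg = mean(v for v, y in zip(train_x, train_y) if y == 0)
    return (x - neg) ** 2 - (x - pos) ** 2


def paired_evaluation(xs, ys, score_fn=centroid_score):
    """Hold out every (positive, negative) pair, train on the rest, and
    record whether the positive sample receives the higher score."""
    pos_idx = [i for i, y in enumerate(ys) if y == 1]
    neg_idx = [i for i, y in enumerate(ys) if y == 0]
    outcomes = {}
    for i, j in product(pos_idx, neg_idx):
        keep = [k for k in range(len(xs)) if k not in (i, j)]
        train_x = [xs[k] for k in keep]
        train_y = [ys[k] for k in keep]
        s_pos = score_fn(train_x, train_y, xs[i])
        s_neg = score_fn(train_x, train_y, xs[j])
        outcomes[(i, j)] = s_pos > s_neg
    return outcomes


def fraction_correct(outcomes):
    """Share of held-out pairs that were ranked correctly."""
    return sum(outcomes.values()) / len(outcomes)


def per_sample_accuracy(outcomes):
    """Share of correctly ranked pairs that each sample took part in.

    A sample with an unusually low value is a candidate outlier.
    """
    tally = {}
    for (i, j), ok in outcomes.items():
        for k in (i, j):
            tally.setdefault(k, []).append(ok)
    return {k: sum(v) / len(v) for k, v in tally.items()}


# Hypothetical data: the last sample is labeled positive but looks negative.
xs = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 0.05]
ys = [0, 0, 0, 1, 1, 1, 1]

outcomes = paired_evaluation(xs, ys)
overall = fraction_correct(outcomes)
by_sample = per_sample_accuracy(outcomes)
```

In this toy setup the last sample is mis-ranked in every pair it appears in, which is the kind of per-sample signal that could flag an outlier. Because each pair yields a binary outcome, two models scored on the same pairs could in principle be compared with a paired test, though the specific statistical procedure used by the authors is not given in the abstract.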