Is metadata of articles about COVID-19 enough for multilabel topic classification task?
Impact factor: 3.6
Journal category ranking: Biology, Tier 4 / Mathematical and Computational Biology, Tier 4
Publication date: 2024 Oct 21
Authors:
Shuo Xu, Yuefu Zhang, Liang Chen, Xin An
DOI:
10.1093/database/baae106
Abstract
The ever-increasing volume of COVID-19-related articles makes manual curation and multilabel topic classification of LitCovid a significant challenge. To address this, this study develops a novel multilabel topic classification framework that accounts for both the correlation and the imbalance of topic labels while empowering the pretrained model. With this framework, the study aims to answer the following question: Do the full texts, MeSH (Medical Subject Headings) terms, and biological entities of articles about COVID-19 encode more discriminative information than metadata (title, abstract, keywords, and journal name)? Extensive experiments on our enriched version of the BC7-LitCovid corpus and on the Hallmarks of Cancer corpus support the following conclusions. Our framework demonstrates superior performance and robustness. The metadata of scientific publications about COVID-19 carries valuable information for multilabel topic classification. Compared to biological entities, full texts and MeSH can further enhance the performance of our framework for multilabel topic classification, but the improvement is very limited. Database URL: https://github.com/pzczxs/Enriched-BC7-LitCovid.
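To make the task setup concrete, the following is a minimal sketch of how metadata-only input and multilabel targets might be prepared and scored. It is not the authors' framework: the field names, the `[SEP]` separator, the helper functions, and the use of micro-averaged F1 are all assumptions made for illustration, and the topic list is assumed to be the seven LitCovid topic names.

```python
from typing import Dict, List, Sequence

# Assumed label set for illustration (the seven LitCovid topic names).
TOPICS = ["Treatment", "Diagnosis", "Prevention", "Mechanism",
          "Transmission", "Epidemic Forecasting", "Case Report"]

# Hypothetical record keys for the four metadata fields named in the abstract.
METADATA_FIELDS = ("title", "abstract", "keywords", "journal")


def build_input(record: Dict[str, str], sep: str = " [SEP] ") -> str:
    """Concatenate the non-empty metadata fields into one input string."""
    parts = [record.get(field, "").strip() for field in METADATA_FIELDS]
    return sep.join(p for p in parts if p)


def multi_hot(labels: Sequence[str]) -> List[int]:
    """Encode a set of topic names as a 0/1 vector ordered like TOPICS."""
    return [1 if topic in labels else 0 for topic in TOPICS]


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all articles and all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += t & p
            fp += (1 - t) & p
            fn += t & (1 - p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    article = {"title": "A", "abstract": "B", "journal": "J"}
    print(build_input(article))
    gold = [multi_hot(["Treatment", "Case Report"])]
    pred = [multi_hot(["Treatment", "Diagnosis"])]
    print(micro_f1(gold, pred))
```

An article can carry several topics at once, so each target is a vector rather than a single class, and pooled (micro) counts give frequent labels more weight than rare ones. That weighting is one reason label imbalance matters in this setting.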