Research Updates
Articles below are published ahead of final publication in an issue. Please cite articles in the following format: authors, (year), title, journal, DOI.


Recognition and normalization of multilingual symptom entities using in-domain-adapted BERT models and classification layers.

Published: 2024 Aug 28
Authors: Fernando Gallego, Francisco J Veredas
Source: Database (Oxford)

Abstract:

Due to the scarcity of available annotations in the biomedical domain, clinical natural language processing poses a substantial challenge, especially when applied to low-resource languages. This paper presents our contributions to the detection and normalization of clinical entities corresponding to symptoms, signs, and findings present in multilingual clinical texts. To this end, we addressed the three subtasks proposed in the SympTEMIST shared task of the BioCreative VIII conference. For Subtask 1 (named entity recognition in a Spanish corpus), we followed an approach focused on assemblies of BERT-based models pretrained on a proprietary oncology corpus. Subtasks 2 and 3 of SympTEMIST address named entity linking (NEL) in Spanish and multilingual corpora, respectively. Our approach to these subtasks follows a classification strategy that starts from a bi-encoder trained by contrastive learning, for which several SapBERT-like models were explored. To apply this NEL approach to different languages, we trained these models by leveraging the knowledge base of domain-specific medical concepts in Spanish supplied by the organizers, which we translated into the other languages of interest using machine translation tools. The results obtained in the three subtasks establish a new state of the art. For Subtask 1, we obtain a precision of 0.804, an F1-score of 0.748, and a recall of 0.699. For Subtask 2, we obtain performance gains of up to 5.5% in top-1 accuracy when the trained bi-encoder is followed by a WNT-softmax classification layer initialized with the mean of the embeddings of a subset of SNOMED-CT terms. For Subtask 3, the differences are even more pronounced: when combined with a WNT-softmax classification layer, our multilingual bi-encoder outperforms the other models analyzed in all languages except Swedish. The improvements in top-1 accuracy over the best bi-encoder model alone are 13% for Portuguese and 13.26% for Swedish.

Database URL: https://doi.org/10.1093/database/baae087. © The Author(s) 2024. Published by Oxford University Press.
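The classification step described in the abstract — a softmax layer whose per-concept weight vectors are initialized with the mean of the embeddings of that concept's terms, so that linking reduces to similarity against concept prototypes — can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' code: the function names, toy embeddings, and cosine-similarity scoring are assumptions, and the real system uses a contrastively trained bi-encoder over SNOMED-CT terminology.

```python
import numpy as np

def init_concept_weights(concept_terms, embed):
    """Build one weight vector per concept as the mean embedding of its
    term variants (the initialization strategy described in the abstract).

    concept_terms: list of lists of term strings, one inner list per concept.
    embed: callable mapping a term string to a 1-D embedding vector.
    """
    return np.stack([
        np.mean([embed(term) for term in terms], axis=0)
        for terms in concept_terms
    ])

def link_mention(mention_embedding, concept_weights):
    """Score a mention embedding against every concept prototype and return
    (best concept index, softmax probabilities over concepts)."""
    # Cosine similarity: normalize both prototypes and the mention.
    W = concept_weights / np.linalg.norm(concept_weights, axis=1, keepdims=True)
    m = mention_embedding / np.linalg.norm(mention_embedding)
    logits = W @ m
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

In a trained system the prototype matrix would then be fine-tuned as an ordinary linear classification layer; the mean-embedding initialization simply gives each concept a sensible starting point derived from its known synonyms.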