
Proceedings of ISP RAS, 2022 Volume 34, Issue 4, Pages 187–200 (Mi tisp713)

Methods and techniques to automatic entity linking in Russian

A. A. Mezentseva (a, b), E. P. Bruches (a, b), T. V. Batura (a)

(a) A.P. Ershov Institute of Informatics Systems, Siberian Branch of the Russian Academy of Sciences
(b) Novosibirsk State University

Abstract: There is growing interest in solving NLP tasks with the help of external knowledge bases, for example in information retrieval, question answering, and dialogue systems. This makes it important to establish links between entities in the processed text and entries in a knowledge base. This article is devoted to entity linking with Wikidata as the external knowledge base; the entities we consider are scientific terms in Russian. A traditional entity linking system has three stages: entity recognition, candidate generation from the knowledge base, and candidate ranking. Our system takes as input raw text in which the terms are already marked. To generate candidates, we use string matching between the terms in the input text and entities from Wikidata. Candidate ranking is the most complicated stage because it requires semantic information. We conducted several experiments for this stage with different models, including an approach based on cosine similarity, classical machine learning algorithms, and neural networks. We also extended the RUSERRC dataset with manually annotated data for model training. The results showed that the approach based on cosine similarity outperforms the others and does not require manually annotated data. The dataset and the system are open-sourced and available to other researchers.
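
The two stages described above, candidate generation by string matching and candidate ranking by cosine similarity, can be illustrated with a minimal sketch. It is not the authors' implementation: the knowledge base entries, their identifiers (`Q_A`, `Q_B`, `Q_C`), and the function names are hypothetical, the example uses English text for readability, and a toy bag-of-words vector stands in for whatever text representation the actual system uses.

```python
import math
from collections import Counter

# Hypothetical miniature knowledge base: (id, label, description).
# The ids are placeholders, not real Wikidata identifiers.
KB = [
    ("Q_A", "kernel", "central component of an operating system managing hardware"),
    ("Q_B", "kernel", "function used in machine learning to compute similarity"),
    ("Q_C", "graph", "mathematical structure of vertices and edges"),
]


def bow(text):
    """Toy bag-of-words vector; an assumed stand-in for real text embeddings."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def generate_candidates(term):
    """Candidate generation: case-insensitive exact string match on labels."""
    return [entry for entry in KB if entry[1].lower() == term.lower()]


def link(term, context):
    """Candidate ranking: pick the entry whose description is closest to the context."""
    ctx = bow(context)
    scored = [
        (cosine(ctx, bow(description)), entity_id)
        for entity_id, _, description in generate_candidates(term)
    ]
    return max(scored)[1] if scored else None
```

Calling `link("kernel", "the kernel function is used in machine learning models")` selects the machine learning sense, because only that description shares words with the context. A term with no matching label yields `None`. Note that this ranking step needs no annotated training data, which matches the property the abstract attributes to the cosine similarity approach.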

Keywords: entity linking, knowledge base, scientific terms

DOI: 10.15514/ISPRAS-2022-34(4)-13


