Abstract:
Due to the rapid growth in the number of scientific publications and reports, processing and analyzing them has become complicated and labor-intensive. Transformer language models pretrained on large text collections can provide high-quality solutions for a variety of tasks related to textual data analysis. For scientific texts in English, language models such as SciBERT [1] and its modification SPECTER [2] are available, but they do not support Russian, since Russian texts are scarce in their training sets. Moreover, the SciDocs benchmark, which is used to evaluate the performance of language models on scientific texts, supports only English. The proposed ruSciBERT model makes it possible to solve a wide variety of tasks related to the analysis of scientific texts in Russian. It is accompanied by the ruSciDocs benchmark for evaluating the performance of language models on these tasks.