Abstract:
Due to the rapid growth in the number of scientific publications and reports, processing and analyzing them has become complicated and labor-intensive. Transformer language models pretrained on large text collections can provide high-quality solutions for a variety of tasks related to textual data analysis. For scientific texts in English, language models such as SciBERT [1] and its modification SPECTER [2] are available, but they do not support Russian, since Russian texts are scarce in their training sets. Moreover, the SciDocs benchmark, which is used to evaluate the performance of language models on scientific texts, supports only English. The proposed ruSciBERT model makes it possible to solve a wide variety of tasks related to the analysis of scientific texts in Russian. It is accompanied by the ruSciDocs benchmark for evaluating the performance of language models on these tasks.