Doklady Rossijskoj Akademii Nauk. Mathematika, Informatika, Processy Upravlenia

Dokl. RAN. Math. Inf. Proc. Upr., 2024 Volume 520, Number 2, Pages 284–294 (Mi danma607)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

RuSciBench: open benchmark for Russian and English scientific document representations

A. C. Vatolin (a), N. A. Gerasimenko (a, b, c), A. O. Yanina (d), K. V. Vorontsov (a, d, c)

a Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, Moscow, Russia
b SberAI, Moscow, Russia
c Artificial Intelligence Institute, M. V. Lomonosov Moscow State University, Moscow, Russia
d Moscow Institute of Physics and Technology, Dolgoprudny, Moscow region, Russia

Abstract: Sharing scientific knowledge within the community is an important endeavor. However, most papers are written in English, which makes knowledge harder to disseminate in countries where most people do not speak English. Machine translation and language models may help to solve this problem, but it remains difficult to train and evaluate models in languages other than English when little or no data is available in the required language. To address this, we propose the first benchmark for evaluating models on scientific texts in Russian. It consists of papers from the Russian electronic library of scientific publications. We also present a set of tasks that can be used to fine-tune various models on our data, and we provide a detailed comparison of state-of-the-art models on our benchmark.

Keywords: dataset collection, benchmarking, large language models (LLM), LLM evaluation, representation learning.

UDC: 004.048

Received: 27.09.2024
Accepted: 02.10.2024

DOI: 10.31857/S2686954324700644


English version:
Doklady Mathematics, 2024, 110:suppl. 1, S251–S260
