
Dokl. RAN. Math. Inf. Proc. Upr., 2023, Volume 514, Number 2, Pages 262–269 (Mi danma471)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

Accessible Russian large language models: open-sourced models and instructive datasets for commercial applications

D. Kosenko (a,b), Yu. Kuratov (a,b,c), D. Zharikov (a,b)

a Moscow Institute of Physics and Technology (National Research University), Moscow, Russia
b DeepPavlov, Moscow, Russia
c Artificial Intelligence Research Institute, Moscow, Russia

Abstract: This paper presents an approach to developing and fine-tuning large language models for Russian that are capable of following instructions across domains. XGLM-4.5B, LLaMA-1 7B, LLaMA-1 13B, LLaMA-2 7B, LLaMA-2 13B, and ruGPT-3.5 13B were used as base models. The work compares two main fine-tuning techniques: fine-tuning all model parameters and fine-tuning with LoRA layers. To create the fine-tuning dataset, several open English-language data sources were used, including Databricks Dolly 15k, the OpenAssistant Conversations Dataset (OASST1), and chip2-instruct-alpha-v6a-1, which were then translated into Russian using the WMT21 En-X model. The work shows that the quality of the training instructions significantly affects performance on automatic quality metrics such as MT-Bench and MMLU. At the same time, models trained on the commercially licensed dataset collected in this work achieve results comparable to those of models fine-tuned on the Saiga dataset, which carries a restrictive license. The fine-tuned language models and the collected Russian-language dataset are released as open source under licenses suitable for commercial use.
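To make the two pipeline stages described in the abstract concrete, two minimal sketches follow. They are illustrative only: the paper's actual training code is not reproduced here, and the checkpoint names, hyperparameters, and library calls below are assumptions based on standard Hugging Face tooling, not the authors' exact setup. The first sketch translates an English instruction into Russian with the publicly released WMT21 En-X checkpoint ("facebook/wmt21-dense-24-wide-en-x"), as described for dataset creation.

# Hypothetical sketch: translating one English instruction into Russian
# with the public WMT21 En-X checkpoint; in the paper this step is applied
# to entire instruction datasets (Dolly 15k, OASST1, chip2-instruct-alpha-v6a-1).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/wmt21-dense-24-wide-en-x"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

instruction = "Explain the difference between supervised and unsupervised learning."
inputs = tokenizer(instruction, return_tensors="pt")
# Force Russian as the target language of the multilingual decoder.
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("ru"))
print(tokenizer.decode(generated[0], skip_special_tokens=True))

The second sketch shows the LoRA alternative to full-parameter fine-tuning, using the peft library; the rank, scaling factor, and target modules are placeholder values, not the paper's configuration.

# Hypothetical sketch: attaching LoRA adapter layers so that only a small
# fraction of parameters are trained, in contrast to full fine-tuning.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "ai-forever/ruGPT-3.5-13B",        # one of the base models named above
    torch_dtype=torch.float16,
)
config = LoraConfig(
    r=8,                               # adapter rank (assumed value)
    lora_alpha=16,                     # scaling factor (assumed value)
    target_modules=["c_attn"],         # attention projection; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()     # only the LoRA layers are trainable

The practical trade-off, as the abstract notes, is between adapting all weights (higher cost, potentially higher quality) and training only the small LoRA layers (far cheaper, often competitive).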

Keywords: large language models, language models, language models in Russian.

UDC: 004.8

Presented: A. L. Semenov
Received: 31.08.2023
Revised: 30.09.2023
Accepted: 15.10.2023

DOI: 10.31857/S2686954323602063


English version:
Doklady Mathematics, 2023, Volume 108, Suppl. 2, Pages S393–S398



© Steklov Math. Inst. of RAS, 2024