Abstract:
This paper presents an approach to developing and fine-tuning large language models for Russian that are capable of following instructions across domains. XGLM-4.5B, LLaMA-1 7B, LLaMA-1 13B, LLaMA-2 7B, LLaMA-2 13B, and ruGPT-3.5 13B were used as base models. The work compares two main fine-tuning techniques: full fine-tuning of all model parameters and parameter-efficient fine-tuning with LoRA layers. To create the fine-tuning dataset, several open English-language data sources were used, including Databricks Dolly 15k, the OpenAssistant Conversations Dataset (OASST1), and chip2-instruct-alpha-v6a-1, which were then translated into Russian using the WMT21 En-X model. The work shows that the quality of the instructions used for training significantly affects the models' ability to solve tasks, as measured by automatic evaluation benchmarks such as MT-Bench and MMLU. At the same time, models trained on the commercially licensed dataset collected in this work achieve results comparable to models fine-tuned on the Saiga dataset, which carries a restrictive license. The fine-tuned language models and the collected Russian-language dataset are released as open source under licenses suitable for commercial use.
Keywords: large language models, language models, language models in Russian.
UDC: 004.8
Presented by A. L. Semenov. Received: 31.08.2023. Revised: 30.09.2023. Accepted: 15.10.2023.