
Dokl. RAN. Math. Inf. Proc. Upr., 2025 Volume 527, Pages 320–331 (Mi danma690)

SPECIAL ISSUE: ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING TECHNOLOGIES

A multi-aspect evaluation of tokenizer adaptation methods for large language models on Russian

G. D. Andriushchenko^a, M. È. Godunova^a, V. V. Ivanova^b, D. S. Kuzmin^a, A. A. Parinov^a, A. Yu. Schenikova^c, E. V. Zhemchuzhina^a

^a National Research University Higher School of Economics, Moscow
^b Innopolis University
^c MTS Web Services, Moscow

Abstract: Large language models (LLMs) pretrained on English-centric corpora are biased toward English and perform sub-optimally on other natural languages. Adapting an LLM's vocabulary provides a resource-efficient way to improve the quality of a pretrained model. Previously proposed adaptation techniques focus on performance (accuracy) and tokenization size metrics (fertility), while ignoring other aspects such as inference latency, the compute required for adaptation, and catastrophic forgetting. This paper fills this gap with a multi-aspect comparison of several tokenizer adaptation techniques for a fixed decoder-based LLM. In our experiments we focus on Russian only, for clarity of results under limited computational resources. Under controlled conditions, we compare three methods. The work establishes new baselines for tokenizer adaptation in Russian and demonstrates a computationally efficient way to enhance performance, reducing GPU hours by a factor of 2–3.
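For readers unfamiliar with the fertility metric mentioned in the abstract, it is commonly defined as the average number of subword tokens a tokenizer produces per whitespace-separated word; lower fertility on Russian text generally means shorter sequences and faster inference. The following is a minimal illustrative sketch of measuring fertility with a Hugging Face tokenizer; the model name and sample sentences are assumptions for demonstration and are not taken from the paper.

# Minimal sketch (illustrative, not the authors' code): measure tokenizer
# fertility = average subword tokens per whitespace-separated word.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Return the average number of subword tokens per word over the texts."""
    n_tokens, n_words = 0, 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenizer.encode(text, add_special_tokens=False))
    return n_tokens / max(n_words, 1)

if __name__ == "__main__":
    # Hypothetical Russian sample; any held-out Russian corpus would do.
    sample_ru = [
        "Большие языковые модели обучаются на англоязычных корпусах.",
        "Адаптация словаря токенизатора снижает фертильность на русском языке.",
    ]
    # "gpt2" is an illustrative English-centric tokenizer, not the model studied in the paper.
    base_tok = AutoTokenizer.from_pretrained("gpt2")
    print("base fertility:", round(fertility(base_tok, sample_ru), 2))

Comparing such a baseline value against the fertility of a Russian-adapted tokenizer on the same corpus is the standard way this size metric is reported.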

Keywords: large language models, tokenizer adaptation, text generation quality, text generation speed.

UDC: 00.6:004.89

Received: 21.08.2025
Accepted: 29.09.2025

DOI: 10.7868/S2686954325070288


