Abstract:
Large language models (LLMs) pretrained on English-centric corpora exhibit biases and perform sub-optimally in other natural languages. Adapting an LLM's vocabulary provides a resource-efficient way to improve the quality of a pretrained model. Previously proposed adaptation techniques focus on performance (accuracy) and tokenization (fertility) metrics, overlooking other aspects such as inference latency, the compute required for adaptation, and catastrophic forgetting. This paper fills this gap with a multi-aspect comparison of several tokenizer adaptation techniques for a fixed decoder-based LLM. In our experiments, we focus only on Russian for clarity of results under limited computational resources. Under controlled conditions, we compare three methods. The work establishes new baselines for tokenizer adaptation in Russian and demonstrates a computationally efficient way to enhance performance, reducing GPU hours by a factor of 2–3.
Keywords: large language models, tokenizer adaptation, text generation quality, text generation speed.