Abstract:
This paper compares pre-training a transformer on natural-language texts with pre-training it on sentences of a synthetic pseudolanguage. The artificial texts were generated automatically according to rules we wrote as a context-free grammar. Fine-tuning on tasks of the RussianSuperGLUE benchmark showed, with statistical reliability, that the two models achieved the same scores. This suggests that using artificial texts offers an advantage for AI safety, since the composition of the dataset can be fully controlled. It also suggests that at the pre-training stage a model such as RoBERTa only needs to learn to recognize the syntactic and morphological patterns of the language, and these patterns can be produced by fairly simple means, such as a context-free grammar.
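For illustration, the sketch below shows one simple way pseudolanguage sentences can be generated from a context-free grammar. The nonterminals, production rules, and pseudo-words are invented for this example; the paper's actual grammar is not reproduced here.

```python
import random

# Hypothetical toy grammar. Nonterminals map to lists of alternative
# right-hand sides; any symbol without a rule is treated as a terminal.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "Adj", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["ta"], ["mo"]],
    "Adj": [["ruvi"], ["kela"]],
    "N":   [["sora"], ["timun"], ["velak"]],
    "V":   [["panit"], ["doru"]],
}

def expand(symbol: str) -> list[str]:
    """Recursively expand a symbol by picking a random production."""
    if symbol not in GRAMMAR:          # terminal: emit the pseudo-word itself
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    words = []
    for sym in production:
        words.extend(expand(sym))
    return words

if __name__ == "__main__":
    # Generate a few pseudolanguage sentences for a pre-training corpus.
    for _ in range(5):
        print(" ".join(expand("S")))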
Keywords: deep learning methods, transformers, pre-training, automatic text generation, language models, synthetic data, AI safety.
UDC: 004.8
Presented: A. L. Semenov. Received: 03.09.2023. Revised: 15.09.2023. Accepted: 24.10.2023.