Abstract:
This paper compares pre-training a transformer on natural-language texts with pre-training it on sentences of a synthetic pseudolanguage. The artificial texts were generated automatically according to rules we wrote as a context-free grammar. Fine-tuning on tasks of the RussianSuperGLUE benchmark showed, with statistical reliability, that the two models achieved the same scores. This suggests that using artificial texts offers an advantage for AI safety, since the composition of the dataset can be fully controlled. It also suggests that at the pre-training stage a model such as RoBERTa only needs to learn to recognize the syntactic and morphological patterns of the language, and these patterns can be produced by fairly simple means, such as a context-free grammar.
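For illustration, the sketch below shows one simple way pseudolanguage sentences can be generated from a context-free grammar. The nonterminals, production rules, and pseudo-words are invented for this example; the paper's actual grammar is not reproduced here.

```python
import random

# Hypothetical toy grammar. Nonterminals map to lists of alternative
# right-hand sides; any symbol without a rule is treated as a terminal.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "Adj", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["ta"], ["mo"]],
    "Adj": [["ruvi"], ["kela"]],
    "N":   [["sora"], ["timun"], ["velak"]],
    "V":   [["panit"], ["doru"]],
}

def expand(symbol: str) -> list[str]:
    """Recursively expand a symbol by picking a random production."""
    if symbol not in GRAMMAR:          # terminal: emit the pseudo-word itself
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    words = []
    for sym in production:
        words.extend(expand(sym))
    return words

if __name__ == "__main__":
    # Generate a few pseudolanguage sentences for a pre-training corpus.
    for _ in range(5):
        print(" ".join(expand("S")))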
Keywords: deep learning methods, transformers, pre-training, automatic text generation, language models, synthetic data, AI safety.
UDC: 004.8
Presented: A. L. Semenov. Received: 03.09.2023. Revised: 15.09.2023. Accepted: 24.10.2023.