Abstract:
This paper studies the use of artificial intelligence methods to identify psychotypes according to the MBTI typology from Russian-language data. The Myers-Briggs typology (MBTI) is a system for classifying personality types that uses four dichotomies: extraversion/introversion, sensing/intuition, thinking/feeling, and judging/perceiving. Together these yield 16 possible personality types. The typology has many practical applications: hiring, improving the effectiveness of work groups, conflict resolution, and career choice. Determining a personality type with MBTI psychological testing is labor-intensive and requires completing an extensive test, so simplifying the classification is a relevant task. Data suitable for training artificial intelligence models are available on the Internet: texts written by users, paired with the authors' MBTI personality types. Similar studies of the connection between publication texts and personality types have already been carried out, but not for the Russian language, which further motivates this work. The aim of this article is to create and train a neural network that accurately determines the MBTI personality type of authors from the texts of their forum publications. The Russian dataset was obtained by machine translation of English-language publications. To prepare the data for analysis, we performed lemmatization and corrected class imbalance by oversampling. To further mitigate class imbalance, we also moved from a single sixteen-class classification to four binary classifications, one for each personality type dichotomy.
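The move from one sixteen-class task to four per-dichotomy tasks, together with oversampling, can be illustrated with a minimal sketch. The label encoding (E, N, T, J mapped to 1), the function names, and the choice of random duplication as the oversampling method are assumptions made for illustration; the abstract does not specify which oversampling variant was used.

```python
import random

# Assumed encoding: the letter listed here maps to 1 on its dichotomy,
# the opposite letter (I, S, F, P) maps to 0.
POSITIVE_LETTERS = "ENTJ"


def split_type(mbti_type):
    """Turn a four-letter type such as 'INTJ' into four binary labels."""
    mbti_type = mbti_type.upper()
    if len(mbti_type) != 4:
        raise ValueError("expected a four-letter MBTI type")
    return tuple(
        int(letter == positive)
        for letter, positive in zip(mbti_type, POSITIVE_LETTERS)
    )


def oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class examples until all classes
    have as many examples as the largest class (random oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels
```

With this encoding, `split_type("INTJ")` gives `(0, 1, 1, 1)`, and each of the four positions can be balanced and modeled separately. In such a setup, oversampling would normally be applied only to the training split so that duplicated texts do not leak into evaluation.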
We studied the influence of stop words on the final result and found that retaining them improves the accuracy of psychotype determination. Model accuracy also increases when character sequences typical of online communication are generalized: links to images and videos, emoticons, punctuation peculiarities such as long runs of dots or exclamation marks, numbered lists, time designations, and so forth. In total, more than fifty such common constructions were identified. We compared the accuracy of several artificial intelligence methods on this task: a naive Bayes classifier, a linear neural network, a random forest, a recurrent neural network, and the deep learning models BERT and FNet. BERT achieves the best accuracy, 0.81. FNet comes second with an accuracy of 0.66; on three of the four dichotomies its results are comparable to BERT's while requiring significantly less training time. The results are compared with those reported for other languages. The paper also presents an expert opinion on several publication texts. Out of 10 examples, the expert's opinion coincides with the neural network's prediction in 8 cases and diverges in 2; in 1 case the expert disagrees with the label in the original data and agrees with the network's prediction. This work could be improved by extending the dataset with publications from other forums and social networks. An important improvement would be to include publications originally written in Russian in addition to datasets translated from other languages. This would help capture the specifics of online communication in Russian and increase the applicability of the results.
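The generalization of online-communication constructions mentioned above can be sketched as a small set of regular-expression rules. The placeholder tokens (`<IMAGE>`, `<VIDEO>` and so on), the specific patterns, and the function name `generalize` are hypothetical; they cover only a handful of the more than fifty constructions reported and are not the rules used in the study.

```python
import re

# Hypothetical rules, applied in order: specific URL kinds come before
# the generic link rule so that they are not swallowed by it.
RULES = [
    (re.compile(r"https?://\S+\.(?:jpg|jpeg|png|gif)\b\S*", re.I), " <IMAGE> "),
    (re.compile(r"https?://(?:www\.)?(?:youtube\.com|youtu\.be)/\S+", re.I), " <VIDEO> "),
    (re.compile(r"https?://\S+"), " <LINK> "),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), " <TIME> "),    # e.g. 12:30
    (re.compile(r"\.{3,}"), " <DOTS> "),               # long runs of dots
    (re.compile(r"!{2,}"), " <EXCL> "),                # repeated exclamation marks
    (re.compile(r"[:;]-?[)(DP]"), " <EMOTICON> "),     # a few simple emoticons
]


def generalize(text):
    """Replace each matched construction with its placeholder token,
    then collapse repeated whitespace."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return " ".join(text.split())


print(generalize("Look https://example.com/cat.jpg ... wow!!! at 12:30 :)"))
# Look <IMAGE> <DOTS> wow <EXCL> at <TIME> <EMOTICON>
```

Mapping many surface variants to one token keeps the stylistic signal (the author used an emoticon or a long ellipsis) while shrinking the vocabulary the model has to learn.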