
Model. Anal. Inform. Sist., 2022, Volume 29, Number 4, Pages 316–332 (Mi mais782)

Theory of data

Detecting mentions of green practices in social media based on text classification

A. V. Glazkova (a), O. V. Zakharova (a), A. V. Zakharov (a), N. N. Moskvina (a), T. R. Enikeev (b), A. N. Hodyrev (a), V. K. Borovinskiy (a), I. N. Pupysheva (a)

(a) University of Tyumen, 6 Volodarskogo str., Tyumen 625003, Russia
(b) Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia

Abstract: This paper addresses the task of detecting mentions of green practices in social media texts. The task is motivated by the need to expand existing knowledge about how green practices are used in society and how they spread. The study uses a corpus of texts published in environmental communities of the VKontakte social network. The corpus is annotated by experts for mentions of nine types of green practices. A semi-automatic approach to collecting additional texts is proposed to reduce class imbalance in the corpus. The approach consists of three steps: detecting the most frequent words for each practice type; automatically collecting social media texts that contain these words; and expert verification and filtering of the collected texts. Four machine learning models for detecting mentions of green practices are compared on two variants of the corpus: the original one and the one augmented using the proposed approach. Among these models, the highest averaged F1-score (81.32%) was achieved by Conversational RuBERT fine-tuned on the augmented corpus. The Conversational RuBERT model was chosen for the application prototype, whose main function is to detect whether a text mentions any of the nine types of green practices. The prototype is implemented as a Telegram chatbot.
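
The first two steps of the semi-automatic collection approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regular-expression tokenizer, the use of document frequency to rank words, and the practice labels in the usage example are all assumptions made for illustration. A real pipeline for Russian texts would probably also need lemmatization and a stop-word list, and the third step (expert verification) remains manual.

```python
import re
from collections import Counter, defaultdict

# Assumption: a simple word tokenizer; the paper does not specify one here.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(text):
    """Lowercase the text and keep word tokens longer than two characters."""
    return [t for t in TOKEN_RE.findall(text.lower()) if len(t) > 2]


def frequent_words_by_practice(labeled_texts, top_k=3, stopwords=frozenset()):
    """Step 1 (sketch): find the most frequent words for each practice type.

    labeled_texts is an iterable of (text, practice_labels) pairs.
    Each word is counted once per text (document frequency).
    """
    counts = defaultdict(Counter)
    for text, practices in labeled_texts:
        tokens = set(tokenize(text)) - stopwords
        for practice in practices:
            counts[practice].update(tokens)
    return {
        practice: [word for word, _ in counter.most_common(top_k)]
        for practice, counter in counts.items()
    }


def collect_candidates(unlabeled_texts, keywords):
    """Step 2 (sketch): keep texts that contain at least one frequent word.

    Returns (text, matched_practices) pairs to be passed on to the experts
    for verification and filtering (step 3, which is not automated).
    """
    candidates = []
    for text in unlabeled_texts:
        tokens = set(tokenize(text))
        matched = sorted(
            practice
            for practice, words in keywords.items()
            if tokens & set(words)
        )
        if matched:
            candidates.append((text, matched))
    return candidates


if __name__ == "__main__":
    # Hypothetical labels and texts, used only to show the call pattern.
    labeled = [
        ("sort waste into bins", ["waste_sorting"]),
        ("sort glass and plastic waste", ["waste_sorting"]),
    ]
    keywords = frequent_words_by_practice(labeled, top_k=2)
    new_texts = ["We sort paper every week", "Nice weather today"]
    for text, practices in collect_candidates(new_texts, keywords):
        print(practices, text)
```

Counting each word once per text keeps a single long post from dominating the keyword list of a rare practice type, which matters when the goal is to reduce class imbalance.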

Keywords: text classification, social network analysis, machine learning, BERT, green practices, natural language processing.

UDC: 004.912

MSC: 68T50

Received: 06.10.2022
Revised: 11.11.2022
Accepted: 16.11.2022

DOI: 10.18255/1818-1015-2022-4-316-332


