Abstract:
This paper addresses the task of detecting mentions of green practices in social media texts. The task is motivated by the need to expand existing knowledge about how green practices are used in society and how they spread. The study uses a corpus of texts published in environmental communities of the VKontakte social network. The corpus carries expert annotations marking mentions of nine types of green practices. To reduce class imbalance in the corpus, a semi-automatic approach to collecting additional texts is proposed. The approach includes the following steps: detecting the most frequent words for each practice type; automatically collecting social media texts that contain the detected frequent words; and expert verification and filtering of the collected texts. Four machine learning models are compared on the task of detecting mentions of green practices using two variants of the corpus: the original one and the one augmented with the proposed approach. Among these models, the highest averaged F1-score (81.32%) was achieved by Conversational RuBERT fine-tuned on the augmented corpus. The Conversational RuBERT model was chosen for the implementation of the application prototype. The main function of the prototype is to detect whether a text mentions any of the nine types of green practices. The prototype is implemented as a Telegram chatbot.
Keywords: text classification, social network analysis, machine learning, BERT, green practices, natural language processing.
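
The first two steps of the semi-automatic collection approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the label names (`waste_sorting`, `repair`), the tiny stop-word list, and the English toy texts are all assumptions made for illustration. The actual corpus is in Russian, where lemmatization would probably be needed before counting words, and the third step (expert verification) remains manual.

```python
from collections import Counter
import re

# Hypothetical stop-word list; a real run would use a full list for the corpus language.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "for"}


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequent_words(labeled_texts, top_k=3):
    """Return the top_k most frequent non-stop-words for each practice type.

    labeled_texts is an iterable of (text, practice_label) pairs.
    """
    counts = {}
    for text, label in labeled_texts:
        counter = counts.setdefault(label, Counter())
        counter.update(t for t in tokenize(text) if t not in STOP_WORDS)
    return {label: [w for w, _ in c.most_common(top_k)]
            for label, c in counts.items()}


def select_candidates(texts, keywords):
    """Keep texts containing at least one keyword; these would go to expert review."""
    keyword_set = set(keywords)
    return [t for t in texts if keyword_set & set(tokenize(t))]


# Toy annotated corpus (assumed data, not taken from the paper).
labeled = [
    ("Sorting waste and sorting glass", "waste_sorting"),
    ("Bring waste for sorting", "waste_sorting"),
    ("Repair old clothes", "repair"),
]
keywords = frequent_words(labeled, top_k=2)

# Unlabeled texts that would come from a social media search in practice.
pool = ["We are sorting plastic today", "Concert tonight"]
candidates = select_candidates(pool, keywords["waste_sorting"])
```

In this sketch, `candidates` holds only the texts that share a frequent word with the under-represented practice type; an expert would then confirm or reject each one before it is added to the corpus.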