
Computer Research and Modeling, 2023 Volume 15, Issue 1, Pages 185–195 (Mi crm1053)

ENGINEERING AND TELECOMMUNICATIONS

Development of and research on machine learning algorithms for solving the classification problem in Twitter publications

I. S. Makarov, E. R. Bagantsova, P. A. Iashi, M. D. Kovaleva, R. A. Gorbachev

Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, 141701, Russia

Abstract: Posts on social networks can predict the movement of the financial market and, in some cases, even determine its direction. Analysis of Twitter posts helps to predict cryptocurrency prices. The community has its own specific vocabulary: posts contain slang expressions and abbreviations, which make text data difficult to vectorize. For this reason, preprocessing methods such as Stanza lemmatization and regular expressions are considered. This paper describes simple machine learning models that can work despite problems such as a lack of data and a short prediction timeframe. The task is treated as binary classification, in which each word corresponds to one element of a binary vector representing a data unit. Basic words are selected by frequency analysis of word mentions. The labeling is based on Binance candlesticks with variable parameters, for a more accurate description of the price trend. The paper introduces metrics that reflect how words are distributed between the positive and negative classes. To solve the classification problem, we used a dense model with parameters selected by Keras Tuner, logistic regression, a random forest classifier, a naive Bayes classifier, which can work with a small sample (very important for our task), and the k-nearest neighbors method. The constructed models were compared by the accuracy of the predicted labels. During the investigation we found that the best approach is to use models that predict the price movements of a single coin. Our model deals with posts that mention the LUNA project, which no longer exists. This approach to binary classification of text data is widely used to predict the price of an asset and the trend of its movement, which is often used in automated trading.
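The vectorization and labeling steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the `min_mentions` cutoff, the relative-change `threshold`, and all function names are assumptions made here for illustration, and the sample posts are invented.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits, '$' and '#'.
TOKEN_RE = re.compile(r"[a-z0-9$#]+")


def tokenize(post):
    """Split a post into lowercase tokens."""
    return TOKEN_RE.findall(post.lower())


def build_vocabulary(posts, min_mentions=2):
    """Frequency analysis: keep words mentioned in at least min_mentions posts."""
    counts = Counter()
    for post in posts:
        counts.update(set(tokenize(post)))  # count each word once per post
    return sorted(word for word, n in counts.items() if n >= min_mentions)


def to_binary_vector(post, vocabulary):
    """One element per vocabulary word: 1 if the word occurs in the post, else 0."""
    present = set(tokenize(post))
    return [1 if word in present else 0 for word in vocabulary]


def label_from_candle(open_price, close_price, threshold=0.0):
    """Assumed labeling rule: 1 if the candle's relative price change exceeds threshold."""
    change = (close_price - open_price) / open_price
    return 1 if change > threshold else 0


# Invented sample posts, for illustration only.
posts = [
    "LUNA to the moon, hodl!",
    "selling my LUNA, rekt again",
    "hodl LUNA, buy the dip",
]
vocab = build_vocabulary(posts)
vectors = [to_binary_vector(post, vocab) for post in posts]
```

Vectors built this way, paired with candlestick-derived labels, could then be passed to any of the classifiers listed in the abstract. The threshold here stands in for the "variable parameters" of the labeling, whose actual definition is not given in the abstract.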

Keywords: cryptocurrency, Twitter, machine learning, natural language processing, vectorization, dense model, logistic regression, random forest classifier, KNN, naive Bayes classifier.

UDC: 519.8

Received: 01.11.2022
Accepted: 23.12.2022

Language: English

DOI: 10.20537/2076-7633-2023-15-1-185-195


