Abstract:
Posts on social networks can both predict the movement of the financial market and, in some cases, even determine
its direction. The analysis of posts on Twitter contributes to the prediction of cryptocurrency prices. The specificity of the
community is reflected in its special vocabulary: posts contain slang expressions and abbreviations, which make text data
difficult to vectorize. For this reason, preprocessing methods such as Stanza lemmatization and regular expressions
are considered. This paper describes simple machine learning models that may work despite problems such as
a lack of data and a short prediction timeframe. In solving the binary classification problem, each word is treated as
an element of the binary vector that represents a data unit. The base words are selected by frequency analysis
of word mentions. The labeling is based on Binance candlesticks with variable parameters for
a more accurate description of the trend of price changes. The paper introduces metrics that reflect how words are
distributed between the positive and negative classes. To solve the classification problem, we used a dense neural network
with hyperparameters selected by Keras Tuner, logistic regression, a random forest classifier, the k-nearest neighbors
method, and a naive Bayes classifier, which can work with a small sample, an important property for our task. The constructed
models were compared by the accuracy of the predicted labels. During the investigation we found that the
best approach is to use models that predict the price movements of a single coin. Our model deals with posts that mention
the LUNA project, which no longer exists. This approach to binary classification of text data is widely used to predict the
price of an asset or the trend of its movement, and such predictions are often used in automated trading.
Keywords: cryptocurrency, Twitter, machine learning, natural language processing, vectorization,
dense model, logistic regression, random forest classifier, KNN, naive Bayes classifier.
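
The binary vectorization summarized in the abstract can be illustrated with a minimal sketch. It assumes a simple regular-expression tokenizer in place of Stanza lemmatization, and the function names (`tokenize`, `build_vocabulary`, `vectorize`), the sample posts, and the `top_k` parameter are hypothetical, chosen only for illustration; they are not the implementation used in the paper.

```python
import re
from collections import Counter


def tokenize(post):
    """Lowercase a post and keep alphabetic tokens only (an assumed, simplified preprocessing step)."""
    return re.findall(r"[a-z]+", post.lower())


def build_vocabulary(posts, top_k):
    """Select the top_k base words by the number of posts that mention them."""
    counts = Counter()
    for post in posts:
        # Count each word once per post, so the count is a mention frequency.
        counts.update(set(tokenize(post)))
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def vectorize(post, vocabulary):
    """Return a binary vector: 1 if the base word occurs in the post, else 0."""
    tokens = set(tokenize(post))
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical sample posts, not taken from the paper's dataset.
posts = [
    "LUNA to the moon, hodl!",
    "Selling my LUNA, this dump hurts",
    "hodl LUNA, moon soon",
]
vocab = build_vocabulary(posts, top_k=3)
vectors = [vectorize(p, vocab) for p in posts]
```

Each resulting vector could then be paired with a label derived from candlestick data and passed to any of the listed classifiers.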