Abstract:
The possibility of forecasting customer churn from the data
of a Russian ISP is considered. The basic stages and approaches to the preliminary
processing of the texts of operators’ comments are determined. The proposed
classification algorithms are logistic regression, the $k$-nearest neighbors method,
gradient boosting, and the naive Bayes algorithm. As a sample, an input data array with 23
features
for 380 000 subscribers was formed. Typos are corrected using the Damerau-Levenshtein
distance and the textual information is lemmatized; the texts are then converted into a feature
vector using the TF-IDF method and added to the model. The main approaches to
encoding categorical features are determined. The forecast models are constructed, the
results obtained with the different classifiers are compared, and conclusions are drawn.
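
The typo-correction step mentioned above can be illustrated with a minimal sketch. It assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, since the abstract does not say which variant is used. The function names, the example vocabulary, and the `max_distance` threshold are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def correct_typo(word: str, vocabulary: Sequence[str],
                 max_distance: int = 1) -> str:
    """Return the closest vocabulary word, or the word itself if none
    lies within max_distance (a hypothetical threshold)."""
    if not vocabulary:
        return word
    best = min(vocabulary, key=lambda v: osa_distance(word, v))
    return best if osa_distance(word, best) <= max_distance else word


# Hypothetical vocabulary of known terms from operators' comments.
vocabulary = ["internet", "router", "tariff"]
corrected = [correct_typo(w, vocabulary) for w in ["inetrnet", "tarfif", "modem"]]
```

In this sketch a word is replaced only when a vocabulary entry lies within the threshold, so unknown but correctly spelled words such as "modem" pass through unchanged; the corrected tokens would then go on to lemmatization and TF-IDF vectorization.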