Abstract:
This paper investigates the problem of data imbalance. It describes the types of imbalance and the class of problems they cause during the training of machine learning models, reviews machine learning models with varying degrees of sensitivity to imbalanced data, and outlines the groups of methods used to balance classes in a training set. Among methods for synthetic generation of minority class data, an algorithm for data synthesis based on CLIQUE subspace clustering is considered. A modified version of the algorithm is proposed that uses a genetic algorithm to determine the optimal values of the CLIQUE parameters. This approach automates parameter tuning and improves the algorithm's performance under data imbalance. A study is conducted showing that the effectiveness of minority class data generation methods varies with the type of imbalance and the selected machine learning model. The results confirm the importance of accounting for the subspace structure of the data when synthesizing new examples for classification problems with imbalanced samples.
Keywords: machine learning, training set, training and retraining, dominant class, minority class, resampling methods.