N. Z. Valishina, S. A. Ilyuhin, A. V. Sheshkus, V. L. Arlazarov, “Automatic training data filtering for errors removing and improving the quality of the final neural network”, Informatsionnye Tekhnologii i Vychislitel'nye Sistemy, 2022, Issue 3, Pages 35

INTELLIGENT SYSTEMS AND TECHNOLOGIES

Automatic training data filtering for errors removing and improving the quality of the final neural network

N. Z. Valishina^ab, S. A. Ilyuhin^bc, A. V. Sheshkus^bcd, V. L. Arlazarov

^a Lomonosov Moscow State University, Prosp. 60-letiya Oktyabrya, 9, Moscow, 117312, Russia
^b Smart Engines Service LLC
^c Moscow Institute of Physics and Technology (State University), Prosp. 60-letiya Oktyabrya, 9, Moscow, 117312, Russia
^d Federal Research Center "Computer Science and Control" of RAS, Prosp. 60-letiya Oktyabrya, 9, Moscow, 117312, Russia

Abstract: Real-world data are often dirty, which in most cases reduces the accuracy of a model trained on them. Supervised data correction is expensive and time-consuming, so one way to address the problem is to automate the cleaning process. In this paper, we consider automatic cleaning of training data as a preprocessing technique for improving the quality of the trained network. The proposed iterative method is based on the assumption that polluted samples are most likely located farther from the median of their class than clean samples. It detects the noisy samples and then removes them from the training set. Experiments on a generated synthetic dataset show that the method cleans the data even at high levels of pollution and significantly improves the quality of the classifier.
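The idea of iteratively dropping samples that lie far from their class median can be illustrated with a minimal sketch. The abstract does not state the feature space, the distance, the removal threshold, or the stopping rule, so everything below is an assumption made for illustration: Euclidean distance on fixed feature vectors (for example, embeddings from some network), a cutoff of median distance plus `k` times the median absolute deviation, and a fixed number of passes. The function name `filter_by_class_median` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def filter_by_class_median(features, labels, k=3.0, n_iters=3):
    """Return a boolean mask marking the samples kept after filtering.

    Assumed rule (not from the paper): within each class, drop samples whose
    Euclidean distance to the coordinate-wise class median exceeds
    median(distance) + k * MAD(distance); repeat for n_iters passes.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    for _ in range(n_iters):
        for cls in np.unique(labels):
            idx = np.flatnonzero(keep & (labels == cls))
            if len(idx) < 3:
                continue  # too few samples to estimate a median robustly
            center = np.median(features[idx], axis=0)
            dist = np.linalg.norm(features[idx] - center, axis=1)
            med = np.median(dist)
            mad = np.median(np.abs(dist - med))
            keep[idx[dist > med + k * mad]] = False
    return keep


# Toy data: two tight 2-D clusters, plus four samples carrying the wrong label.
grid = np.array([(0.1 * i, 0.1 * j) for i in range(5) for j in range(4)])
features = np.vstack([
    grid,                          # class 0, near the origin
    grid + 10.0,                   # class 1, near (10, 10)
    [[10.0, 10.0], [10.1, 10.0]],  # labelled 0 but located in cluster 1
    [[0.0, 0.0], [0.1, 0.0]],      # labelled 1 but located in cluster 0
])
labels = np.array([0] * 20 + [1] * 20 + [0, 0] + [1, 1])

keep = filter_by_class_median(features, labels)
print("kept", int(keep.sum()), "of", len(labels), "samples")
```

The median is used as the class center because, unlike the mean, it is barely shifted by a small share of mislabeled samples, which is what makes distance from it a usable noise signal. Recomputing the center after each pass lets the estimate tighten as noisy samples are removed.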

Keywords: data cleaning, outlier(s) detection, mislabels, classifier, siamese neural network.

Language: English

DOI: 10.14357/20718632220304