Informatika i Ee Primeneniya [Informatics and its Applications]

Inform. Primen., 2017 Volume 11, Issue 3, Pages 60–72 (Mi ia486)

Improving classification quality for the task of finding intrinsic plagiarism

I. O. Molybog (a, b), A. P. Motrenko (a), V. V. Strijov (c)

a Moscow Institute of Physics and Technology, 9 Institutskiy Per., Dolgoprudny, Moscow Region 141700, Russian Federation
b Center for Energy Systems, Skolkovo Institute of Science and Technology, Skolkovo Innovation Center, 3 Nobel Str., Moscow 143026, Russian Federation
c A. A. Dorodnicyn Computing Center, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 40 Vavilov Str., Moscow 119333, Russian Federation

Abstract: The paper addresses the classification problem in multidimensional spaces. The authors propose a supervised modification of the t-distributed stochastic neighbor embedding (t-SNE) algorithm. Unlike the original algorithm, the proposed modification does not require retraining when new data are added to the training set, and it can be easily parallelized. The method was applied to detect intrinsic plagiarism in a collection of documents. The authors also tested the algorithm on synthetic data and showed that classification quality is higher with the proposed method than without dimension reduction or with other dimension reduction algorithms.

Keywords: data analysis; dimension reduction; nonlinear dimension reduction; manifold learning; intrinsic plagiarism detection.

Received: 20.02.2017

DOI: 10.14357/19922264170307


