A. L. Tkachenko, L. A. Denisova, “Automatic classification of documents in the university electronic document management system”, Informatsionnye Tekhnologii i Vychislitel'nye Sistemy, 2023, Issue 1, Pages 3

DATA PROCESSING AND ANALYSIS

Automatic classification of documents in the university electronic document management system

A. L. Tkachenko^a, L. A. Denisova^b

^a Federal State Public Educational Establishment of Higher Professional Training Moscow University of the Ministry of the Interior of the Russian Federation named after V.Y. Kikot, 12, Academician Volgin street, Moscow, 117437
^b Omsk State Technical University, 11, Mira prospekt, Omsk, 644050

Abstract: The paper addresses the automatic classification of text documents in a university electronic document management system. A two-stage classification method based on machine learning and a numerical representation of documents is presented. The first stage reduces the size of the collection by screening out documents that do not belong to any accepted class, judged by each document's probability of novelty. The second stage selects the documents with the highest occurrence frequencies of words characteristic of the accepted classes (the formation of support vectors). A document is then assigned the class to which most of its nearest documents belong under the accepted distance metric. A set of programs for text document classification has been implemented as the basis for the information support of the university electronic document management system, and studies confirming the effectiveness of the proposed method have been carried out.

Keywords: document classification, the novelty of text documents, probabilistic thematic model, support vector machine, $k$-nearest neighbors.

DOI: 10.14357/20718632230101
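
The two-stage procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The novelty score here is simply the share of words unseen in the training collection, standing in for the probabilistic thematic model named in the keywords. The cosine distance, the thresholds, the toy documents, and all function names (`novelty`, `select_support`, `classify`) are assumptions made for illustration.

```python
from collections import Counter
import math


def tokenize(text):
    return text.lower().split()


def novelty(tokens, vocabulary):
    """Share of tokens never seen in the accepted classes.
    Assumption: a crude stand-in for a model-based novelty probability."""
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)


def cosine_distance(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def select_support(train, per_class, top_n=5):
    """For each class, keep the documents richest in that class's
    most frequent words (the 'support vectors' of this sketch)."""
    by_class = {}
    for text, label in train:
        by_class.setdefault(label, []).append(Counter(tokenize(text)))
    support = []
    for label, docs in by_class.items():
        class_counts = Counter()
        for doc in docs:
            class_counts.update(doc)
        top_words = {w for w, _ in class_counts.most_common(top_n)}

        def score(doc):
            return sum(doc[w] for w in top_words) / max(1, sum(doc.values()))

        ranked = sorted(docs, key=score, reverse=True)
        support.extend((doc, label) for doc in ranked[:per_class])
    return support


def classify(text, train, k=3, per_class=2, novelty_threshold=0.5):
    tokens = tokenize(text)
    vocabulary = {t for doc, _ in train for t in tokenize(doc)}
    # Stage 1: screen out documents that look novel (no accepted class).
    if novelty(tokens, vocabulary) > novelty_threshold:
        return None
    # Stage 2: majority vote among the k nearest support documents.
    support = select_support(train, per_class)
    query = Counter(tokens)
    nearest = sorted(support, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy collection with two accepted classes.
TRAIN = [
    ("order on student enrollment admission", "order"),
    ("order on student expulsion enrollment", "order"),
    ("order on staff appointment", "order"),
    ("report on research grant results", "report"),
    ("report on laboratory research results", "report"),
    ("report on annual budget", "report"),
]
```

Calling `classify("order on enrollment of a new student", TRAIN)` returns a class label, while a document made up entirely of unseen words is rejected at the first stage and returns `None`. A practical system would replace the word-overlap novelty score with a trained model and tune `k`, `per_class`, and `novelty_threshold` on held-out documents.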