Abstract:
This paper considers the automatic classification of university text documents in an electronic document management system. A two-stage classification method based on machine learning and a numerical representation of documents is presented. In the first stage, the collection is reduced by screening out documents that do not belong to the accepted classes, based on their probability of novelty. In the second stage, the documents with the highest occurrence frequencies of words characteristic of the accepted classes are selected (forming the support vectors). A document is then assigned the class to which most of its closest documents belong under the accepted distance metric. A set of programs for text document classification has been implemented as the basis for the information support of the university's electronic document management system, and studies confirming the effectiveness of the proposed method have been carried out.
Keywords: document classification, novelty of text documents, probabilistic topic model, support vector machine, $k$-nearest neighbors.
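
The two-stage scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the novelty model, the selection rule, or the distance metric, so the sketch assumes (hypothetically) that novelty is the share of words unseen in the accepted classes, that reference documents are ranked by the average class frequency of their words, and that cosine distance over word counts is used. All function names, the threshold, and the sample documents are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def novelty_score(tokens, vocabulary):
    # Assumed stand-in for the "probability of novelty":
    # share of tokens never seen in documents of the accepted classes.
    if not tokens:
        return 1.0
    return sum(t not in vocabulary for t in tokens) / len(tokens)


def cosine_distance(a, b):
    # Assumed distance metric over bag-of-words count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def fit(train, per_class=2):
    # Build the vocabulary of accepted classes and, per class, keep the
    # documents richest in that class's frequent words (assumed analogue
    # of the "support vectors" mentioned in the abstract).
    vocabulary, by_class = set(), {}
    for text, label in train:
        tokens = tokenize(text)
        if tokens:
            vocabulary.update(tokens)
            by_class.setdefault(label, []).append(tokens)
    references = []
    for label, docs in by_class.items():
        class_tf = Counter(t for tokens in docs for t in tokens)
        ranked = sorted(
            docs,
            key=lambda tokens: sum(class_tf[t] for t in tokens) / len(tokens),
            reverse=True,
        )
        references.extend((Counter(tokens), label) for tokens in ranked[:per_class])
    return vocabulary, references


def classify(text, vocabulary, references, k=3, novelty_threshold=0.5):
    tokens = tokenize(text)
    # Stage 1: screen out documents that look novel (outside accepted classes).
    if novelty_score(tokens, vocabulary) > novelty_threshold:
        return None
    # Stage 2: majority vote among the k nearest reference documents.
    vector = Counter(tokens)
    nearest = sorted(references, key=lambda ref: cosine_distance(vector, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Illustrative data only; not taken from the paper.
TRAIN = [
    ("rector order on student enrollment", "order"),
    ("order on staff appointment signed by rector", "order"),
    ("order on student expulsion", "order"),
    ("lecture schedule for the autumn semester", "schedule"),
    ("exam schedule for the winter semester", "schedule"),
    ("schedule of lecture rooms for the semester", "schedule"),
]
VOCABULARY, REFERENCES = fit(TRAIN)

if __name__ == "__main__":
    for doc in ("order on student transfer", "cafeteria menu with vegetarian soup"):
        print(doc, "->", classify(doc, VOCABULARY, REFERENCES))
```

In this sketch a document rejected at the first stage is returned as `None` rather than forced into one of the accepted classes, so only the remaining documents reach the nearest-neighbor vote.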