Abstract:
The paper proposes an approach to the classification of text documents using a probabilistic topic model, with a training set of documents represented by instances of one class. The proposed approach allows selecting positive instances similar to a given class from collections and text document flows. The models learned on instances of one class, solving problems of classification in application to text documents are considered, the key features of such models are indicated. The classification model Positive Example Based Learning-TM is presented and a software prototype is developed, which realizes the classification of text documents based on it. The developed model demonstrates high classification accuracy, which exceeds the alternative approaches. The proposed model as well as existing models was evaluated based on the SCTM-ru text corpora. Experimentally proved the superiority of Positive Example Based Learning-TM by the criterion of classification accuracy with a small size of training set.
Keywords:classification, binary classification, topic model, natural language processing.