
Computer Research and Modeling, 2020 Volume 12, Issue 6, Pages 1515–1528 (Mi crm863)


MODELS OF ECONOMIC AND SOCIAL SYSTEMS

Additive regularization of topic models with fast text vectorization

I. A. Irkhin, V. G. Bulatov, K. V. Vorontsov

Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow oblast, 141701, Russia

Abstract: The probabilistic topic model of a text document collection finds two matrices: a matrix of conditional probabilities of topics in documents and a matrix of conditional probabilities of words in topics. Each document is represented by a multiset of words, also called a “bag of words”, thus assuming that the order of words is not important for revealing the latent topics of the document. Under this assumption, the problem reduces to a low-rank non-negative matrix factorization governed by likelihood maximization. In general, this problem is ill-posed, having an infinite set of solutions. In order to regularize the solution, a weighted sum of optimization criteria is added to the log-likelihood. When modeling large text collections, storing the first matrix is impractical, since its size is proportional to the number of documents in the collection. At the same time, the topical vector representation (embedding) of documents is necessary for solving many text analysis tasks, such as information retrieval, clustering, classification, and summarization of texts. In practice, the topical embedding is calculated for a document “on-the-fly”, which may require dozens of iterations over all the words of the document. In this paper, we propose a way to calculate a topical embedding quickly, in one pass over the document's words. For this, an additional constraint is introduced into the model in the form of an equation, which calculates the first matrix from the second one in linear time. Although formally this constraint is not an optimization criterion, in fact it plays the role of a regularizer and can be used in combination with other regularizers within the additive regularization framework ARTM. Experiments on three text collections have shown that the proposed method improves the model in terms of the sparseness, difference, logLift, and coherence measures of topic quality. The open-source libraries BigARTM and TopicNet were used for the experiments.
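To illustrate the contrast the abstract draws between iterative “on-the-fly” inference and a one-pass embedding, here is a minimal NumPy sketch. The standard route runs dozens of EM iterations over the document's words to estimate the topic distribution from the word-in-topic matrix `phi`; the one-pass variant aggregates per-word topic distributions in a single sweep. Note this sketch is an assumption for illustration only: the function names, the uniform-prior Bayes inversion in `theta_one_pass`, and the iteration count are not the authors' exact formulation, which is defined in the paper itself.

```python
import numpy as np

def theta_em(phi, n_dw, iters=20):
    """Iterative 'on-the-fly' inference of p(topic | document).

    phi  : (W, T) matrix of p(word | topic)
    n_dw : (W,) word counts for one document
    """
    W, T = phi.shape
    theta = np.full(T, 1.0 / T)  # start from a uniform topic distribution
    for _ in range(iters):
        # E-step: p(t | d, w) is proportional to phi[w, t] * theta[t]
        p = phi * theta                                  # (W, T)
        p /= p.sum(axis=1, keepdims=True) + 1e-12
        # M-step: theta[t] is proportional to sum_w n_dw[w] * p(t | d, w)
        theta = n_dw @ p
        theta /= theta.sum()
    return theta

def theta_one_pass(phi, n_dw):
    """Illustrative one-pass embedding: a single sweep, no iteration.

    Inverts phi into p(t | w) under a uniform topic prior (an assumed
    simplification), then averages over the document's word counts.
    """
    p_t_w = phi / (phi.sum(axis=1, keepdims=True) + 1e-12)  # (W, T)
    theta = n_dw @ p_t_w
    return theta / theta.sum()
```

Both functions return a valid probability vector over topics; the one-pass version costs a single matrix-vector product per document, which is what makes storing or re-deriving the document-topic matrix unnecessary.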

Keywords: natural language processing, unsupervised learning, topic modeling, additive regularization of topic model, EM-algorithm, PLSA, LDA, ARTM, BigARTM, TopicNet.

UDC: 004.852, 519.853

Received: 21.09.2020
Revised: 01.10.2020
Accepted: 05.10.2020

DOI: 10.20537/2076-7633-2020-12-6-1515-1528



© Steklov Math. Inst. of RAS, 2024