V. Y. Korolev, A. Y. Korchagin, I. V. Mashechkin, M. I. Petrovskiy, D. V. Tsarev, “Applying time series to the task of background user identification based on their text data analysis”, Proceedings of ISP RAS, 2015, Volume 27, Issue 1, Pages 151

This article is cited in 3 papers

Applying time series to the task of background user identification based on their text data analysis

V. Y. Korolev, A. Y. Korchagin, I. V. Mashechkin, M. I. Petrovskiy, D. V. Tsarev

Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University

Abstract: The paper presents a novel approach to user identification based on behavioral analysis of how users work with text information. User behavior is described by the content of the text documents the user works with. This behavioral information is given a structured representation by mapping each document’s text content into the user’s topic space, which is constructed by nonnegative matrix factorization. The topic weights of a document characterize the user’s topic trend during the time spent working with that document. The variation of the topic weight values over time forms a multidimensional time series that describes the history of the user’s behavior when working with text data. Forecasting this time series makes it possible to identify the user from the estimated deviation of the observed topic trend from the predicted topic weight values. The paper also presents a new time series forecasting method based on orthogonal nonnegative matrix factorization (ONMF), which is used within the proposed user identification approach. Notably, nonnegative matrix factorization methods had not previously been used for time series forecasting. The proposed user identification approach was experimentally verified on real corporate email correspondence drawn from the Enron dataset. In addition, experiments with other currently popular forecasting methods showed that the proposed forecasting method is superior in the quality of classification of the user’s topic weights. We also investigated two approaches to estimating the deviation of a time series point from its predicted value: absolute deviation and p-value estimation. The experiments showed that both approaches are applicable within the proposed user identification approach.
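The two deviation estimates named in the abstract can be illustrated with a minimal sketch. It makes several assumptions that do not come from the paper: a moving-average forecaster stands in for the ONMF-based forecaster (which is not reproduced here), the absolute deviation is taken as the sum of absolute differences over topics, and the p-value is an empirical one, computed as the share of past one-step forecast errors at least as large as the observed error. All function names and the toy topic weights are hypothetical.

```python
from statistics import fmean


def forecast_mean(history, window=3):
    """Placeholder forecaster (assumption): mean of the last `window` vectors.

    Stands in for the paper's ONMF-based forecaster, which is not shown here.
    """
    recent = history[-window:]
    n_topics = len(recent[0])
    return [fmean(vec[k] for vec in recent) for k in range(n_topics)]


def absolute_deviation(observed, predicted):
    """Sum of absolute differences between observed and predicted topic weights."""
    return sum(abs(o - p) for o, p in zip(observed, predicted))


def empirical_p_value(deviation, past_deviations):
    """Share of past deviations >= the observed one, with add-one smoothing."""
    at_least = sum(1 for d in past_deviations if d >= deviation)
    return (at_least + 1) / (len(past_deviations) + 1)


def score_document(history, observed, window=3):
    """Return (absolute deviation, empirical p-value) for a new document."""
    # One-step-ahead forecast errors over the user's own history.
    past = [
        absolute_deviation(history[t], forecast_mean(history[:t], window))
        for t in range(window, len(history))
    ]
    dev = absolute_deviation(observed, forecast_mean(history, window))
    return dev, empirical_p_value(dev, past)


# Toy history: topic-weight vectors (3 topics) of one user's past documents.
history = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
]
typical = [0.65, 0.25, 0.10]   # resembles the user's usual topic mix
unusual = [0.05, 0.05, 0.90]   # dominated by a topic the user rarely touches

dev_typical, p_typical = score_document(history, typical)
dev_unusual, p_unusual = score_document(history, unusual)
print(dev_typical, p_typical)
print(dev_unusual, p_unusual)
```

In this sketch a large absolute deviation or a small p-value suggests that the document does not follow the user's forecast topic trend. Any decision threshold would have to be chosen empirically and is not taken from the paper.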

Keywords: computer security, user identification, topic modeling, orthogonal nonnegative matrix factorization, time series forecasting.

DOI: 10.15514/ISPRAS-2015-27(1)-8