Computer Research and Modeling, 2021 Volume 13, Issue 6, Pages 1291–1315 (Mi crm949)

This article is cited in 4 papers

MODELS OF ECONOMIC AND SOCIAL SYSTEMS

Extracting knowledge from text messages: overview and state-of-the-art

A. A. Musaev^a, D. A. Grigoriev^b

a St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS), 39 Linia 14-th, VO, St. Petersburg, 199178, Russia
b Saint-Petersburg State University (SPBU), 7/9 Universitetskaya Emb., St Petersburg 199034, Russia

Abstract: The information explosion problem can, in general, be delegated to systems for the automatic processing of digital data. These systems are intended to recognize, sort, and meaningfully process data and to present it in formats that humans can read and interpret. Intelligent knowledge extraction systems that handle unstructured data would be a natural solution in this area. However, the evident progress on these tasks for structured data contrasts with the limited success of unstructured data processing and, in particular, document processing. This research area is currently under active development. The present paper is a systematic survey of both Russian and international publications dedicated to the leading trend in automatic text data processing: Text Mining (TM). We cover the main tasks and notions of TM, as well as its place in the current AI landscape. Furthermore, we analyze the complications that arise when processing natural language (NL) texts, which are weakly structured and often carry ambiguous linguistic information. We describe the stages of text data preparation, cleaning, and feature selection which, alongside the data obtained via morphological, syntactic, and semantic analysis, constitute the input to the TM process. This process can be represented as a mapping from a set of text documents to "knowledge". Examples of such mappings include methods of Information Retrieval (IR), text summarization, sentiment analysis, and document classification and clustering. Using the case of stock trading, we demonstrate how to formalize the problem of making a trade decision based on a set of analytical recommendations. What all TM tasks and techniques have in common is the selection of word forms and their derivatives, which are used to recognize content in NL symbol sequences.
Taking IR as an example, we examine the classic types of search: by word form, phrase, pattern, and concept. Additionally, we consider the augmentation of patterns with syntactic and semantic information. Next, we provide a general description of the instruments of natural language processing (NLP): morphological, syntactic, semantic, and pragmatic analysis. Finally, we conclude the paper with a comparative analysis of modern TM tools, which can help in selecting a suitable TM platform based on the user's needs and skills.

Keywords: text mining, information extraction, natural language processing, machine learning, semantic annotations.

UDC: 519.254

Received: 20.04.2021
Revised: 24.10.2021
Accepted: 26.10.2021

DOI: 10.20537/2076-7633-2021-13-6-1291-1315


