
Probl. Upr., 2019 Issue 2, Pages 41–46 (Mi pu1129)

Information technologies in control

A system for the automatic processing of thematically oriented texts with a dictionary of terms in the form of regular expressions

V. S. Sukhoverov

V.A. Trapeznikov Institute of Control Sciences of RAS, Moscow

Abstract: An automatic text processing system is developed that determines the subject of a text from the terminology it uses, as matched against a dictionary of terms. The use of regular expressions in the domain-specific dictionaries of natural-language text analysis programs is proposed and justified. The relationship between regular expressions and finite automata, established through regular sets, is described. A quantitative measure of a text's thematic focus, the document profile, is proposed; it is calculated from the results of the term search. The system is implemented as a software package with a version of the dictionary for a selected subject area: control theory and its applications. It was tested on the archive of the journal Automation and Remote Control, and thematic-focus profiles were obtained for articles from various sections of the journal. Directions for further development of the system are indicated.
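
The idea of a term dictionary written as regular expressions, with a document profile computed from the search results, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the dictionary contents, the topic names, the function name `document_profile`, and the choice to define the profile as each topic's share of all term matches. The abstract does not state the actual formula.

```python
import re
from collections import Counter

# Hypothetical dictionary: topic -> regular expressions describing its terms.
# Each pattern covers several surface forms of one term (plural, spelling variants).
TERM_DICTIONARY = {
    "control theory": [
        r"\bfeedback\s+control(?:s|lers?)?\b",
        r"\bstabili[sz]ation\b",
        r"\boptimal\s+control\b",
    ],
    "text processing": [
        r"\bregular\s+expressions?\b",
        r"\bfinite(?:[-\s]state)?\s+(?:automat(?:on|a)|machines?)\b",
    ],
}


def document_profile(text, dictionary=TERM_DICTIONARY):
    """Return each topic's share of all term matches found in the text.

    Assumption: the profile is a normalized match count per topic.
    """
    counts = Counter()
    for topic, patterns in dictionary.items():
        for pattern in patterns:
            # Count non-overlapping, case-insensitive matches of this term.
            matches = re.finditer(pattern, text, flags=re.IGNORECASE)
            counts[topic] += sum(1 for _ in matches)
    total = sum(counts.values())
    if total == 0:
        # No dictionary term occurs in the text: the profile is all zeros.
        return {topic: 0.0 for topic in dictionary}
    return {topic: counts[topic] / total for topic in dictionary}


sample = (
    "Optimal control and feedback controllers rely on stabilization. "
    "A regular expression defines a regular set accepted by a finite automaton."
)
profile = document_profile(sample)
```

For the sample text, three matches fall under the hypothetical "control theory" topic and two under "text processing", so the profile assigns them shares of 3/5 and 2/5. Writing one pattern per term keeps the dictionary compact, since a single expression stands in for every inflected or variant form it describes.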

Keywords: term, domain dictionary, regular expression, finite state machine, document profile, software package.

UDC: 004.912

Received: 27.09.2018
Revised: 22.10.2018
Accepted: 12.12.2018

DOI: 10.25728/pu.2019.2.5


