
Sistemy i Sredstva Inform., 2022 Volume 32, Issue 4, Pages 59–68 (Mi ssi856)

Tokenization based on the method of functional patterns

Yu. V. Nikitin (a), A. A. Khoroshilov (b, a, c), A. E. Makarova (d)

a Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, 44-2 Vavilov Str., Moscow 119333, Russian Federation
b Moscow State Aviation Institute (National Research University), 4 Volokolamskoe Shosse, Moscow 125933, Russian Federation
c 27th Central Research Institute of the Ministry of Defence of the Russian Federation, 5, 1st Khoroshevsky Passage, Moscow 123007, Russian Federation
d Scientific Industrial Joint Stock Company "High Technology and Strategic Systems," 27-9 Elektrozavodskaya Str., Moscow 107023, Russian Federation

Abstract: The article proposes a new text tokenization method based on generalized functional templates. The method classifies Unicode characters by their role in forming text elements and builds compound patterns from the resulting generalized character classes; it does not rely on the widely used regular expressions. A distinctive feature of the method is that a sequence of characters can be used as part of an interval template. Its strengths include successful tokenization of complex information objects (numbers, geographic coordinates, names of engineering product items, etc.), detailed classification of tokens at the stage of their formation, the ability to switch tokenization of a given token type on or off, and the ability to add new templates from sample text to further train the system.
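
The general idea of tokenizing by character classes and class-sequence templates can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four coarse classes (L, D, S, P), the template names (DECIMAL, CODE, WORD, INTEGER), and the functions `char_class`, `runs`, and `tokenize` are all assumptions introduced here for illustration, and the interval templates mentioned in the abstract are not modeled.

```python
import unicodedata

def char_class(ch):
    """Map a character to a coarse class (assumed scheme, not from the article)."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "L"  # letter
    if cat == "Nd":
        return "D"  # decimal digit
    if cat.startswith("Z") or ch in "\t\n\r":
        return "S"  # separator / whitespace
    return "P"      # punctuation and everything else

def runs(text):
    """Split text into maximal runs of one class; each 'P' character stays separate."""
    out = []
    for ch in text:
        cls = char_class(ch)
        if out and out[-1][0] == cls and cls != "P":
            out[-1] = (cls, out[-1][1] + ch)
        else:
            out.append((cls, ch))
    return out

# Hypothetical templates, tried in order. Each element is either
# ("class", X), matching any run of class X, or ("lit", s), matching the exact string s.
TEMPLATES = [
    ("DECIMAL", [("class", "D"), ("lit", "."), ("class", "D")]),
    ("CODE",    [("class", "L"), ("lit", "-"), ("class", "D")]),
    ("WORD",    [("class", "L")]),
    ("INTEGER", [("class", "D")]),
]

def element_matches(element, run):
    kind, value = element
    cls, chars = run
    return cls == value if kind == "class" else chars == value

def tokenize(text, enabled=None):
    """Return (type, text) pairs; `enabled` optionally restricts the active templates."""
    rs = runs(text)
    tokens = []
    i = 0
    while i < len(rs):
        if rs[i][0] == "S":  # whitespace only separates tokens
            i += 1
            continue
        for name, pattern in TEMPLATES:
            if enabled is not None and name not in enabled:
                continue
            window = rs[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                element_matches(e, r) for e, r in zip(pattern, window)
            ):
                tokens.append((name, "".join(chars for _, chars in window)))
                i += len(pattern)
                break
        else:
            # No active template matched: emit the single run as an untyped token.
            tokens.append(("OTHER", rs[i][1]))
            i += 1
    return tokens

print(tokenize("T-72 weighs 41.5 t."))
print(tokenize("41.5", enabled={"WORD", "INTEGER"}))
```

In this sketch each token receives its type at the moment it is formed, and passing `enabled` switches individual token types on or off; with DECIMAL disabled, "41.5" falls apart into two integers and a stray period.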

Keywords: tokenization, segmentation, graphematic analysis, computational linguistics, patterns, substitution, token.

Received: 15.09.2022

DOI: 10.14357/08696527220406


