
Tr. SPIIRAN, 2014, Issue 33, Pages 164–185 (Mi trspy728)

Automatic Extraction of Context Labels from the Russian Wiktionary

A. A. Krizhanovsky (a), A. V. Smirnov (a), V. M. Kruglov (b), N. B. Krizhanovskaya (c), I. S. Kipyatkova (a)

(a) Laboratory of Speech and Multimodal Interfaces, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS)
(b) Institute for Linguistic Studies of the Russian Academy of Sciences
(c) Institute of Applied Mathematical Research of the Karelian Research Centre of the Russian Academy of Sciences

Abstract: A methodology for extracting context labels from online dictionaries was developed. Following this methodology, experts constructed a mapping table that establishes a correspondence between the context labels of the Russian Wiktionary (385 labels) and those of the English Wiktionary (1001 labels). The result is a composite system of context labels (1096 labels) that includes the labels of both dictionaries. A parser that extracts context labels from the Russian Wiktionary was developed. A distinguishing feature of this parser is the large number of context labels known in advance (385 context labels for the Russian Wiktionary). The parser can also recognize and extract new context labels, abbreviations, and comments placed before the definition in Wiktionary articles. The parser was used to generate a machine-readable database of the Russian Wiktionary that includes context labels. The numerical parameters of context labels in the Russian Wiktionary were then evaluated. The developed program showed that the Russian Wiktionary contains (1) 133 000 definitions with context labels and comments and (2) one and a half thousand definitions with regional labels; (3) the number of definitions with labels was also calculated for each knowledge domain. This paper is an original contribution to computational lexicography, presenting for the first time an analysis of the numerical parameters of context labels in a large dictionary (500 000 entries).

Keywords: Computational Linguistics, Computational Lexicology, Russian Language.

UDC: 004.912


