Program Systems: Theory and Applications, 2019 Volume 10, Issue 4, Pages 181–199 (Mi ps358)

This article is cited in 3 papers

Artificial Intelligence, Intelligent Systems, Neural Networks

PaRuS — syntax annotated Russian corpus

N. A. Vlasova, I. V. Trofimov, Yu. P. Serdyuk, E. A. Suleymanova, I. N. Vozdvizhenskiy

Ailamazyan Program Systems Institute of Russian Academy of Sciences

Abstract: This article presents PaRuS (Parsed Russian Sentences), a new annotated corpus of the Russian language. The corpus contains over 2.5 billion tokens and is intended for computational linguistics tasks that involve machine learning methods. PaRuS is a collection of annotated sentences in literary Russian. The linguistic annotation includes morphological features in MULTEXT-East format and syntactic information in SynTagRus notation. We describe the methodology of corpus creation and PaRuS_pipe, a hybrid linguistic pipeline developed for sentence annotation. We also discuss the quality of the linguistic annotation in PaRuS and evaluate the PaRuS_pipe morphological analyzer according to the MorphoRuEval-2017 competition methodology.

Key words and phrases: computational linguistics, corpus linguistics, Russian, language corpus, markup, morphology, syntax.

UDC: 004.89:81'322.2
BBK: Ш111:З813

MSC: Primary 68T50; Secondary 91F20

Received: 19.11.2019
Accepted: 26.12.2019

DOI: 10.25209/2079-3316-2019-10-4-181-199


