Abstract:
In this article we present PaRuS (Parsed Russian Sentences), a new annotated
corpus of the Russian language. The corpus contains over 2.5 billion
tokens and is intended for computational linguistics tasks that involve machine learning
methods. PaRuS is a collection of annotated literary Russian sentences. Our
linguistic annotation includes morphological features in the MULTEXT-East format and
syntactic information in SynTagRus notation. We describe the methodology of
corpus creation and present PaRuS_pipe, a hybrid linguistic pipeline developed for
sentence annotation. We also discuss the quality of the linguistic annotation in PaRuS
and evaluate the PaRuS_pipe morphological analyzer according
to the MorphoRuEval-2017 competition methodology.
Key words and phrases: computational linguistics, corpus linguistics, Russian, language corpus,
markup, morphology, syntax.