Abstract:
This article presents a statistical analysis of lexical and $n$-gram models of the Russian language built from a corpus of news texts. A specialized corpus of political news articles of recent years has been created, reflecting a narrow domain of language use. Token and $n$-gram dictionaries are compiled, and their coverage and entropy values are computed. The original text corpus is also lemmatized, and the dictionary sizes are extrapolated.
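
As a rough illustration of the quantities named above, the following minimal sketch computes an $n$-gram dictionary, the coverage of its most frequent entries, and its Shannon entropy. It is not the authors' implementation: the function names (`ngram_counts`, `coverage`, `entropy_bits`), the definition of coverage as the share of corpus occurrences accounted for by the `k` most frequent entries, and the toy token list are all assumptions made for the example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def coverage(counts, k):
    """Share of all occurrences covered by the k most frequent entries.

    Assumed definition; the counts must be non-empty.
    """
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(k))
    return covered / total


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical distribution of entries."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy token stream standing in for a tokenized (or lemmatized) corpus.
tokens = ["a", "b", "a", "c", "a", "b"]

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(len(unigrams), len(bigrams))   # dictionary sizes
print(coverage(unigrams, 1))         # coverage of the single most frequent token
print(entropy_bits(unigrams))        # unigram entropy in bits
```

On a real corpus the same functions would be applied to the tokenized and the lemmatized text separately, so that the dictionary sizes, coverage, and entropy of the two versions can be compared.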