
Uchenye Zapiski Kazanskogo Universiteta. Seriya Fiziko-Matematicheskie Nauki, 2013 Volume 155, Book 4, Pages 16–23 (Mi uzku1237)

Verification of the Heaps law using the Google Books Ngram database

V. V. Bochkarev^a, E. Yu. Lerner^b, A. V. Shevlyakova^c

a Institute of Physics, Kazan (Volga Region) Federal University, Kazan, Russia
b Institute of Computer Mathematics and Information Technologies, Kazan (Volga Region) Federal University, Kazan, Russia
c Kazan (Volga Region) Federal University, Kazan, Russia

Abstract: This article verifies the Heaps empirical law for European languages using data from the Google Books Ngram corpus. It is shown that the Heaps law holds only for short texts and for texts drawn from short historical periods. The Heaps exponent decreases over time and varies significantly within characteristic intervals of 60–100 years. The relationship between the word frequency distribution and the expected dependence of the number of distinct words on the text size is analyzed in terms of a simple probability model of text generation. This model explains the observed decreasing trend of the Heaps exponent.
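The abstract does not spell out the probability model, so the following is only a minimal sketch under a stated assumption: words are drawn independently from a fixed Zipf-like frequency distribution over a finite vocabulary. Under that assumption the expected number of distinct words in a text of n words is the sum over words of 1 - (1 - p_i)^n, and a local Heaps exponent can be estimated as the log-log slope between two text sizes. The function names, the Zipf exponent, and the vocabulary size are illustrative choices, not values taken from the article.

```python
import math


def zipf_probabilities(vocab_size, exponent=1.0):
    """Word probabilities p_i proportional to rank^(-exponent) (assumed Zipf-like law)."""
    weights = [rank ** -exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct_words(probs, text_size):
    """Expected vocabulary of a text of `text_size` words drawn independently.

    Word i is absent from the text with probability (1 - p_i)^n,
    so E[V(n)] = sum_i (1 - (1 - p_i)^n).
    """
    return sum(1.0 - (1.0 - p) ** text_size for p in probs)


def local_heaps_exponent(probs, n1, n2):
    """Log-log slope of E[V(n)] between text sizes n1 < n2."""
    v1 = expected_distinct_words(probs, n1)
    v2 = expected_distinct_words(probs, n2)
    return math.log(v2 / v1) / math.log(n2 / n1)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    probs = zipf_probabilities(vocab_size=50_000, exponent=1.0)
    for n1, n2 in [(1_000, 10_000), (100_000, 1_000_000)]:
        beta = local_heaps_exponent(probs, n1, n2)
        print(f"text size {n1}..{n2}: local Heaps exponent ~ {beta:.3f}")
```

In this sketch the expected vocabulary curve is concave in the text size, so the estimated slope stays below 1; comparing slopes at different text sizes is one way to probe whether a single Heaps exponent describes the whole range.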

Keywords: Heaps law, Zipf law, text probability models, Google Books Ngram corpus.

UDC: 81.32+519.257+519.246.2

Received: 17.10.2013


