
Tr. SPIIRAN, 2011 Issue 19, Pages 146–158 (Mi trspy440)

N-gram smoothing based on modeling the expectation of n-gram occurrence

A. P. Zykov


Abstract: It is shown that the expected frequency of occurrence of an n-gram depends on the size of the training set and on the size of the dictionary built from that set. A method is proposed for smoothing an n-gram language model using the probabilities of lower-order n-grams. The approach is based on modeling the expectation function of the n-gram occurrence probability. Instead of discounting the maximum n-gram probability, we suggest enlarging the size of the training set by the expected number of unseen n-grams. To model the number of unseen n-grams, the expectation function of the n-gram frequency of occurrence is extrapolated to zero frequency. The expectation function is modeled through statistical analysis of word occurrences in texts.
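
The idea of enlarging the training set instead of discounting can be illustrated with a minimal Python sketch for bigrams. The abstract does not give the paper's expectation function, so everything specific here is an assumption: the function names `expected_unseen` and `bigram_prob` are hypothetical, the extrapolation to zero frequency is replaced by a simple geometric trend over the count-of-counts (n_0 ≈ n_1² / n_2), and the reserved mass is shared among unseen words in proportion to their unigram counts.

```python
from collections import Counter

def expected_unseen(counts):
    """Extrapolate the count-of-counts curve n_r to r = 0.

    Assumption: a geometric trend, n_0 ~ n_1**2 / n_2. This stands in for
    the paper's fitted expectation function, which the abstract does not give.
    """
    n = Counter(counts)          # n[r] = number of n-grams seen exactly r times
    n1, n2 = n[1], n[2]
    return n1 * n1 / n2 if n1 and n2 else 0.0

def bigram_prob(prev, word, bigrams, unigrams):
    """P(word | prev) with the context total enlarged by the unseen estimate."""
    followers = {w: c for (p, w), c in bigrams.items() if p == prev}
    if not followers:
        # Unknown context: fall back to the lower-order model alone.
        return unigrams[word] / sum(unigrams.values())
    # Unigram mass of words never seen after `prev`.
    unseen_uni = sum(c for w, c in unigrams.items() if w not in followers)
    unseen = expected_unseen(followers.values()) if unseen_uni else 0.0
    enlarged = sum(followers.values()) + unseen   # enlarged training-set size
    if word in followers:
        return followers[word] / enlarged         # no explicit discounting
    if unseen_uni == 0:
        return 0.0
    # Unseen bigram: share the reserved mass by lower-order probability.
    return (unseen / enlarged) * unigrams[word] / unseen_uni

# Toy corpus, invented for illustration only.
tokens = "a b a b a c a c a d a e f".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = sum(bigram_prob("a", w, bigrams, unigrams) for w in unigrams)
print(total)  # probabilities over the vocabulary should sum to about 1
```

Dividing seen counts by the enlarged total lowers every seen probability by the same factor, and the mass freed this way goes to unseen bigrams. In this sketch a context with no singletons or no doubletons reserves no mass at all, which a fitted expectation function would presumably avoid.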

Keywords: language model, smoothing techniques.

UDC: 519.766.4

Received: 05.07.2011
Accepted: 29.11.2011


