A. P. Zykov, “N-gram smoothing based on modeling of expectation of n-gram occurrence”, Tr. SPIIRAN, 2011, Issue 19, Pages 146

N-gram smoothing based on modeling of expectation of n-gram occurrence

A. P. Zykov

Abstract: It is shown that the expected frequency of occurrence of an n-gram depends on the size of the training set and on the size of the dictionary built from that set. A method is proposed for smoothing an n-gram language model with respect to the probabilities of lower-order n-grams. The approach is based on modeling the expectation function of the n-gram occurrence probability. Instead of discounting the maximum n-gram probability, we suggest enlarging the size of the training set by the expected number of unseen n-grams. To model the number of unseen n-grams, the expectation function of the n-gram frequency of occurrence is extrapolated to zero frequency. The expectation function itself is modeled by statistical analysis of word occurrences in texts.
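
The idea of enlarging the training set rather than discounting can be illustrated with a minimal sketch. It is not the author's model: the abstract does not give the form of the expectation function, so the sketch assumes, purely for illustration, that the number of n-grams seen r times decays geometrically in r, which gives the zero-frequency extrapolation N_0 = N_1^2 / N_2. The names `estimate_unseen` and `smoothed_probs` are hypothetical.

```python
from collections import Counter


def estimate_unseen(counts):
    """Extrapolate the frequency-of-frequencies curve to r = 0.

    Assumption (not from the paper): N_r is geometric in r, so
    N_0 / N_1 = N_1 / N_2, i.e. N_0 = N_1**2 / N_2.
    """
    freq_of_freq = Counter(counts.values())
    n1 = freq_of_freq.get(1, 0)
    n2 = freq_of_freq.get(2, 0)
    if n1 == 0 or n2 == 0:
        return 0.0  # not enough data to extrapolate
    return n1 * n1 / n2


def smoothed_probs(tokens, n=2):
    """Return (probabilities of seen n-grams, total mass left for unseen ones)."""
    ngrams = list(zip(*(tokens[i:] for i in range(n))))
    counts = Counter(ngrams)
    total = len(ngrams)
    n0 = estimate_unseen(counts)
    # Enlarge the training-set size by the estimated number of unseen
    # n-grams instead of subtracting a discount from each count.
    enlarged = total + n0
    probs = {gram: c / enlarged for gram, c in counts.items()}
    unseen_mass = n0 / enlarged
    return probs, unseen_mass


probs, unseen_mass = smoothed_probs("a b a b a c".split(), n=2)
```

In this toy run the seen bigram probabilities and the reserved unseen mass sum to one, since both share the enlarged denominator. How the reserved mass is distributed over lower-order n-grams is left out, because the abstract does not specify it.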

Keywords: language model, smoothing techniques.

UDC: 519.766.4

Received: 05.07.2011
Accepted: 29.11.2011