Abstract:
It is shown that the expectation of the n-gram frequency of occurrence depends on the size of the training set and on the size of the dictionary built from that set. A method is proposed for smoothing an n-gram language model with respect to the probabilities of lower-order n-grams. The approach is based on modeling the expectation function of the n-gram occurrence probability. Instead of discounting the maximum n-gram probability, we suggest enlarging the size of the training set by the expected number of unseen n-grams. To model the number of unseen n-grams, the expectation function of the n-gram frequency of occurrence is extrapolated to zero frequency. The expectation function is modeled through statistical analysis of word occurrences in texts.