Alex's Notes

Smoothing (Jurafsky Martin)

As discussed in Jurafsky Martin Chapter 03: N-gram Language Models

Introduction

Sometimes we will see words that are in the vocabulary of our language model but appear in a test set in an unseen context (for example after a word that never appeared in the training set).

To keep a language model from assigning zero probability to these events, we shave off a bit of probability mass from some more frequent events, and give it to the events that we’ve never seen. This is called smoothing or discounting.

JM present several ways to do smoothing:

Laplace Smoothing

The simplest way to smooth is to add one to all the n-gram counts, before normalization to probabilities. This is called Laplace smoothing, or add-one smoothing.

It does not perform well enough to be used in modern n-gram models, but provides a helpful baseline, and is practical for other tasks, like text classification.
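A minimal sketch of add-one smoothing for a bigram model (the toy corpus and function names here are my own, not from the chapter):

```python
from collections import Counter

def laplace_bigram_prob(bigrams, unigrams, V, w_prev, w):
    """Add-one smoothed estimate: P(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)                  # the:2, cat/sat/on/mat:1
bigrams = Counter(zip(tokens, tokens[1:]))  # each bigram appears once
V = len(unigrams)                           # vocabulary size = 5

p_seen = laplace_bigram_prob(bigrams, unigrams, V, "the", "cat")    # (1+1)/(2+5)
p_unseen = laplace_bigram_prob(bigrams, unigrams, V, "mat", "sat")  # (0+1)/(1+5)
```

The unseen bigram now gets a small but nonzero probability, which is exactly the point of smoothing.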

The issue with Laplace smoothing is that it shifts too much probability mass to the zero counts, sharply distorting the probabilities of the events we did see. Other methods try to fix this:

Add-k Smoothing

An alternative to add-one smoothing is to add a bit less: a fractional count k, like 0.5 or 0.1. We need to choose k, which we tend to do by optimizing on a devset. Again, this is useful for some tasks like text classification, but doesn't perform well in language models, generating counts with poor variances and often inappropriate discounts.
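A sketch of add-k with a devset-driven choice of k (toy data and function names are illustrative, not from the chapter):

```python
from collections import Counter
import math

def add_k_prob(bigrams, unigrams, V, w_prev, w, k):
    """Add-k smoothed estimate: (C(w_prev, w) + k) / (C(w_prev) + k * V)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

def devset_log_likelihood(dev_tokens, bigrams, unigrams, V, k):
    """Sum of log P over the devset's bigrams; higher is better."""
    return sum(math.log(add_k_prob(bigrams, unigrams, V, a, b, k))
               for a, b in zip(dev_tokens, dev_tokens[1:]))

train = "the cat sat on the mat the cat sat".split()
dev = "the cat sat on the mat".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams)

# Pick k by a simple grid search on the devset.
best_k = max((0.01, 0.1, 0.5, 1.0),
             key=lambda k: devset_log_likelihood(dev, bigrams, unigrams, V, k))
```

On this toy devset every bigram was seen in training, so the search prefers the smallest k; real devsets with unseen events push k upward.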

Backoff and Interpolation

When we haven't seen any instances of \(w_{n-2}w_{n-1}w_n\), there is an alternative to estimating \(P(w_n|w_{n-2}w_{n-1})\) directly: we can use less context, estimating the probability from a shorter history. There are two approaches:

Backoff

In backoff, we use e.g. the trigram if the evidence is sufficient, otherwise the bigram, otherwise the unigram. In other words, we only "back off" to a lower-order n-gram if we have zero evidence for a higher-order one.

We need to discount the higher-order n-grams to save some probability for the lower-order ones. This kind of backoff with discounting is called Katz backoff.

Interpolation

In interpolation we always mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram, and unigram counts.

This might be a simple linear interpolation, where the unigram, bigram, and trigram probabilities are weighted by \(\lambda\) values that sum to 1:

\[\hat{P}(w_n|w_{n-2}w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n | w_{n-1}) + \lambda_3 P(w_n | w_{n-2}w_{n-1})\]
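A sketch of simple linear interpolation over MLE component estimates (the toy data and function names are mine, and the \(\lambda\) values are illustrative, not tuned):

```python
from collections import Counter

def interp_prob(w3, w1, w2, uni, bi, tri, total, lambdas):
    """lam1*P(w3) + lam2*P(w3|w2) + lam3*P(w3|w1,w2), with MLE components.
    A component with a zero-count denominator contributes 0."""
    l1, l2, l3 = lambdas
    p1 = uni[w3] / total
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

tokens = "i want to eat i want to drink".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)

lambdas = (0.1, 0.3, 0.6)  # must sum to 1
p_seen = interp_prob("to", "i", "want", uni, bi, tri, total, lambdas)
p_unseen = interp_prob("eat", "drink", "i", uni, bi, tri, total, lambdas)
```

Even when the trigram and bigram counts are zero, the unigram term keeps the interpolated estimate nonzero.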

This can be made more sophisticated by computing each \(\lambda\) weight as a function of the context. For example, if we have particularly accurate counts for a given bigram, we can trust trigrams based on that bigram more, and make their \(\lambda\)s higher:

\[\hat{P}(w_n|w_{n-2}w_{n-1}) = \lambda_1(w_{n-2:n-1}) P(w_n) + \lambda_2(w_{n-2:n-1}) P(w_n | w_{n-1}) + \lambda_3(w_{n-2:n-1}) P(w_n | w_{n-2}w_{n-1})\]

These \(\lambda\) values are set by training on a held-out corpus: we fix the n-gram probabilities and then search for the \(\lambda\) values that maximize the likelihood of the held-out data.
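The held-out search can be sketched as a grid search over the \(\lambda\) simplex. Here I assume the three fixed component probabilities \((P_{uni}, P_{bi}, P_{tri})\) have already been computed for each held-out trigram; the numbers are made up:

```python
import math
from itertools import product

# Hypothetical precomputed (P_uni, P_bi, P_tri) for each held-out trigram.
# The n-gram probabilities stay fixed; only the lambdas are searched.
heldout = [(0.01, 0.20, 0.50), (0.03, 0.10, 0.00), (0.02, 0.30, 0.60)]

def heldout_ll(lams):
    """Log-likelihood of the held-out trigrams under weights (l1, l2, l3)."""
    l1, l2, l3 = lams
    return sum(math.log(l1 * p1 + l2 * p2 + l3 * p3) for p1, p2, p3 in heldout)

# Coarse grid over the simplex l1 + l2 + l3 = 1, each weight >= 0.1.
grid = [(l1 / 10, l2 / 10, (10 - l1 - l2) / 10)
        for l1, l2 in product(range(1, 10), repeat=2) if l1 + l2 < 10]
best = max(grid, key=heldout_ll)
```

In practice the chapter's EM-style or held-out optimization replaces this brute-force grid, but the objective is the same.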

Kneser-Ney Smoothing

The interpolated Kneser-Ney algorithm is one of the most commonly used and best performing n-gram smoothing methods.

It builds on the idea of absolute discounting, where we subtract a fixed absolute discount d from each count we see in the training set. Subtracting d barely changes the n-grams we've seen many times, but has a larger relative effect on the ones we've seen rarely, which is OK because we don't trust those low counts as much anyway.

Kneser-Ney augments this idea by also asking how likely a word is to appear in a new context. For example, the word glasses occurs in a wider range of contexts than Kong, even though Kong is very common because of the phrase Hong Kong.

We can count the number of distinct contexts a word has appeared in, turn that into a continuation probability, and combine it with the absolute-discounted estimate, weighting our models toward words that appear in many contexts over words that appear only in a narrow range of contexts.

The chapter contains the mathematical formalizations if needed.

Large Language Models and Stupid Backoff

We can build very large language models by using the web or other big corpora. See eg Google’s Ngrams corpora from its books collection, or its web corpus.

When storing such large models, we tend to represent each word as a hash value, store n-grams in reverse tries, and quantize probabilities down to only 4-8 bits.

We can shrink the models by pruning, storing only n-grams with counts above a threshold. With such tools it's possible to use full Kneser-Ney smoothing, but with very large corpora simpler algorithms may be sufficient. For example, stupid backoff does not produce a true probability distribution, since it does not discount higher-order n-grams to reserve probability mass. It simply backs off to a lower-order n-gram score, weighted by a fixed (context-independent) weight.
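A sketch of stupid backoff for trigrams (toy data and names are mine; the fixed weight 0.4 is the value used by Brants et al. 2007):

```python
from collections import Counter

ALPHA = 0.4  # fixed, context-independent backoff weight

def stupid_backoff_score(w3, w1, w2, tri, bi, uni, total):
    """S(w3 | w1 w2): relative frequency if the higher-order n-gram was seen,
    else ALPHA times the lower-order score. Scores are NOT normalized
    probabilities -- no discounting is done."""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return ALPHA * bi[(w2, w3)] / uni[w2]
    return ALPHA * ALPHA * uni[w3] / total

tokens = "i want to eat i want to drink".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

s_seen = stupid_backoff_score("to", "i", "want", tri, bi, uni, len(tokens))
s_bigram = stupid_backoff_score("want", "to", "i", tri, bi, uni, len(tokens))
s_unigram = stupid_backoff_score("eat", "to", "drink", tri, bi, uni, len(tokens))
```

The three calls exercise the three levels: a seen trigram, a backoff to the bigram, and a double backoff to the unigram.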