Alex's Notes

N-Grams (Jurafsky Martin)

As presented in Jurafsky & Martin, Chapter 3: N-gram Language Models

Definitions and Key Quotes

Models that assign probabilities to sequences of words are called language models or LMs… [T]he simplest model… assigns probabilities to sentences and sequences of words, the n-gram. An n-gram is a sequence of n words.

So a 2-gram (or bigram) is a sequence of two words, a 3-gram (or trigram) is a sequence of three words. N-gram models estimate the probability of the last word of an n-gram given the previous words.

N-Grams

A False Start

We start with the task of computing \(P(w|h)\), the probability of a word \(w\) given some history \(h\). For example, if \(w\) is “the” and \(h\) is “its water is so transparent that”, then the probability is \(P(the|its\;water\;is\;so\;transparent\;that)\). One way to estimate this is from relative frequency counts: \[\frac{C(its\;water\;is\;so\;transparent\;that\;the)}{C(its\;water\;is\;so\;transparent\;that)}\]

If we had a large enough corpus, we could simply compute these counts. But in practice even the whole web is not big enough to give us good estimates in most cases. This is because language is generative: new sentences are created all the time, and any sentence we have never seen gets a count, and hence an estimated probability, of 0 based on counting alone.

Likewise the probability of any individual 5-gram would be the count of that 5-gram divided by the count of all possible 5-grams, which is huge…
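The counting approach above can be sketched directly. This is a minimal illustration on a made-up toy corpus (far too small for real estimates, which is exactly the point of the "false start"):

```python
from collections import Counter

# Made-up toy corpus; a real estimate would need vastly more data.
corpus = "its water is so transparent that the water is so clear".split()

def count_ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

history = ("its", "water", "is", "so", "transparent", "that")
c7 = count_ngrams(corpus, 7)
c6 = count_ngrams(corpus, 6)

# P(the | history) estimated as C(history + "the") / C(history)
p = c7[history + ("the",)] / c6[history]
print(p)  # 1.0 — the history occurs once, always followed by "the"
```

In this tiny corpus the estimate is 1.0, which illustrates the problem: counts from small (or even enormous) corpora assign probability 0 or 1 to most long sequences.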

The Chain Rule

Notation:

  • \(P(the)\) is a simplification of \(P(X_i='the')\), where \(X_i\) is some random variable.

  • A sequence of \(n\) words is written either \(w_1\dots w_n\) or \(w_{1:n}\), so \(w_{1:n-1}\) is all but the last word.

  • For the joint probability of each word in a sequence having a particular value we use \(P(w_1, w_2,\dots , w_n)\) as a shorthand for \(P(X_1=w_1, X_2=w_2,\dots , X_n = w_n)\).

The chain rule of probability states that:

\[P(X_1 \dots X_n) = P(X_1)P(X_2|X_1)P(X_3|X_{1:2})\dots P(X_n|X_{1:n-1})\]

\[= \prod^n_{k=1} P(X_k|X_{1:k-1})\]

Applying this to words we can see the link between the joint probability of a sequence and the conditional probability of a word given the previous words:

\[P(w_{1:n}) = P(w_1)P(w_2|w_1)P(w_3|w_{1:2})\dots P(w_n|w_{1:n-1})\]

\[= \prod^n_{k=1} P(w_k|w_{1:k-1})\]

On its own this doesn’t seem to help much, as we still don’t know how to compute the exact probability of a word given a long sequence of preceding words.
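Numerically, the chain rule just multiplies the conditional probabilities term by term. A minimal sketch with made-up conditional probabilities (the values are illustrative, not from any corpus):

```python
from functools import reduce

# Hypothetical conditional probabilities P(w_k | w_{1:k-1}) for a
# 4-word sequence; the numbers are made up for illustration.
conditionals = [0.2, 0.4, 0.1, 0.5]

# P(w_{1:n}) = product over k of P(w_k | w_{1:k-1})
p_joint = reduce(lambda a, b: a * b, conditionals)
print(p_joint)  # ≈ 0.004
```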

Approximation and the Markov Assumption

Instead of computing the probability of a word given its entire history, n-gram models approximate the history using just the last few words. E.g., in a bigram model we condition only on the preceding word. So we make the following approximation:

\[P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-1})\]

The assumption that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

We can generalize the bigram model, which looks one word back, to the n-gram model, which looks \(N-1\) words back:

\[P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-N+1:n-1})\]

In the bigram model the probability of a whole word sequence is:

\[P(w_{1:n}) \approx \prod^n_{k=1} P(w_k|w_{k-1})\]
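Under the bigram approximation, the sequence probability is just a product over adjacent word pairs. A minimal sketch, assuming a small table of hypothetical bigram probabilities (the words and values are made up for illustration):

```python
# Hypothetical bigram probabilities, keyed by (previous, current) word.
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
}

def sequence_probability(words, probs):
    """P(w_{1:n}) under the bigram model: multiply P(w_k | w_{k-1})
    over adjacent pairs."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= probs.get((prev, cur), 0.0)  # unseen bigrams get 0 under raw MLE
    return p

print(sequence_probability(["<s>", "i", "want", "food"], bigram_p))
```

Note that any single unseen bigram zeroes out the whole sequence, which is why smoothing (covered later in the chapter) matters.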

Maximum Likelihood Estimation

How do we estimate these n-gram probabilities? An intuitive way is called maximum likelihood estimation or MLE, where we get counts from a corpus and then normalize to a real value in the range [0,1].

For example in a bigram model we can compute the bigram probability:

\[P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}\]

Generalizing to the n-gram model:

\[P(w_n|w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}w_n)}{C(w_{n-N+1:n-1})}\]

This ratio, of the observed frequency of a particular sequence to the observed frequency of a prefix, is called a relative frequency.
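The MLE formula can be sketched directly from counts. This uses the tiny three-sentence "Sam" corpus from the chapter, with the sentence-boundary markers "<s>" and "</s>":

```python
from collections import Counter

# The chapter's toy training corpus, with boundary markers.
sentences = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for s in sentences:
    tokens = s.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def mle_bigram(prev, cur):
    """MLE estimate: P(cur | prev) = C(prev cur) / C(prev)."""
    return bigrams[(prev, cur)] / unigrams[prev]

print(mle_bigram("<s>", "i"))   # 2/3
print(mle_bigram("i", "am"))    # 2/3
print(mle_bigram("am", "sam"))  # 1/2
```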

What kinds of linguistic phenomena are captured in these bigram statistics? Some probabilities encode syntactic facts, like the fact that what comes after “to” is often a verb. Others reflect the corpus and task involved, like personal assistant queries often starting with “I”. Some might be cultural, like users of a search corpus often asking about “Chinese food”.

Practically, it is very common to use trigram, 4-gram, or 5-gram models when sufficient data is available.

We use special symbols, like “<s>”, to denote the context of the beginning of a sentence (and “</s>” for the end). E.g., for a trigram probability of “I” at the beginning of a sentence we look at \(P(I|<s><s>)\).

We always represent and compute language model probabilities in log format, as log probabilities. This avoids numerical underflow: multiplying long chains of numbers < 1 produces values too small to represent, whereas adding log probabilities keeps everything in a representable range (and we can exponentiate at the end if we need a raw probability).
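The underflow problem is easy to demonstrate. A minimal sketch: the true product of 100 probabilities of \(10^{-5}\) each is \(10^{-500}\), far below the smallest positive float (about \(5 \times 10^{-324}\)), so the raw product collapses to 0.0 while the log sum stays representable:

```python
import math

# A chain of 100 probabilities of 1e-5 each.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # underflows to exactly 0.0

log_prob = sum(math.log(p) for p in probs)  # stays representable

print(product)   # 0.0 — underflow
print(log_prob)  # roughly -1151.29
```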