Manning Schutze: Chapter 06: Statistical Inference
Metadata
Title: Statistical Inference: n-gram Models over Sparse Data
Number: 6
Companion website: Includes the Austen corpus used for building the n-gram model in the chapter.
Core Ideas
Statistical inference in general consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about this distribution.
MS divide the problem into 3 areas: 1) Dividing the training data into equivalence classes; 2) finding a good statistical estimator for each class; 3) combining multiple estimators.
Focus on the classic task of language modelling: predicting the next word given the previous words. This is a well-researched problem, and its solutions extend to other tasks.
Forming Equivalence Classes
In order to do inference about one feature, we want to find other features of the model that predict it. We assume that past behaviour is a good guide (ie the model is stationary). So we have a classification task: predict the target feature on the basis of various classificatory features.
When doing this, we effectively divide the data into equivalence classes that share values for certain classificatory features, tacitly making independence assumptions about everything else. The more classificatory features we use, the more finely we can discriminate, ie determine the probability distribution of the target feature. However, the more bins we use, the fewer training instances fall into each class, which hurts the reliability of our estimates. Our goal is to find classes that balance these competing goals of discrimination and reliability.
n-gram models
Deciding the Equivalence Classes
We can state the prediction of a next word as an attempt to estimate the probability function \(P(w_n | w_1,\dots ,w_{n-1})\). We use a classification of the history to predict the next word.
But we can’t treat each whole history separately: we have most likely never seen that exact sentence before. So we have to group histories that are similar in some way, to give reasonable predictions as to which words to expect next. One way to do this is to make a Markov assumption: only the prior local context - the last few words - affects the next word.
If we construct a model where all histories that have the same last n - 1 words are placed in the same equivalence class, then we have an \((n-1)^{th}\) order Markov model or an n-gram word model (the last word of the n-gram being the word we are predicting).
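To make the equivalence classes concrete, here is a small sketch of binning next-word observations by their last n - 1 words. The helper name and example sentence are hypothetical, not from the book:

```python
from collections import defaultdict

def ngram_bins(tokens, n=3):
    """Group each next-word observation by its (n-1)-word history.

    Every history ending in the same n-1 words lands in the same
    equivalence class (bin).
    """
    bins = defaultdict(list)
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - n + 1:i])   # the last n-1 words
        bins[history].append(tokens[i])        # the word being predicted
    return bins

tokens = "sue swallowed the large green pill and the large green pill worked".split()
bins = ngram_bins(tokens, n=3)
# all continuations observed after the history ("large", "green"):
print(bins[("large", "green")])  # ['pill', 'pill']
```

For a trigram model, every position in the corpus contributes one (history, next word) pair, and estimation then happens per bin.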
We have to strike a balance in setting n. We want it to be fairly large. Consider “Sue swallowed the large green …”: “swallowed” strongly constrains what comes next. “The large green tree” may be a likely combination in general, but it is implausible here, and only a model whose history still reaches back to “swallowed” can know that.
But if we have too many bins there are a lot of parameters to estimate. Here’s a sense of parameter growth as the model order increases, with a vocabulary of 20,000 words:
Model | Parameters |
---|---|
1st order Markov (bigram) | 400 million |
2nd order (trigram) | 8 trillion |
3rd order (4-gram) | \(1.6 \times 10^{17}\) |
So a 5-gram may not be practical at all, and we generally use bigrams and trigrams. We also might use stemming or semantic equivalence classes to reduce the size of the model without reducing n. We might also use more sophisticated models based on parsing the main predicate in a sentence, but these are complex and covered later.
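The parameter counts in the table above are just V to the power n, one probability per possible n-gram. A quick arithmetic check:

```python
# Number of parameters in an n-gram model over a vocabulary of size V:
# roughly one probability per possible n-gram, i.e. V**n.
V = 20_000

params = {name: V**n for n, name in [(2, "bigram"), (3, "trigram"), (4, "4-gram")]}
for name, p in params.items():
    print(f"{name}: {p:.1e} parameters")
# bigram: 4.0e+08, trigram: 8.0e+12, 4-gram: 1.6e+17
```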
This seems linguistically ridiculous - predicting a word based only on the past two words with no other understanding of syntax or semantics. In reality it works surprisingly well, and in pure prediction tasks is hard to beat.
Deciding the Estimators
Given the pieces of training data that fall into a particular bin (eg n-gram class), how do we derive a good probability estimate for the target feature?
With n-grams, this reduces to the problem of getting a good estimate for \(P(w_n | w_1 \dots w_{n-1}) = \frac{P(w_1 \dots w_n)}{P(w_1 \dots w_{n-1})}\).
They use the following notation:
Symbol | Meaning |
---|---|
N | Number of training instances |
B | Number of values in the multinomial target feature distribution |
V | Vocabulary size |
\(w_{1n}\) | An n-gram \(w_1 \dots w_n\) in the training text |
\(C(w_1 \dots w_n)\) | Frequency of n-gram \(w_{1n}\) in the training text |
r | Frequency of an n-gram |
\(f(\cdot)\) | Frequency estimate of a model |
\(N_r\) | Number of target feature values seen r times in training instances |
\(T_r\) | Total count, in further (held out) data, of n-grams that appeared r times in the training data |
h | ‘History’ of preceding words |
Maximum Likelihood Estimation
Once we have made our equivalence classes, we shall end up with bins that contain a number of instances. For example, in a trigram model we shall have a bin where the two preceding words were “come across”. Let’s say we have 10 instances in that bin. 8 of them were followed by “as”, one by “more”, and one by “a”. What probability estimates should we use for the next word?
We could use the relative frequency as a probability estimate, so P(“as”) = 0.8, P(“more” ) = 0.1, P(“a”) = 0.1, and otherwise P(x) = 0.
This is called the maximum likelihood estimate (MLE). The name comes from the statistical notion of a likelihood function: we fix the observed data and consider, over the space of all possible parameter values for our chosen model (here a trigram model), the probability each assignment gives to that data. The MLE is the choice of parameter values which gives the highest probability to the training corpus. No probability mass is given to events that are not in the training corpus; the probability of the observed events is made as high as possible, subject to the constraint that the estimates form a probability distribution.
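The MLE for a bin is just relative frequency. A minimal sketch using the “come across” example above (the helper name is mine):

```python
from collections import Counter

def mle(continuations):
    """Maximum likelihood estimate: relative frequency within a bin."""
    counts = Counter(continuations)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# the "come across" bin from the text: 8 x "as", 1 x "more", 1 x "a"
bin_words = ["as"] * 8 + ["more"] + ["a"]
probs = mle(bin_words)
print(probs)  # {'as': 0.8, 'more': 0.1, 'a': 0.1}
```

Any word not in the dict implicitly gets probability 0, which is exactly the sparseness problem discussed next.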
Unfortunately the MLE is unsuitable for statistical inference in NLP due to the sparseness of our data. The majority of words are uncommon, and the MLE assigns zero probability to unseen events. These zeroes then propagate through the multiplications of the chain rule: a single unseen n-gram makes the whole sentence probability zero. Eg after training on a 1.5 million word corpus, one paper found that 23% of trigrams in a test set were previously unseen. More data might help a bit, but it’s not a solution. In language:
there is a seemingly never ending tail to the probability distribution of rarer and rarer events, and we can never collect enough data to get to the end of the tail.
So we need estimators that account for the fact that we will see events that we did not see in the training data. These methods work by somewhat decreasing the probability of seen events, leaving some mass over for unseen ones, so they are called discounting methods. The process is often referred to as smoothing, since a distribution without zeroes is smoother than one with them.
Smoothing
Practical asides and terminology: hapax legomena are terms encountered only once. It is common to map these to a single token such as ‘<UNK>’; this greatly reduces the parameter space and memory requirements without much affecting model quality (hapax legomena often constitute around 1/2 of types, but only a small fraction of tokens).
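The '<UNK>' mapping is a one-pass count-and-replace. A minimal sketch (helper name is mine):

```python
from collections import Counter

def map_hapax_to_unk(tokens):
    """Replace hapax legomena (words seen exactly once) with '<UNK>'."""
    counts = Counter(tokens)
    return [w if counts[w] > 1 else "<UNK>" for w in tokens]

tokens = "the cat sat on the mat".split()
print(map_hapax_to_unk(tokens))
# ['the', '<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>']
```

The same '<UNK>' probability then serves for any out-of-vocabulary word at test time.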
A trigram model can work brilliantly if it has seen enough data, but otherwise it can fail totally: it may never have seen the preceding bigram at all, or have seen it only a few times, making the probability estimate rubbish. This suggests the strategy of using the trigram when there is enough data and otherwise falling back to the bigram. However we tackle this, we need some way of assigning probability estimates to new words or n-grams we haven’t seen yet.
Introduces Laplace’s law, which adds one to every count before normalizing (see Smoothing (Jurafsky Martin) for details). Laplace’s law can give far too much of the probability mass to unseen events, especially with sparse data over large vocabularies.
As a result, we might use Lidstone’s law instead, where we don’t add 1, but some smaller value \(\lambda\). This is a linear interpolation of the MLE estimate and a uniform prior. The most widely used value for \(\lambda\) is 1/2. This often helps but there are two objections: 1) We need a good way to guess \(\lambda\), and 2) this always produces estimates linear in the MLE frequency, which isn’t a good match to the empirical distribution at low frequencies.
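Lidstone’s law is a one-line formula; a sketch on the “come across” bin (function name is mine), which also shows how much mass add-λ pulls away from seen events with a 20,000-word vocabulary:

```python
def lidstone(count, total, vocab_size, lam=0.5):
    """Lidstone's law: add lambda to every count before normalizing
    (lambda = 1 gives Laplace's law; lambda = 1/2 is the common choice)."""
    return (count + lam) / (total + lam * vocab_size)

# the "come across" bin: "as" seen 8 times out of 10, vocab of 20,000
p_seen = lidstone(8, 10, 20_000)    # far below the MLE of 0.8
p_unseen = lidstone(0, 10, 20_000)  # small but nonzero
print(p_seen, p_unseen)
```

With 19,997 unseen words each getting `p_unseen`, the bulk of the probability mass ends up on events never observed, illustrating the objection above.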
Testing and Cross-Validation
In statistical NLP we often divide our data into three, because we shall use some portion of reserved held out data to smooth the probabilities from the training data, and then further test on a final set of test data which can be used for published evaluation. It’s common to iterate several times in model building with a training and development set. It’s also good to report on the variance of performance across groups in the final test set, rather than just a single metric.
Rather than using some of the training data for frequency counts and some for smoothing estimates, we can apply methods that use each part of the training data both as initial training and as held out data. These methods are known as cross-validation.
One such method is deleted estimation, which interpolates on the training and held out data, calculating a weighted average having trained and smoothed on both sets. On large corpora this seems to produce results close to the gold standard, but for smaller training sets it is still some way off, and we still have the issue of overestimating the probability of unseen n-grams.
Another approach is Leaving-One-Out where the primary training corpus is N - 1 tokens, with one token being left out for testing. Then the process is repeated N times so every piece of data is left out in turn. This can show how the model changes if any piece of data has not been observed.
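The Leaving-One-Out splitting scheme can be sketched in a few lines (the helper name is mine):

```python
def leave_one_out_splits(tokens):
    """Yield (training_data, held_out_token) pairs, leaving each of the
    N tokens out in turn, so every piece of data serves as held out data once."""
    for i in range(len(tokens)):
        yield tokens[:i] + tokens[i + 1:], tokens[i]

splits = list(leave_one_out_splits(["a", "b", "c"]))
print(splits)  # [(['b', 'c'], 'a'), (['a', 'c'], 'b'), (['a', 'b'], 'c')]
```

Each split shows how the model would look if that one token had never been observed.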
Good-Turing Estimation
One method for probability estimation comes from Turing, via Good, so is known as the Good-Turing estimator. The derivation assumes the items are binomially distributed, but the estimator also works well where we have a large number of observations over a large vocabulary, so it applies well to n-grams even when the binomial assumption does not strictly hold. The approach hinges on finding an \(r^*\), an adjusted frequency, and the Good-Turing probability is then \(P_{GT} = r^*/N\). How do we find \(r^*\)? For previously observed items, the theorem holds that:
\[r^* = (r+1) \frac{E(N_{r+1})}{E(N_r)}\]
Where E is the expectation of a random variable. The total probability mass reserved for unseen events is \(E(N_1)/N\). We can hope to use our observed counts-of-counts \(N_r\) in place of the expectations \(E(N_r)\). But there’s an issue: for our most frequently observed n-gram, the observed \(N_{r+1}\) will be zero, and so the estimated probability of our most frequent observation would also be 0.
One way out is to use G-T as a re-estimation method only for low frequencies, where \(r < k\) for some constant k (eg 10). For high frequency words we just use the MLE, since the high counts make it more accurate there than it is at low frequencies.
Another way out is to fit a function S through the observed values \((r, N_r)\) and use the smoothed values \(S(r)\) for the expectation.
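A sketch of the first way out, the frequency cutoff (function name is mine; raw counts-of-counts stand in for the expectations, and no smoothing of \(N_r\) or renormalization is done):

```python
from collections import Counter

def good_turing_counts(ngram_counts, k=10):
    """Adjusted counts r* = (r+1) * N_{r+1} / N_r for low frequencies r < k;
    above the cutoff the MLE count is kept unchanged."""
    n_r = Counter(ngram_counts.values())  # N_r: number of n-grams seen r times
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if r < k and n_r.get(r + 1, 0) > 0:
            adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[ngram] = float(r)    # high frequency: keep the MLE count
    return adjusted

counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 3}
adj = good_turing_counts(counts, k=10)
# r = 1 becomes 2 * N_2 / N_1 = 2/3; r = 2 becomes 3 * N_3 / N_2 = 3;
# r = 3 has N_4 = 0, so the MLE count is kept
```

Dividing each adjusted count by N would then give the (unnormalized) Good-Turing probabilities.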
p.213 of the book presents the full formulae, and digs into the Simple Good-Turing method, which combines the two approaches.
Under any approach the estimates will need to be renormalized to produce a proper probability distribution. This can be done either by adjusting the probabilities assigned to unseen events, or seen events.
Combining Estimators
So far the methods have all made use of nothing but the raw frequency r of an n-gram and tried to produce the best guess from that.
But instead of just giving the same probability to all n-grams that never appeared, we might do better by looking at (n-1)-grams, if those are rare we could give a low estimate, but if they are common, we could give a higher estimate.
This raises the more general question of how we might combine the probability estimates of different models to produce better results. For n-gram models, the secret to success is combining various models of different orders.
Linear Interpolation
Trigram models suffer a sparseness problem, one solution is to mix the model with bigram and unigram models. We can make a linear combination of them, providing that we weight the contribution of each so that the result is another probability function. In NLP this is called linear interpolation, but it also has the name (finite) mixture models. The basic frame of linear interpolation is:
\[P_{li}(w_n | w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n | w_{n-1}) + \lambda_3 P_3(w_n | w_{n-2}, w_{n-1})\]
Where \(0 \leq \lambda_i \leq 1\) and \(\sum_i \lambda_i = 1\). The weights may be set by hand or learned.
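A minimal sketch of the interpolation, with toy component models and illustrative (untuned) weights; all names here are mine:

```python
def interpolated_prob(w, h, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram and trigram estimates.

    The lambdas are nonnegative and sum to 1, so the mixture is again
    a probability function.
    """
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((h[-1], w), 0.0)
            + l3 * p_tri.get((h[-2], h[-1], w), 0.0))

# toy component models for the history ("come", "across")
p_uni = {"as": 0.01}
p_bi = {("across", "as"): 0.2}
p_tri = {("come", "across", "as"): 0.8}
p = interpolated_prob("as", ("come", "across"), p_uni, p_bi, p_tri)
print(p)  # 0.1*0.01 + 0.3*0.2 + 0.6*0.8 = 0.541
```

Note that even if the trigram estimate is zero, the bigram and unigram terms keep the mixture nonzero, which is the point of the combination.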
Back-off
In back-off models different models are consulted in order depending on their specificity. The estimate for an n-gram is allowed to back off through progressively shorter histories. If the n-gram has occurred more than k times (k usually = 0 or 1) then an n-gram estimate is used, but the MLE estimate is discounted by a function d so probability mass is reserved. G-T estimates can be used to calculate the discount. If the n-gram occurred k times or less, we back off to the shorter n-gram, but that probability is multiplied by an \(\alpha\) factor to normalize the estimates.
Back-off can work badly in some circumstances, eg where we have seen \(w_iw_j\) many times but never \(w_iw_jw_k\): that absence may itself be significant (grammatically, or because of named entities), yet the model simply falls through to the bigram estimate. There are complications of the model that try to accommodate this. Despite these issues, back-off models are very simple and can work well in practice.
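A heavily simplified sketch of the back-off scheme for a trigram model. The constant `d` and `alpha` here are crude stand-ins for the Good-Turing discount function and the properly normalizing \(\alpha\) described above; names and values are mine:

```python
def backoff_prob(w3, w1, w2, tri_counts, bi_counts, p_lower, k=0, d=0.5, alpha=0.4):
    """Simplified back-off estimate for P(w3 | w1, w2): use a discounted
    trigram MLE if the trigram occurred more than k times, otherwise fall
    back to a lower-order estimate scaled by alpha."""
    c3 = tri_counts.get((w1, w2, w3), 0)
    c2 = bi_counts.get((w1, w2), 0)
    if c3 > k and c2 > 0:
        return d * c3 / c2              # discounted trigram estimate
    return alpha * p_lower(w3, w2)      # back off to the shorter history

tri_counts = {("come", "across", "as"): 8}
bi_counts = {("come", "across"): 10}
p_bigram = lambda w, h: 0.05            # stand-in lower-order model

print(backoff_prob("as", "come", "across", tri_counts, bi_counts, p_bigram))   # 0.4
print(backoff_prob("the", "come", "across", tri_counts, bi_counts, p_bigram))  # alpha * 0.05
```

In a real Katz back-off model, \(\alpha\) depends on the history so that the estimates for each history sum to exactly one.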
General Linear Interpolation
Back-off models and linear interpolations are in fact a special case of a more general linear interpolation model for combining k probability functions:
\[P_{li}(w|h) = \sum^k_{i=1} \lambda_i(h)P_i(w | h)\]
Where, for any history h, \(0 \leq \lambda_i(h) \leq 1\) and \(\sum_i \lambda_i(h) = 1\).