Alex's Notes

Language Model Evaluation (Jurafsky Martin)

As presented in Jurafsky Martin Chapter 03: N-gram Language Models

Evaluating Language Models

The best way to evaluate the performance of a language model is to embed it in an application and see if it helps in a particular task. This is called extrinsic evaluation. Eg for speech recognition, we would want to run the speech recognizer with two alternative models and assess them.

This is often expensive, and we want a way to quickly inspect the potential improvements of a new model. An intrinsic evaluation metric is one that measures the quality of a model independent of any application.

As elsewhere in ML, we can divide our corpus into a training set and a test set: we train the model on the training set, then see how well it fits the test set. How do we measure fit? Of two models, the one that assigns the higher probability to the test set is the one that fits it better.

If we use a particular test set too much, we can implicitly end up training on it, and then we need a further set of unseen data to test on. In that case we call the initial test set the devset or development set. We often divide into 80% training, 10% development, and 10% test.
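A minimal sketch of the 80/10/10 split described above (the corpus and function name here are just illustrative, not from JM):

```python
import random

def split_corpus(sentences, train=0.8, dev=0.1, seed=0):
    """Shuffle sentences and split into train / dev / test portions."""
    sents = list(sentences)
    random.Random(seed).shuffle(sents)
    n = len(sents)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (sents[:n_train],
            sents[n_train:n_train + n_dev],
            sents[n_train + n_dev:])

corpus = [f"sentence {i}" for i in range(100)]
train_set, dev_set, test_set = split_corpus(corpus)
print(len(train_set), len(dev_set), len(test_set))  # 80 10 10
```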

Perplexity

In practice we don’t use raw probability to evaluate language models, but a variant called perplexity.

The perplexity… of a language model on a test set is the inverse probability of the test set, normalized by the number of words.

Perplexity is often called PP for short. Formally the perplexity of a test set W is:

\[PP(W) = \sqrt[N]{\prod^N_{i=1} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}\]

Because of the inverse, the higher the conditional probability, the lower the perplexity. So minimizing perplexity is equivalent to maximizing the test set probability according to the model.
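The formula above can be computed directly. A sketch (my own, not from JM) that takes each word's conditional probability under the model and returns the perplexity, working in log space to avoid numerical underflow on long test sets:

```python
import math

def perplexity(cond_probs):
    """Perplexity of a test set, given P(w_i | w_1 ... w_{i-1}) for each
    of its N words: the N-th root of the inverse probability, i.e.
    exp(-(1/N) * sum of log probabilities)."""
    n = len(cond_probs)
    log_prob = sum(math.log(p) for p in cond_probs)
    return math.exp(-log_prob / n)

# A model that assigns every word probability 1/10 has perplexity 10,
# matching the branching-factor intuition below.
print(perplexity([0.1] * 20))  # ~10.0
```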

Another way of thinking about perplexity is as a weighted average branching factor. The branching factor of a language is the number of possible next words that can follow any word. The weighting reflects how the theoretical branching factor of the vocabulary is constrained by the probabilities seen in the training data.

Perplexity is only comparable across two language models if they use identical vocabularies. JM show an example where a trigram model has much lower perplexity than a bigram model when both are trained on the WSJ corpus. A perplexity improvement correlates with improvement in real-world tasks, but does not guarantee it, and so has to be checked.

Sampling from a Model and Generalization

One important way to visualize what kind of knowledge a language model embodies is to sample from it. Sampling from a distribution means to choose random points according to their likelihood. Thus sampling from a language model - which represents a distribution over sentences - means to generate some sentences, choosing each sentence according to its likelihood as defined by the model.

Eg to generate a sentence with a bigram model, we first choose a random bigram starting with ‘<​s>’, then choose a random bigram starting with the second word of the first, and so on, each according to its probability, until we generate the end-of-sentence token.
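That sampling loop can be sketched as follows. The bigram table here is a made-up toy (a real model would estimate these probabilities from a corpus), and `<s>` / `</s>` mark sentence start and end:

```python
import random

# Hypothetical bigram distributions P(next | current) for illustration.
bigrams = {
    "<s>":  [("I", 0.6), ("the", 0.4)],
    "I":    [("am", 0.5), ("like", 0.5)],
    "the":  [("cat", 1.0)],
    "am":   [("</s>", 1.0)],
    "like": [("the", 0.7), ("</s>", 0.3)],
    "cat":  [("</s>", 1.0)],
}

def sample_sentence(model, rng=random):
    """Walk the bigram chain from <s>, sampling each next word
    according to its conditional probability, until </s>."""
    word, out = "<s>", []
    while True:
        words, probs = zip(*model[word])
        word = rng.choices(words, weights=probs)[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(sample_sentence(bigrams))
```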

The n-gram model is dependent on the training data. It will encode facts about that data, and it models the training corpus better with a higher N. JM show the latter point with examples of modelling Shakespeare with n-grams from 1 to 4. By trigrams the model samples are coherent sentences. By 4-grams, the model is taking sentences directly from the training set.

This is because for many 4-grams there is only one possible continuation in a corpus that size. To see the issue of generalization, they compare the generated sentences with those of a model trained on the WSJ corpus.

To overcome these issues, we need to use a training corpus that has a similar genre to the task we’re trying to accomplish. To translate legal docs, we need a training corpus of legal documents. We also need the appropriate dialect.

Our models might still suffer from sparsity, leaving many n-grams with zero probability. For example, if our training set has “denied the allegations” and “denied the report”, but not “denied the offer”, the model assigns “denied the offer” zero probability even though it is a perfectly plausible phrase.
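The sparsity problem falls straight out of the maximum-likelihood estimate, which JM define as counts over counts: any bigram absent from the training data gets probability zero. A toy demonstration using the “denied the …” example:

```python
from collections import Counter

training = "denied the allegations denied the report".split()
bigram_counts = Counter(zip(training, training[1:]))
unigram_counts = Counter(training)

def mle_bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(mle_bigram_prob("the", "report"))  # 0.5
print(mle_bigram_prob("the", "offer"))   # 0.0 -- unseen bigram
```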

We might also have the issue of unknown or out-of-vocabulary words, ie words we haven’t seen in the training corpus. An open vocabulary system is one where we model such unseen words with a pseudo-word ‘<​UNK>’.

There are two common ways to train the probabilities of the unknown word model. One is to turn the problem back into a closed vocabulary one by deciding the word list in advance, then converting anything in the training set that is not on the list to the ‘UNK’ symbol. Alternatively, we might do something similar based on frequency rather than a pre-defined vocabulary: converting all words with frequency below some small n into the ‘UNK’ symbol, or else deciding we’ll have a vocabulary of size N and picking the N most frequent words in the training corpus as that vocab.
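The frequency-based variant is easy to sketch (my own toy example; threshold and symbol name as described above):

```python
from collections import Counter

def apply_unk(tokens, min_count=2):
    """Replace every token seen fewer than min_count times in the
    corpus with the <UNK> pseudo-word."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else "<UNK>" for t in tokens]

tokens = "the cat sat on the mat the cat".split()
print(apply_unk(tokens))
# ['the', 'cat', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat']
```

After this preprocessing step, `<UNK>` is treated as an ordinary vocabulary word, so the model learns probabilities for it like any other.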