Alex's Notes

Jurafsky Martin Chapter 03: N-gram Language Models

Metadata

Definitions and Key Quotes

Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram. An n-gram is a sequence of n words.

So a 2-gram (or bigram) is a sequence of two words, a 3-gram (or trigram) is a sequence of three words. N-gram models estimate the probability of the last word of an n-gram given the previous words.
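The definition above can be sketched as a sliding window over a tokenized sentence. This is a minimal illustration, not the book's code; the helper name `ngrams` is my own.

```python
# Sketch of the n-gram definition: slide a window of size n over the tokens.
# (Illustrative helper, not from the chapter.)
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework in".split()
print(ngrams(tokens, 2))  # bigrams: (please, turn), (turn, your), ...
print(ngrams(tokens, 3))  # trigrams: (please, turn, your), ...
```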

Core Ideas

Starts with an intuitive example: how might we predict the word that follows "Please turn your homework…"? It seems more likely to be "in" than "refrigerator". We'll formalize this intuition with models that assign a probability to each possible next word. The same models can assign a probability to a whole sentence, predicting that the sequence "all of a sudden I notice three guys standing on a sidewalk" is far more likely than "on guys all I of notice sidewalk three a sudden standing the".
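The jump from next-word prediction to whole-sentence probability is just the chain rule: multiply the conditional probability of each word given its history. A tiny sketch, with made-up probabilities purely for illustration:

```python
import math

# Chain-rule sketch: P(sentence) is the product of per-word conditional
# probabilities. These numbers are invented for illustration only.
cond_probs = {
    ("please",): 0.01,            # P(please)
    ("please", "turn"): 0.2,      # P(turn | please)
    ("turn", "your"): 0.3,        # P(your | turn), and so on
    ("your", "homework"): 0.05,
    ("homework", "in"): 0.4,
}

sentence_prob = math.prod(cond_probs.values())
print(sentence_prob)  # a very small number, as real sentence probabilities are
```

A scrambled word order would draw tiny conditional probabilities at each step, so its product, and hence its sentence probability, would be far smaller.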

Why do this? Probabilities are essential for several NLP tasks where we want to identify words in noisy, ambiguous input, such as speech recognition, spelling correction, machine translation, and grammatical error correction.

We want our speech recognizer to predict "I will be back soonish" rather than "I will be bassoon dish", and our grammar checker to correct "Their are two midterms" to "There are two midterms". Probability models make that possible.

N-gram models can assign such probabilities. They are much simpler than neural language models, but remain an important foundational tool for understanding the fundamental concepts of language modelling.

N-Grams

The chapter then presents the n-gram model: using the chain rule and the Markov assumption to approximate a word's probability from only the previous n−1 words, with the probabilities estimated from corpus counts by maximum likelihood.

The chapter includes a simple corpus example of a few sentences and dips into the Brown corpus; those details are not included in these notes.
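The maximum likelihood estimate can be sketched on the chapter's style of toy corpus (three short sentences with `<s>` and `</s>` markers): a bigram probability is just the bigram count divided by the count of the preceding word. A minimal sketch, not the book's code:

```python
from collections import Counter

# Toy corpus in the chapter's style, with <s>/</s> sentence boundary markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE bigram estimate: P(w | prev) = C(prev, w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))   # 2/3: "I" starts two of the three sentences
print(p_mle("am", "I"))    # 2/3
print(p_mle("Sam", "am"))  # 1/2
```

Note that any bigram absent from the corpus gets probability exactly zero here, which is the problem smoothing (below) addresses.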

Evaluating Language Models

The chapter then moves on to language model evaluation.

It looks at the idea of perplexity, then discusses issues around generalization, introducing the problem of unknown words missing from the training corpus. Later it introduces the concept of entropy and its relation to perplexity.
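Perplexity is the inverse probability of the test set, normalized by the number of words; in log space that is the exponentiated average negative log probability. A small sketch (my own helper, not the book's code); a sanity check is that a uniform model over V words has perplexity exactly V:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from a list of log P(w_i | history), one per test token.

    PP(W) = P(w_1 .. w_N) ** (-1/N), computed in log space for stability.
    """
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Uniform model over a 10-word vocabulary: every token has probability 1/10.
uniform = [math.log(1 / 10)] * 5
print(perplexity(uniform))  # ≈ 10, the vocabulary size
```

Lower perplexity means the model found the test data less surprising, which is why perplexity is the standard intrinsic evaluation metric for language models.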

Smoothing

The chapter then presents several methods for smoothing, i.e., ways to assign probability to words in our vocabulary when they appear in a context unseen in training, rather than giving them probability zero.
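The simplest of these methods is Laplace (add-one) smoothing: add 1 to every bigram count and add the vocabulary size V to the denominator, so no bigram has zero probability. A minimal sketch, assuming a tiny toy corpus of my own:

```python
from collections import Counter

# Tiny toy corpus (flattened token stream, illustrative only).
tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))  # vocabulary size

def p_laplace(w, prev):
    """Add-one smoothed bigram: (C(prev, w) + 1) / (C(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# An unseen bigram now gets a small nonzero probability instead of 0.
print(p_laplace("Sam", "I"))  # unseen: 1 / (C(I) + V) = 1/7
print(p_laplace("am", "I"))   # seen: (C(I, am) + 1) / (C(I) + V) = 3/7
```

The cost, as the chapter notes for smoothing generally, is that probability mass is shifted away from seen events toward unseen ones.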