Alex's Notes

cm3060 Topic 03: Language Modelling

Main Info

Description

In this topic we start to get into mathematical models of natural language:

Week Five: Introduction to Language Modelling; Conditional Frequency Distributions; Text generation

Week Six: Probability Estimation; Evaluating language models; Topic modelling (very briefly touches on collocations)

Key Reading

Additional Assigned Reading

Supplementary Reading (Not Assigned)

Lecture Summaries

Topic Introduction is currently missing.

Computing Statistics

Shows importing the book samples from nltk.book. Distinguishes between tokens and types: tokens are individual word instances, while types are the unique words, i.e. the vocabulary.

Shows a lexical diversity metric:

def lexical_diversity(text):
    return len(set(text)) / len(text)
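A quick check of the metric on a toy token list (hypothetical, not from the NLTK book samples):

```python
def lexical_diversity(text):
    # ratio of types (unique words) to tokens (word instances)
    return len(set(text)) / len(text)

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(lexical_diversity(tokens))  # 5 types / 6 tokens ≈ 0.833
```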

Shows the frequency distribution of a text, using the nltk.FreqDist(text) function:

f = nltk.FreqDist(text)
f.most_common(10)            # the 10 most common words
f['love']                    # frequency of 'love'
f.plot(50, cumulative=False) # plot the top 50 frequencies

Shows collocations, words that stick together more than by chance (eg ‘red wine’):

text1.collocations()

Conditional Frequency Distributions

Introduces the Brown corpus (1961).

Introduces the idea of conditional frequency distributions by showing the conditional frequencies of the Brown corpus by category.

Shows using CFDs to look at differences in word lengths across languages using the UDHR corpus.
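The idea behind a CFD is just a separate frequency distribution per condition. A minimal stdlib analogue, using toy (condition, word) pairs standing in for Brown's (category, word) pairs, sketches what nltk.ConditionalFreqDist does:

```python
from collections import Counter, defaultdict

# toy (condition, word) pairs; hypothetical, not Brown corpus data
pairs = [('news', 'the'), ('news', 'market'), ('news', 'the'),
         ('romance', 'the'), ('romance', 'love'), ('romance', 'love')]

# nltk.ConditionalFreqDist behaves much like this dict of counters
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

print(cfd['news']['the'])             # 2
print(cfd['romance'].most_common(1))  # [('love', 2)]
```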

Generating Text with Conditional Frequencies

Generates words using a bigram CFD:

text = brown.words(categories='news')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

def generate_text(cfdist, word, num=50):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()

# here's how to do it with trigrams:

trigrams = nltk.trigrams(text)
condition_pairs = (((w0,w1), w2) for w0, w1, w2 in trigrams)
cfd = nltk.ConditionalFreqDist(condition_pairs)

def generate_text(cfdist, w1, w2, num=50):
    if num < 1:
        return ''
    return w1 + ' ' + generate_text(cfdist, w2, cfdist[(w1, w2)].max(), num - 1)

Rather than just using max, we could use a ‘roulette wheel’ selection based on the probability of each match…

import nltk
import random
from nltk.corpus import brown

text = brown.words(categories='news')
trigrams = nltk.trigrams(text)
condition_pairs = (((w0, w1), w2) for w0, w1, w2 in trigrams)

cfd = nltk.ConditionalFreqDist(condition_pairs)

def roulette_select(dist):
    choice = random.choices(list(dist.keys()), list(dist.values()))[0]
    return choice

def generate_text(cfdist, w1, w2, num=50):
    if num < 1:
        return ''
    w3 = roulette_select(cfdist[(w1, w2)])
    return w1 + ' ' + generate_text(cfdist, w2, w3, num - 1)

Language Modelling

Builds on the word generation to start looking at probabilistic language models. These have uses in machine translation, speech recognition, summarization, spelling correction, etc.

Introduces the idea of the probability of a sequence:

\[P(W) = P(w_1, w_2, w_3, \dots, w_n)\]

And the probability of the next term:

\[P(w_5 | w_1, w_2, w_3, w_4)\]

Introduces the chain rule (see note on N-Grams (Jurafsky Martin)), and the problem that we can't estimate all sequences from finite training data: any sequence we didn't see in training would get zero probability.
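For reference, the chain rule factorizes the joint probability of a sequence into a product of conditionals:

\[P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i | w_1, \dots, w_{i-1})\]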

Introduces the Markov assumption as a solution: fix the length of the history you consider in the conditional probabilities. In the simplest case, use unigram frequencies; or bigrams. But these will, for example, under-estimate the probability of ‘mat’ after ‘the cat sat on the’, since a bigram model only looks at the probability of ‘mat’ following ‘the’.

Next video digs into the maximum likelihood estimator:

\[P(w_i|w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}\]

Introduces a worked example, including the sentence-start and sentence-end symbols. Introduces the sparse data problem: we need a way to model events that are rare but still possible, so we need to smooth the probability distribution (see Smoothing (Jurafsky Martin) for more detail).
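The MLE above can be checked on a tiny corpus with start/end symbols (this sketch uses the classic Jurafsky/Martin toy sentences, which may differ from the lecture's example):

```python
from collections import Counter

# toy corpus with sentence-start/end symbols
sents = [['<s>', 'i', 'am', 'sam', '</s>'],
         ['<s>', 'sam', 'i', 'am', '</s>'],
         ['<s>', 'i', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', '</s>']]

unigrams = Counter(w for s in sents for w in s)
bigrams = Counter((w1, w2) for s in sents for w1, w2 in zip(s, s[1:]))

def p_mle(w2, w1):
    # P(w2 | w1) = c(w1, w2) / c(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_mle('i', '<s>'))  # 2/3: two of the three sentences start with 'i'
print(p_mle('am', 'i'))   # 2/3: 'i' occurs three times, followed twice by 'am'
```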

Introduces Laplace smoothing (add one to all the counts):

\[P_{LP}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)+1}{c(w_{i-1}) + V}\]

Note the addition of the vocabulary size V in the denominator. This is crude and not often used; instead we would use Good-Turing or Kneser-Ney smoothing.
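The add-one formula can be sketched on hypothetical counts (not the lecture's example):

```python
from collections import Counter

# hypothetical bigram and unigram counts
bigrams = Counter({('the', 'cat'): 2, ('the', 'mat'): 1})
unigrams = Counter({'the': 5, 'cat': 2, 'sat': 1, 'mat': 1})
V = len(unigrams)  # vocabulary size

def p_laplace(w2, w1):
    # add one to every bigram count; add V to the denominator so each
    # conditional distribution still sums to 1 over the vocabulary
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace('cat', 'the'))  # (2 + 1) / (5 + 4) = 1/3
print(p_laplace('sat', 'the'))  # unseen bigram still gets (0 + 1) / (5 + 4)
```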

Alternatively we might use backoff, where we fall back to a smaller context when we lack sufficient evidence at the higher context (eg trigram -> bigram -> unigram). We might weight the probabilities lower when we have to back off.
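A minimal backoff sketch in the style of “stupid backoff” (hypothetical counts; the penalty weight alpha and the exact scheme are assumptions, not the lecture's):

```python
from collections import Counter

# hypothetical counts
unigrams = Counter({'the': 5, 'cat': 2, 'sat': 1, 'mat': 1})
bigrams = Counter({('the', 'cat'): 2, ('cat', 'sat'): 1, ('the', 'mat'): 1})
total = sum(unigrams.values())  # 9 tokens

def p_backoff(w2, w1, alpha=0.4):
    # use the bigram estimate if we saw the bigram,
    # otherwise back off to a down-weighted unigram estimate
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return alpha * unigrams[w2] / total

print(p_backoff('cat', 'the'))  # bigram evidence: 2/5
print(p_backoff('mat', 'cat'))  # no bigram seen: 0.4 * 1/9
```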

Or we might use interpolation, taking all three of the unigram, bigram, and trigram estimates and weighting them with lambdas that sum to 1.
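The weighted combination is a one-liner; the lambda values here are hypothetical (in practice they would be tuned on held-out data):

```python
def p_interp(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # lambdas must sum to 1
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# combine three (hypothetical) estimates for the same next word
print(p_interp(0.01, 0.2, 0.5))  # 0.1*0.01 + 0.3*0.2 + 0.6*0.5 = 0.361
```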

Evaluating Language Models

See Language Model Evaluation (Jurafsky Martin) for more details.

How do we measure the quality of our model? A good model should prefer likely sequences over unlikely ones. As with many ML tasks, we train and test on different data sets.

The gold standard is to use the model in a real application and see whether it helps; this is extrinsic evaluation. It can be expensive, though, or impossible if the application doesn't exist yet.

We might have to use intrinsic evaluation, using perplexity as our metric. The best LM is the one that best predicts our test set.

Perplexity is the inverse probability of the test set, normalized by the number of words.

Draws on Jurafsky/Martin to intuit perplexity as the average branching factor of the dataset. Uses the example of a random string of digits [0-9]: the perplexity comes out as 10 regardless of the string's length, which makes sense, since after any prefix there are 10 choices that could be made at random. Compare that to phone numbers: how do their patterns affect the perplexity? It would be lower, since each choice is constrained by the rules of phone-number formation, so the branching factor is reduced.
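The digits example can be checked numerically. This sketch computes perplexity from per-token probabilities, assuming a model that assigns each digit probability 1/10:

```python
import math

def perplexity(token_probs):
    # inverse probability of the sequence, normalized by its length:
    # PP = (p_1 * ... * p_N) ** (-1/N), computed in log space for stability
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(perplexity([0.1] * 7))   # ≈ 10
print(perplexity([0.1] * 30))  # still ≈ 10: length doesn't change it
```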

Topic Modelling

Lectures then shift gear to topic modelling. This differs from language modelling as:

  • Rather than modelling word sequence probabilities, we’re looking for a layer of abstraction that helps describe the contents of a document collection. Colloquially, ‘topics’.

  • Uses the technique of Latent Dirichlet Allocation (‘latent’ because the topics are not explicit). We can think of this as dimensionality reduction.

  • Uses scikit-learn: its CountVectorizer converts a text collection to a matrix of token counts, and its implementation of Latent Dirichlet Allocation extracts the topics.

  • The process is not deterministic, so output varies between runs; try different numbers of topics.

  • The example is taken from the ‘cookbook’ mentioned in further reading above.
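A minimal sketch of the scikit-learn pipeline described above (the documents and the choice of 2 topics are hypothetical, not the cookbook's example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    'the cat sat on the mat with the dog',
    'cats and dogs make friendly pets',
    'stock markets fell sharply on the news',
    'investors sold shares as markets dropped',
]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # document-term matrix of token counts

# LDA is not deterministic, so fix random_state for repeatable output
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a distribution over topics

print(doc_topics.shape)  # (4, 2); each row sums to 1
```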