Alex's Notes

NLTK Book Chapter 01: Language Processing and Python

Metadata

Outline

The chapter introduces Python and NLTK, looking at:

  1. What can we achieve by combining simple programming techniques with large quantities of text?

  2. How can we automatically extract key words and phrases that sum up the style and content of a text?

  3. What tools and techniques does the Python programming language provide for such work?

  4. What are some of the interesting challenges of natural language processing?

Getting started

The chapter starts with some basic use of NLTK:

import nltk
nltk.download() # grab the data
from nltk.book import * #import the texts

text1.concordance("monstrous") # see the concordance for the parameter
text1.similar("monstrous") # see other words in similar contexts
text2.common_contexts(["monstrous", "very"]) # see shared contexts

text1.dispersion_plot(["citizens", "democracy", "freedom"]) # plot distribution over the text

len(text1) # count tokens (includes duplicates)
sorted(set(text1)) # sort the unique tokens alphabetically
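The same token/type counting works on any plain Python list, no NLTK required. A minimal sketch on a toy token list of my own:

```python
# Toy token list (my own example, not from the book's texts)
text = ['the', 'red', 'wine', 'and', 'the', 'white', 'wine']

print(len(text))          # 7 -- tokens, duplicates included
print(sorted(set(text)))  # unique tokens in alphabetical order
```

`set` discards duplicates, so `len(set(text))` gives the vocabulary size rather than the token count.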

Closer look at Python

The chapter introduces Python data structures, starting with lists, and the idea of a text as a sequence of symbols that can be represented in a list. It introduces list concatenation with the + operator, indexing and slicing with [].

It then turns to variable assignment and strings, introducing the basic string methods:

name = 'Brian'
name[0] # 'B'
name[:4] # 'Bria'
name * 2 # 'BrianBrian'
name + '!' # 'Brian!'
' '.join(['Brian', 'Messiah']) # 'Brian Messiah'
'naughty boy'.split() # ['naughty', 'boy']

Later it introduces some word comparison operators:

Function             Meaning
s.startswith(t)      test if s starts with t
s.endswith(t)        test if s ends with t
t in s               test if t is a substring of s
s.islower()          test if s contains cased characters and all are lowercase
s.isupper()          test if s contains cased characters and all are uppercase
s.isalpha()          test if s is non-empty and all characters in s are alphabetic
s.isalnum()          test if s is non-empty and all characters in s are alphanumeric
s.isdigit()          test if s is non-empty and all characters in s are digits
s.istitle()          test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals)
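These are ordinary Python string methods, so they can be tried on any string (the example values here are my own):

```python
s = 'Brian'

print(s.startswith('Br'))  # True
print(s.endswith('an'))    # True
print('ria' in s)          # True: 'ria' is a substring of 'Brian'
print(s.islower())         # False: 'B' is uppercase
print(s.isalpha())         # True: letters only
print(s.istitle())         # True: initial capital, rest lowercase
```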

The chapter reviews a lot of other basic Python: list comprehensions, control flow, functions, etc.
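A quick sketch of the list-comprehension style the chapter uses, on a made-up word list of my own:

```python
words = ['The', 'lost', 'children', 'were', 'found', 'by', 'the', 'searchers']

# filter with a condition: words longer than four characters
long_words = [w for w in words if len(w) > 4]
print(long_words)  # ['children', 'found', 'searchers']

# simple control flow over the same list
for w in words:
    if w.istitle():
        print(w, 'is titlecased')  # prints only 'The is titlecased'
```

Combining a comprehension with the string predicates above is the chapter's standard idiom for filtering a text.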

Simple Statistics

The chapter presents some simple stats using NLTK:

fdist = FreqDist(text1)
fdist.most_common(50) # 50 most common tokens
fdist['Brian'] # count of token
fdist2 = FreqDist(len(w) for w in text1) # distribution of lengths

A summary of the frequency distribution methods in nltk:

Example                        Description
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist[sample] += 1             increment the count for this sample
fdist['monstrous']             count of the number of times a given sample occurred
fdist.freq('monstrous')        frequency of a given sample
fdist.N()                      total number of samples
fdist.most_common(n)           the n most common samples and their frequencies
for sample in fdist:           iterate over the samples
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   graphical plot of the frequency distribution
fdist.plot(cumulative=True)    cumulative plot of the frequency distribution
fdist1 |= fdist2               update fdist1 with counts from fdist2
fdist1 < fdist2                test if samples in fdist1 occur less frequently than in fdist2
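FreqDist is essentially a counting dictionary, so its core behaviour can be sketched with the standard library's collections.Counter (the word list is a made-up example, and sum(fdist.values()) stands in for fdist.N()):

```python
from collections import Counter

words = ['the', 'red', 'wine', 'and', 'the', 'white', 'wine', 'the']
fdist = Counter(words)

print(fdist.most_common(2))   # [('the', 3), ('wine', 2)]
print(fdist['wine'])          # 2 -- count of a given sample
total = sum(fdist.values())   # 8 -- analogue of fdist.N()
print(fdist['wine'] / total)  # analogue of fdist.freq('wine')
```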

Collocations and Bigrams

A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds definitely odd.

You can extract from a text a list of word pairs, also known as bigrams. This is easily accomplished with the NLTK function bigrams(); use list(bigrams(text)) to see them as a list.
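A bigram list is just each token paired with its successor, which can be sketched in plain Python with zip (the sample sentence is my own):

```python
def bigrams(tokens):
    # pair each token with the token that follows it
    return zip(tokens, tokens[1:])

sent = ['red', 'wine', 'is', 'a', 'collocation']
print(list(bigrams(sent)))
# [('red', 'wine'), ('wine', 'is'), ('is', 'a'), ('a', 'collocation')]
```

Feeding bigram counts into a frequency distribution is the basic recipe for finding collocations: frequent bigrams of otherwise rare words.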

Natural Language Understanding

The chapter concludes by introducing the following tasks in NLP to motivate further study:

Word sense disambiguation

e.g. serve: help with food or drink, hold an office, put ball into play.

Or the different senses of by:

  • The lost children were found by the searchers (agentive)

  • The lost children were found by the mountain (locative)

  • The lost children were found by the afternoon (temporal)

Pronoun Resolution

“who did what to whom”:

  • The thieves stole the paintings. They were subsequently sold.

  • The thieves stole the paintings. They were subsequently caught.

  • The thieves stole the paintings. They were subsequently found. (ambiguous)

Generating Language Output

If we can solve the first two issues we can move on to generate language output in tasks like machine translation and question answering.

Textual Entailment

A lot of attention has been on a task called Recognizing Textual Entailment (RTE). The basic scenario is simple.

Here you have a hypothesis, eg “Sandra Goudie was defeated by Max Purnell” and another short text that seems to be relevant, eg, “Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly winning the seat of Coromandel by defeating Labour candidate Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into third place.”

The question is: does the text provide enough evidence for you to accept the hypothesis? In this particular case, the answer is no.

It is very hard to come up with automated methods for making the right decision. RTE Challenges provide data that allow competitors to develop their systems, but not enough data for “brute force” machine learning techniques. Consequently, some linguistic analysis is crucial. In the previous example, it is important for the system to note that Sandra Goudie names the person being defeated in the hypothesis, not the person doing the defeating in the text. Here is another difficult text-hypothesis pair:

Text: David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles, sermons and books

Hypothesis: Golinkin has written eighteen books

In order to determine whether the hypothesis is supported by the text, the system needs the following background knowledge: (i) if someone is an author of a book, then he/she has written that book; (ii) if someone is an editor of a book, then he/she has not written (all of) that book; (iii) if someone is editor or author of eighteen books, then one cannot conclude that he/she is author of eighteen books.