NLTK Book Chapter 01: Language Processing and Python
Metadata
Title: Language Processing and Python
Number: 1
Outline
The chapter introduces Python and NLTK, looking at:
What can we achieve by combining simple programming techniques with large quantities of text?
How can we automatically extract key words and phrases that sum up the style and content of a text?
What tools and techniques does the Python programming language provide for such work?
What are some of the interesting challenges of natural language processing?
Getting started
The chapter starts with some basic use of NLTK:
import nltk
nltk.download() # grab the data
from nltk.book import * #import the texts
text1.concordance("monstrous") # see the concordance for the parameter
text1.similar("monstrous") # see other words in similar contexts
text1.common_contexts(["monstrous", "very"]) # see shared contexts
text1.dispersion_plot(["citizens", "democracy", "freedom"]) # plot distribution over the text
len(text1) # count tokens (includes duplicates)
sorted(set(text1)) # sort the unique tokens alphabetically
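The token-counting ideas above can be sketched without the NLTK corpora; the toy token list below is an assumption standing in for text1:

```python
# A minimal sketch of token counting, using a toy token list
# in place of NLTK's text1 (which requires downloaded corpora).
tokens = ["the", "whale", "the", "sea", "whale", "and", "the", "ship"]

total_tokens = len(tokens)        # counts duplicates
vocabulary = sorted(set(tokens))  # unique tokens, alphabetical
lexical_diversity = len(set(tokens)) / len(tokens)

print(total_tokens)   # 8
print(vocabulary)     # ['and', 'sea', 'ship', 'the', 'whale']
print(lexical_diversity)  # 0.625
```

The type/token ratio computed at the end is the "lexical diversity" measure the chapter uses to compare texts.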
Closer look at Python
The chapter introduces Python data structures, starting with lists, and the idea of a text as a sequence of symbols that can be represented in a list. It introduces list concatenation with the + operator, indexing and slicing with [].
It then turns to variable assignment and strings, introducing the basic string methods:
name = 'Brian'
name[0] # 'B'
name[:4] # 'Bria'
name * 2 # 'BrianBrian'
name + '!' # 'Brian!'
' '.join(['Brian', 'Messiah']) # 'Brian Messiah'
'naughty boy'.split() # ['naughty', 'boy']
Later it introduces some word comparison operators:
Function | Meaning |
---|---|
s.startswith(t) | test if s starts with t |
s.endswith(t) | test if s ends with t |
t in s | test if t is a substring of s |
s.islower() | test if s contains cased characters and all are lowercase |
s.isupper() | test if s contains cased characters and all are uppercase |
s.isalpha() | test if s is non-empty and all characters in s are alphabetic |
s.isalnum() | test if s is non-empty and all characters in s are alphanumeric |
s.isdigit() | test if s is non-empty and all characters in s are digits |
s.istitle() | test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals) |
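These predicates are most useful inside list comprehensions for filtering tokens; a small sketch, with an assumed word list rather than one of the NLTK texts:

```python
# Filtering a toy word list with the comparison operators above.
words = ["Brian", "naughty", "boy", "1979", "Monty", "spam42"]

caps = [w for w in words if w.istitle()]      # title-cased words
digits = [w for w in words if w.isdigit()]    # all-digit tokens
alpha = [w for w in words if w.isalpha()]     # purely alphabetic words
ends_y = [w for w in words if w.endswith("y")]

print(caps)    # ['Brian', 'Monty']
print(digits)  # ['1979']
print(ends_y)  # ['naughty', 'boy', 'Monty']
```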
The chapter reviews a lot of other basic Python, list comprehensions, control flow, functions etc.
Simple Statistics
The chapter presents some simple stats using NLTK:
fdist = FreqDist(text1)
fdist.most_common(50) # 50 most common tokens
fdist['Brian'] # count of token
fdist2 = FreqDist(len(w) for w in text1) # distribution of lengths
A summary of the frequency distribution methods in nltk:
Example | Description |
---|---|
fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
fdist[sample] += 1 | increment the count for this sample |
fdist['monstrous'] | count of the number of times a given sample occurred |
fdist.freq('monstrous') | frequency of a given sample |
fdist.N() | total number of samples |
fdist.most_common(n) | the n most common samples and their frequencies |
for sample in fdist: | iterate over the samples |
fdist.max() | sample with the greatest count |
fdist.tabulate() | tabulate the frequency distribution |
fdist.plot() | graphical plot of the frequency distribution |
fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
fdist1 |= fdist2 | update fdist1 with counts from fdist2 |
fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |
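Several of these methods can be imitated with Python's standard collections.Counter, which FreqDist closely resembles; a sketch on a toy token list (the tokens are an assumption, not text1):

```python
from collections import Counter

# FreqDist-style counting sketched with collections.Counter,
# using a toy token list rather than an NLTK text.
tokens = ["the", "whale", "the", "sea", "the", "whale"]
fdist = Counter(tokens)

print(fdist.most_common(2))  # [('the', 3), ('whale', 2)]
print(fdist["whale"])        # 2
print(sum(fdist.values()))   # total samples, like fdist.N() -> 6
print(fdist["whale"] / sum(fdist.values()))  # like fdist.freq('whale')
```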
Collocations and Bigrams
A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds definitely odd.
You can extract from a text a list of word pairs, also known as bigrams, using the NLTK function bigrams(). Use list(bigrams(text)) to see them as a list.
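The behaviour of bigrams() can be approximated in plain Python with zip, pairing each token with its successor; the sample tokens here are an assumption:

```python
# Bigrams sketched with zip: pair each token with the next one.
tokens = ["red", "wine", "is", "a", "collocation"]
bigrams = list(zip(tokens, tokens[1:]))

print(bigrams)
# [('red', 'wine'), ('wine', 'is'), ('is', 'a'), ('a', 'collocation')]
```

Collocation-finding then amounts to selecting those bigrams whose words co-occur more often than chance would predict.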
Natural Language Understanding
The chapter concludes by introducing the following tasks in NLP to motivate further study:
Word sense disambiguation
e.g. serve: help with food or drink, hold an office, put ball into play.
or the different roles marked by the preposition by:
The lost children were found by the searchers (agentive)
The lost children were found by the mountain (locative)
The lost children were found by the afternoon (temporal)
Pronoun Resolution
“who did what to whom”:
The thieves stole the paintings. They were subsequently sold.
The thieves stole the paintings. They were subsequently caught.
The thieves stole the paintings. They were subsequently found. (ambiguous)
Generating Language Output
If we can solve the first two issues we can move on to generate language output in tasks like machine translation and question answering.
Textual Entailment
Much attention has been focused on a task called Recognizing Textual Entailment (RTE). The basic scenario is simple.
Here you have a hypothesis, eg “Sandra Goudie was defeated by Max Purnell” and another short text that seems to be relevant, eg, “Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly winning the seat of Coromandel by defeating Labour candidate Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into third place.”
The question is, Does the text provide enough evidence for you to accept the hypothesis? In this particular case, the answer is no.
It is very hard to come up with automated methods for making the right decision. RTE Challenges provide data that allow competitors to develop their systems, but not enough data for “brute force” machine learning techniques. Consequently, some linguistic analysis is crucial. In the previous example, it is important for the system to note that Sandra Goudie names the person being defeated in the hypothesis, not the person doing the defeating in the text. Here is another difficult text-hypothesis pair:
Text: David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles, sermons and books
Hypothesis: Golinkin has written eighteen books
In order to determine whether the hypothesis is supported by the text, the system needs the following background knowledge: (i) if someone is an author of a book, then he/she has written that book; (ii) if someone is an editor of a book, then he/she has not written (all of) that book; (iii) if someone is editor or author of eighteen books, then one cannot conclude that he/she is author of eighteen books.