NLTK Book Chapter 05: Categorizing and Tagging Words

Metadata

Title: Categorizing and Tagging Words
Number: 5
Book: Bird et al: The NLTK Book
Full text

Core Ideas

The goal of this chapter is to answer the following questions:

What are lexical categories and how are they used in natural language processing?
What is a good Python data structure for storing words and their categories?
How can we automatically tag each word of a text with its word class?

Intro

Using the NLTK POS tagger is straightforward: we tokenize a text and then call nltk.pos_tag(text) on the resulting list of tokens to get a list of (token, POS tag) tuples. A tagged token can also be written as a string in the format “word/POS”, eg “jury/NN”, which is how many of the NLTK corpora are stored. NLTK has functions for parsing such strings into their tuple forms.
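The “word/POS” convention can be illustrated with a minimal plain-Python sketch. The function name str_to_tuple and the sample sentence are made up for illustration; NLTK's own helper for this job is nltk.tag.str2tuple.

```python
def str_to_tuple(token, sep="/"):
    """Split 'word/TAG' into (word, TAG), splitting on the LAST separator."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator present: return the token with no tag.
        return (token, None)
    return (word, tag.upper())


tagged_text = "The/AT jury/NN said/VBD it/PPS was/BEDZ fine/JJ ./."
tagged_tokens = [str_to_tuple(t) for t in tagged_text.split()]
print(tagged_tokens[:3])  # [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD')]
```

Splitting on the last separator means a token such as “./.” still parses into a word and a tag.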

Different corpora use different tagsets, so NLTK provides a mapping to a universal tagset to make them easier to work with together. The universal POS tagset is as follows:

Tag	Meaning	English Examples
ADJ	adjective	new, good, high, special, big, local
ADP	adposition	on, of, at, with, by, into, under
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner, article	the, a, some, most, every, no, which
NOUN	noun	year, home, costs, time, Africa
NUM	numeral	twenty-four, fourth, 1991, 14:24
PRT	particle	at, on, out, over, per, that, up, with
PRON	pronoun	he, their, her, its, my, I, us
VERB	verb	is, say, told, given, playing, would
.	punctuation marks	. , ; !
X	other	ersatz, esprit, dunno, gr8, univeristy
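As a rough sketch of what mapping to the universal tagset involves, the snippet below converts a few Brown-style tags by dictionary lookup. The mapping is a small hand-picked subset, and sending unknown tags to X is an assumption of this sketch rather than NLTK's behaviour. In NLTK itself, corpus readers accept a tagset argument, eg nltk.corpus.brown.tagged_words(tagset='universal').

```python
# Hand-picked subset of Brown-corpus tags and their universal equivalents.
BROWN_TO_UNIVERSAL = {
    "AT": "DET",
    "NN": "NOUN",
    "NNS": "NOUN",
    "VBD": "VERB",
    "JJ": "ADJ",
    "IN": "ADP",
    "RB": "ADV",
    "CC": "CONJ",
    ".": ".",
}


def to_universal(tagged_tokens, mapping=BROWN_TO_UNIVERSAL):
    """Replace each tag with its universal tag; unmapped tags become 'X' (assumption)."""
    return [(word, mapping.get(tag, "X")) for word, tag in tagged_tokens]


brown_style = [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")]
print(to_universal(brown_style))
# [('The', 'DET'), ('jury', 'NOUN'), ('said', 'VERB'), ('.', '.')]
```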

Parts of Speech

Nouns generally refer to people, places, things, or concepts. Nouns can appear after determiners and adjectives, and can be the subject or object of a verb. The simplified noun tags are N for common nouns (like book) and NP for proper nouns (like Scotland).
Verbs describe events and actions. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases. Verb tags include VBD (past tense) and VBN (past participle).
Adjectives describe nouns and can be used as modifiers (eg “the large pizza”) or in predicates (eg “the pizza is large”). Adjectives can have internal structure (eg fall + ing in falling stocks).
Adverbs modify verbs to specify the time, manner, place, or direction of the event described by the verb. Adverbs may also modify adjectives (eg Mary’s teacher was really nice).
Articles or determiners eg the, a.
Modals eg should, may.
Personal pronouns eg she, they.

Python Dictionaries

This section reviews Python dictionary methods, most notably default dictionaries (ie dictionaries that supply a default value when a missing key is looked up), eg defaultdict(int), which uses 0 as the default, or defaultdict(lambda: 'UNK'), which uses 'UNK'.
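A short sketch of both kinds of default dictionary; the sample words and tags are made up for illustration.

```python
from collections import defaultdict

# Counting with defaultdict(int): missing keys start at 0.
tag_counts = defaultdict(int)
for word, tag in [("the", "DET"), ("jury", "NOUN"), ("said", "VERB"), ("the", "DET")]:
    tag_counts[tag] += 1

# A custom default via a zero-argument function.
word_tags = defaultdict(lambda: "UNK")
word_tags["jury"] = "NOUN"

print(tag_counts["DET"], tag_counts["ADJ"])  # 2 0
print(word_tags["jury"], word_tags["blog"])  # NOUN UNK
```

Note that looking up a missing key with square brackets also stores the default under that key, so "blog" is now an entry in word_tags.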

Building Taggers

The chapter walks through a variety of increasingly sophisticated tagging methods.

First it just tags everything as a noun; this is a default tagger. It performs poorly: only about 1 in 8 tokens is tagged correctly. However, once several thousand words of English text have been processed, most new words will be nouns, so a default tagger can help improve the robustness of POS taggers.
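The idea of a default tagger, and how it is scored, can be sketched in plain Python. The helper names and the tiny gold-standard sentence are hypothetical; in NLTK the equivalent tagger is nltk.DefaultTagger('NN').

```python
def default_tag(tokens, tag="NN"):
    """Assign the same tag to every token."""
    return [(tok, tag) for tok in tokens]


def accuracy(predicted, gold):
    """Fraction of (word, tag) pairs that match the gold standard."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Tiny hand-tagged sentence used as the gold standard (illustrative only).
gold = [("The", "AT"), ("jury", "NN"), ("praised", "VBD"),
        ("the", "AT"), ("report", "NN"), (".", ".")]
tokens = [word for word, _ in gold]

predicted = default_tag(tokens)
print(accuracy(predicted, gold))  # 2 of the 6 tokens are nouns
```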

The next step is a regular expression tagger, which tries some representative patterns (eg .*ould$ for modals, digit patterns for numbers…) and, when none of them match, falls back to the default.
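A minimal sketch of a regular expression tagger using only the re module. The pattern list is a small illustrative selection and the function name regex_tag is hypothetical; NLTK offers nltk.RegexpTagger, which takes a similar list of (pattern, tag) pairs.

```python
import re

# Patterns are tried in order; the first match wins.
PATTERNS = [
    (r".*ould$", "MD"),                # modals: would, could, should
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*s$", "NNS"),                  # plural nouns
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]


def regex_tag(tokens, default="NN"):
    """Tag each token with the first matching pattern, else the default tag."""
    tagged = []
    for tok in tokens:
        for pattern, tag in COMPILED:
            if pattern.match(tok):
                tagged.append((tok, tag))
                break
        else:
            # No pattern matched: fall back to the default tag.
            tagged.append((tok, default))
    return tagged


print(regex_tag(["They", "would", "buy", "42", "shares"]))
# [('They', 'NN'), ('would', 'MD'), ('buy', 'NN'), ('42', 'CD'), ('shares', 'NNS')]
```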

The next step is a lookup tagger, where we look up the N most frequent words and store their most likely tag (from a pre-tagged corpus). We can then back off to the default for less frequent words.

So we can combine these by first trying the lookup tagger, falling back to the regex one, then falling back to the default.
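The combined chain (lookup, then regex, then default) might be sketched as follows. This is plain Python under stated assumptions, not NLTK's implementation: the function names, the toy training data and the two patterns are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def build_lookup(tagged_tokens, n):
    """Map each of the n most frequent words to its most frequent tag."""
    word_freq = Counter(word for word, _ in tagged_tokens)
    tag_freq = defaultdict(Counter)
    for word, tag in tagged_tokens:
        tag_freq[word][tag] += 1
    return {word: tag_freq[word].most_common(1)[0][0]
            for word, _ in word_freq.most_common(n)}


def tag_with_backoff(tokens, lookup, patterns, default="NN"):
    """Try the lookup table, then the regex patterns, then the default tag."""
    tagged = []
    for tok in tokens:
        tag = lookup.get(tok)
        if tag is None:
            tag = next((t for p, t in patterns if re.match(p, tok)), default)
        tagged.append((tok, tag))
    return tagged


# Toy pre-tagged corpus (illustrative only).
training = [("the", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
            ("report", "NN"), ("the", "AT"), ("said", "VBD")]
lookup = build_lookup(training, n=2)  # keeps only "the" and "said"
patterns = [(r".*ould$", "MD"), (r".*ed$", "VBD")]

print(tag_with_backoff(["the", "jury", "would", "have", "praised", "it"],
                       lookup, patterns))
# [('the', 'AT'), ('jury', 'NN'), ('would', 'MD'), ('have', 'NN'),
#  ('praised', 'VBD'), ('it', 'NN')]
```

Here “jury” is too infrequent to make the lookup table and matches no pattern, so it only gets NN from the default.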

N-Gram Taggers

A unigram tagger is like a lookup tagger but easier to set up in NLTK: we pass a tagged corpus to the constructor, and the trained tagger assigns each word the tag it most frequently had in the training data: unigram_tagger = nltk.UnigramTagger(tagged_corpus)

We can generalize this to include more of the context, creating bigram or trigram taggers for example. We can use backoff here too: tag according to the trigram, then back off to the bigram when that context was not seen in training.
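To make the role of context and backoff concrete, here is a plain-Python sketch of a bigram tagger that backs off to a unigram tagger and then to a default tag. It is not NLTK's implementation; all names and the toy training sentences are hypothetical. In NLTK the same chain is built by passing a backoff argument, eg nltk.BigramTagger(train_sents, backoff=unigram_tagger).

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def train_bigram(tagged_sents):
    """Most frequent tag for each (previous tag, current word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_sentence(tokens, bigram, unigram, default="NN"):
    """Use the bigram context if seen, else the unigram tag, else the default."""
    tagged, prev_tag = [], None
    for tok in tokens:
        tag = bigram.get((prev_tag, tok))
        if tag is None:
            tag = unigram.get(tok, default)
        tagged.append((tok, tag))
        prev_tag = tag
    return tagged


# Toy training sentences (illustrative only).
train = [
    [("I", "PPSS"), ("can", "MD"), ("swim", "VB")],
    [("the", "AT"), ("can", "NN"), ("leaks", "VBZ")],
    [("they", "PPSS"), ("can", "MD"), ("run", "VB")],
]
unigram = train_unigram(train)
bigram = train_bigram(train)

print(tag_sentence(["the", "can", "can", "leak"], bigram, unigram))
# [('the', 'AT'), ('can', 'NN'), ('can', 'MD'), ('leak', 'NN')]
```

The first “can” is tagged NN because it follows a determiner. The second follows a noun, a context never seen in training, so the tagger backs off to the unigram choice MD. “leak” is unknown to both models and gets the default.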

Taggers take a while to train, so we can pickle them to load them up again when we need them.
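Saving and reloading with pickle looks roughly like this. A plain dictionary stands in for a trained tagger here, and the file name tagger.pkl is arbitrary; a trained NLTK tagger object can be dumped and loaded the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a trained tagger (illustrative only).
tagger_model = {"the": "AT", "jury": "NN", "said": "VBD"}

path = os.path.join(tempfile.mkdtemp(), "tagger.pkl")

# Save the trained model to disk.
with open(path, "wb") as f:
    pickle.dump(tagger_model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later, possibly in another process: load it back without retraining.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == tagger_model)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.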

Transformation Taggers

Transformation taggers like the Brill tagger apply rule sequences to successively improve their tagging. The rules are linguistically interpretable.
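The flavour of such rules can be shown with a small sketch that applies two hand-written transformations to an initial tagging. A real Brill tagger learns its rules from training data; here they are simply given, and the helper names and the in-place, left-to-right application order are assumptions of this sketch.

```python
def prev_tag_is(tag):
    """Condition: the previous token currently has the given tag."""
    return lambda tags, i: i > 0 and tags[i - 1] == tag


def next_tag_is(tag):
    """Condition: the next token currently has the given tag."""
    return lambda tags, i: i + 1 < len(tags) and tags[i + 1] == tag


def apply_rules(tagged, rules):
    """Apply each (from_tag, to_tag, condition) rule in order over the sentence."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    for from_tag, to_tag, condition in rules:
        for i in range(len(tags)):
            if tags[i] == from_tag and condition(tags, i):
                tags[i] = to_tag
    return list(zip(words, tags))


rules = [
    ("NN", "VB", prev_tag_is("TO")),   # replace NN with VB after TO
    ("TO", "IN", next_tag_is("NNS")),  # replace TO with IN before NNS
]

# Initial tagging, as a simple unigram tagger might produce it.
initial = [("to", "TO"), ("increase", "NN"), ("grants", "NNS"),
           ("to", "TO"), ("states", "NNS")]

print(apply_rules(initial, rules))
# [('to', 'TO'), ('increase', 'VB'), ('grants', 'NNS'), ('to', 'IN'), ('states', 'NNS')]
```

Each rule reads as a plain statement about tags in context, which is what makes the result linguistically interpretable.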

Determining Word Categories

The chapter concludes by looking at a range of hints for determining the category of a word:

Morphological clues eg -ness is a suffix that combines with an adjective to form a noun.
Syntactic clues eg an adjective can occur immediately before a noun.
Semantic clues eg the definition of a noun as “the name of a person, place or thing”. These are hard to formalize.
New words - prepositions are regarded as a closed class: there is a limited set of words belonging to the class, and membership changes very gradually. Nouns are an open class, since we invent new nouns all the time, so new words are likely to be nouns.

Alex's Notes