NLTK Book Chapter 06: Learning to Classify Text
Metadata
Title: Learning to Classify Text
Number: 6
Summary
The goal of this chapter is to answer the following questions:
How can we identify particular features of language data that are salient for classifying it?
How can we construct models of language that can be used to perform language processing tasks automatically?
What can we learn about language from these models?
Introduction
Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance.
Examples include spam classification, deciding the topic of a news article, and assigning an occurrence of a word like bank to its correct semantic category (word sense disambiguation).
Variations on the basic task include multi-class classification, open-class classification (the set of labels is not pre-defined), and sequence classification, where a list of inputs is jointly classified. A classifier is supervised if it is built from a training corpus containing the correct label for each input.
NLTK provides built-in support for standard classification algorithms, but these are best used on small datasets; for large datasets it is advisable to use an external machine learning framework.
Feature Engineering
The chapter opens with a starter example of classifying names as male or female, introducing the nltk.classify package. It also introduces nltk.classify.apply_features(), which returns an object that behaves like a list but constructs feature sets lazily, saving you from building the full list in memory.
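The idea can be sketched without NLTK at all (the extractor and helper names below are mine, for illustration; in NLTK itself you would call nltk.classify.apply_features(gender_features, labeled_names)):

```python
def gender_features(name):
    """Extract a single feature: the final letter of the name."""
    return {'last_letter': name[-1].lower()}

# Labeled data in the (input, label) shape that nltk.classify expects.
labeled_names = [('Ada', 'female'), ('Alan', 'male'),
                 ('Grace', 'female'), ('Noam', 'male')]

def lazy_featuresets(extractor, labeled):
    """apply_features-style lazy mapping: build (features, label) pairs
    on demand instead of materialising the whole list in memory."""
    for item, label in labeled:
        yield (extractor(item), label)

featuresets = lazy_featuresets(gender_features, labeled_names)
print(next(featuresets))  # ({'last_letter': 'a'}, 'female')
```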
Most of the work in traditional ML classification is feature engineering - deciding what might be relevant and how to represent it. Usually there are significant gains to performance if we construct features based on understanding the task at hand. Typically we build these features through trial and error, guided by intuition. We can start with a ‘kitchen sink’ approach, adding all the features you can think of, and then seeing what the model finds useful.
One very useful approach is error analysis. Here we split the corpus into a train/dev-test/test trio and analyse the errors on the dev-test set to see where the model is failing. We can inspect wrong predictions, look for patterns in them by hand, and include representations of those patterns in the feature vector next time. This is why we need the independent third data set for final testing: the dev-test set has, in effect, been used to train the model.
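A library-free sketch of that workflow (the split proportions and helper names are mine, not the chapter's):

```python
import random

def three_way_split(data, seed=0):
    """Shuffle, then split into train / dev-test / test (80/10/10)."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[n // 5:], data[n // 10:n // 5], data[:n // 10]

def error_cases(classify, devtest):
    """Collect (features, gold, guess) triples the model got wrong,
    for manual inspection during error analysis."""
    return [(feats, gold, classify(feats))
            for feats, gold in devtest
            if classify(feats) != gold]

train, devtest, test = three_way_split(range(100))
print(len(train), len(devtest), len(test))  # 80 10 10
```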
The chapter includes examples of document classification (shown in the module videos) and training a POS tagger using a decision tree, including the ability to inspect the resulting decision tree.
It includes a demonstration of contextual features where the feature extraction method is passed the full sentence and the index of the target token, so that it can, for example, extract the prior word:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features
Even so, the classifier only has access to the previous word, not the previous word's POS tag, so it cannot generalize as effectively. For that we need sequence classification.
Sequence Classification
Joint classification models choose an appropriate label for a collection of related inputs. A variety of sequence classification models can be used for tasks like POS tagging.
One method is greedy sequence classification, or consecutive classification: choose the most likely class label for the first token, then use that answer to help find the best label for the next token, and so on. There is no lookahead; each token is classified only on the basis of the token itself and the history of earlier decisions. An example classifier of this kind is shown in the book (section 1.6).
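A minimal, library-free sketch of consecutive classification (the toy classifier below stands in for a trained model, and all names are illustrative):

```python
def history_features(sentence, i, history):
    """Features for token i, including the tag already chosen for token i-1."""
    feats = {'suffix(1)': sentence[i][-1:]}
    feats['prev-tag'] = history[i - 1] if i > 0 else '<START>'
    return feats

def greedy_tag(sentence, classify):
    """Consecutive classification: tag left to right, feeding earlier
    decisions back in as features; no lookahead, no revision."""
    history = []
    for i in range(len(sentence)):
        history.append(classify(history_features(sentence, i, history)))
    return list(zip(sentence, history))

# Toy stand-in for a trained classifier: words ending in 't' are nouns.
toy = lambda feats: 'N' if feats['suffix(1)'] == 't' else 'DT'
print(greedy_tag(['the', 'cat'], toy))  # [('the', 'DT'), ('cat', 'N')]
```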
There is no way in this approach to go back and change a past decision. One way around this is a transformational strategy: transformational joint classifiers create an initial assignment of labels for the inputs and then iteratively refine it, attempting to repair inconsistencies between related inputs. The Brill tagger is an example of this.
Another way round is to assign scores to all possible sequences of POS tags and choose the sequence with the highest overall score. This is the approach of Hidden Markov Models: a probability distribution over tag sequences for the input is computed, and the most probable sequence is chosen.
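The scoring idea can be sketched by brute force (a real HMM uses the Viterbi algorithm to avoid enumerating every sequence; the probability tables below are made up purely for illustration):

```python
from itertools import product

def best_sequence(words, tags, trans, emit):
    """Brute-force what an HMM does efficiently: score every tag sequence
    as a product of transition and emission probabilities, keep the best."""
    def score(seq):
        p, prev = 1.0, '<START>'
        for w, t in zip(words, seq):
            p *= trans.get((prev, t), 1e-6) * emit.get((t, w), 1e-6)
            prev = t
        return p
    return max(product(tags, repeat=len(words)), key=score)

# Made-up transition P(tag | prev-tag) and emission P(word | tag) tables.
trans = {('<START>', 'DT'): 0.8, ('<START>', 'N'): 0.2,
         ('DT', 'N'): 0.9, ('DT', 'DT'): 0.05,
         ('N', 'N'): 0.3, ('N', 'DT'): 0.1}
emit = {('DT', 'the'): 0.7, ('N', 'cat'): 0.4,
        ('N', 'the'): 0.01, ('DT', 'cat'): 0.01}
print(best_sequence(['the', 'cat'], ['DT', 'N'], trans, emit))  # ('DT', 'N')
```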
Classification Tasks
Sentence segmentation can be seen as a classification task for punctuation. Given a potentially sentence-ending symbol, we have to decide whether it terminates the preceding sentence. Here’s a simple example of using a pre-segmented corpus to train a segmenter:
import nltk

sents = nltk.corpus.treebank_raw.sents()
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens) - 1)
               if tokens[i] in '.?!']

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)):
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])
    return sents
The chapter also demonstrates dialogue act classification (is an utterance an instance of ‘statement’, ‘emotion’, ‘ynQuestion’, ‘continuer’ etc).
It also provides an example of recognizing textual entailment: given a piece of text T and a hypothesis H, does T provide reasonable evidence for taking H to be true?
Evaluation
The chapter introduces the standard methods of evaluating ML models: holding out a test set, and metrics such as accuracy, precision, recall, and F-measure, where F = (2 × precision × recall) / (precision + recall).
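Computed from raw true-positive/false-positive/false-negative counts, the metrics look like this (the helper name and counts are mine, for illustration):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from raw counts."""
    precision = tp / (tp + fp)   # of the items we flagged, how many were right
    recall = tp / (tp + fn)      # of the items we should have flagged, how many we got
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

print(prf(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```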
It recommends confusion matrices when there are more than two labels, and encourages k-fold cross-validation.
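k-fold cross-validation can be sketched as follows (the helper is illustrative; each fold serves once as the test set while the rest is used for training):

```python
def k_folds(data, k):
    """Yield (train, test) pairs: the i-th slice is held out as the test
    set and the remainder is returned as the training set."""
    n = len(data)
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        yield data[:lo] + data[hi:], data[lo:hi]

for train, test in k_folds(list(range(10)), 5):
    print(len(train), len(test))  # 8 2, five times over
```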
Algorithms
The chapter introduces decision trees, Naive Bayes, and Maximum Entropy classifiers in more detail. It draws the distinction between generative and conditional classifiers (with generative classifiers able to answer more questions, but harder to train).
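To make the generative idea concrete, here is a toy generative classifier in the Naive Bayes style: it models P(label) and P(feature=value | label) and picks the label maximizing their product. The smoothing and names are simplified for illustration and are not NLTK's implementation:

```python
import math
from collections import Counter

def train_nb(labeled_featuresets):
    """Toy generative classifier: estimate a prior P(label) and smoothed
    per-label feature likelihoods, then classify by highest log score."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    feat_counts = Counter()
    for feats, label in labeled_featuresets:
        for name, value in feats.items():
            feat_counts[(label, name, value)] += 1

    def classify(feats):
        def logscore(label):
            score = math.log(label_counts[label])  # prior P(label)
            for name, value in feats.items():
                # add-one smoothed estimate of P(feature=value | label)
                score += math.log((feat_counts[(label, name, value)] + 1)
                                  / (label_counts[label] + 2))
            return score
        return max(label_counts, key=logscore)

    return classify

classify = train_nb([({'f': 'a'}, 'X'), ({'f': 'a'}, 'X'), ({'f': 'b'}, 'Y')])
print(classify({'f': 'a'}), classify({'f': 'b'}))  # X Y
```

Because the model assigns a probability to (input, label) jointly, it can answer more questions than a conditional model (e.g. which inputs are likely), at the cost of being harder to estimate well.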
It also mentions that classifiers can be used to show linguistic patterns more explicitly.