Alex's Notes

cm3060 Topic 05: Text Categorization

Main Info

Description

This week we explore the theory underpinning generative approaches to text categorization and experiment with applying it to real-world data.

Key Reading

Other Reading

Lecture Summaries

Week Nine: Text Classification

Intro to text categorization

Introduces spam classification. Then introduces the history of work on the problem of categorization, starting with 1963 work on Bayesian resolution of the Federalist Papers' authorship, then topical literature classification.

Formalizes the problem:

We have a fixed set of classes \(C = \{c_1,c_2,\dots ,c_n\}\), a training set of labelled documents \((d_1,c_1),\dots ,(d_m,c_m)\), and a new document d we want to classify. We need a classifier \(\gamma : d \rightarrow c\) that maps documents to classes.

How do we do this? We could try using rules, but these are difficult to scale and maintain, so we tend to use machine learning now, and that’s what the module will focus on. There’s a variety of techniques: naive Bayes, logistic regression, support vector machines, and neural approaches.

Bayes’ Theorem and Classification

Introduces the bias in human reasoning whereby we ignore prior probabilities and look only at the current evidence. A rational judgement would weigh both the evidence and the prior probability.

This leads us to Bayes’ theorem, where we look at the probability of a hypothesis given the evidence: \(P(H|E) = \frac{P(H)P(E|H)}{P(E)}\)

Introduces the terms:

  • The Prior, this is \(P(H)\)

  • The Likelihood, this is \(P(E|H)\)

  • The Posterior, this is \(P(H|E)\)

  • The Marginal likelihood or model evidence, this is \(P(E)\)

Evidence should be used to update beliefs, not determine them.

Presents an example with librarians and farmers, then generalizes.
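The gist of that example can be sketched numerically. The population split and likelihood figures below are illustrative assumptions, not the lecture's numbers:

```python
# Bayes' theorem on the librarian/farmer example with made-up numbers.
# Prior P(H): librarians are assumed far rarer than farmers.
p_librarian = 1 / 21
p_farmer = 20 / 21

# Assumed likelihoods of matching a "meek and tidy" description.
p_desc_given_librarian = 0.4   # P(E|H)
p_desc_given_farmer = 0.1      # P(E|not H)

# Marginal likelihood P(E) via the law of total probability.
p_desc = p_librarian * p_desc_given_librarian + p_farmer * p_desc_given_farmer

# Posterior P(H|E) = P(H) * P(E|H) / P(E).
posterior = p_librarian * p_desc_given_librarian / p_desc
print(round(posterior, 3))  # 0.167
```

Even though the description fits a librarian four times better, the prior dominates: the rational posterior stays well below a half.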

Then turns in the next lecture to Naive Bayes as a baseline classifier.

NB the examples in the lectures are lifted directly from VanderPlas: Python Data Science Handbook, chapter 5.

Naive Bayes is very useful as a baseline: it’s fast, has few tunable parameters, and is well suited to high-dimensional datasets.

How do we adapt the theorem for classification? We can think in terms of features and labels:

\[P(L|features) = \frac{P(L)P(features|L)}{P(features)}\]

To choose between labels we can compare their posterior probabilities:

\[\frac{P(L_1|features)}{P(L_2|features)} = \frac{P(L_1)P(features|L_1)}{P(L_2)P(features|L_2)}\]

So we need to calculate \(P(features|L_i)\) for each label. We can assume a generative model that produces the features for each label; specifying this model is part of training the Bayesian classifier.

We apply some ‘naive’ assumptions: the features are conditionally independent given the label, and each feature contributes equally.

Demonstrates an example of an NB classifier with scikit-learn (not text-related).
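A minimal sketch of that kind of non-text demo, assuming synthetic blob data and scikit-learn's GaussianNB (the lecture's exact dataset and parameters may differ):

```python
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data; Gaussian NB models each class's features
# with a per-class Gaussian distribution.
X, y = make_blobs(n_samples=100, centers=2, random_state=2, cluster_std=1.5)

model = GaussianNB()
model.fit(X, y)

# predict_proba exposes the posterior P(L|features) for each class.
probs = model.predict_proba(X[:2])
print(probs.shape)  # (2, 2): one posterior per class for each sample
```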

Multinomial Naive Bayes

Here the features are generated from a multinomial distribution, so we’re looking at the probability of observing counts across various categories. Training amounts to finding a best-fit multinomial distribution for each label.

Looks at the TF-IDF vectorizer as a way of generating numerical representations of documents and walks through a scikit-learn workflow on an existing dataset. The example is taken from VanderPlas, Chapter 05: Machine Learning.
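That workflow can be sketched end to end; the toy spam/ham documents below are stand-ins for the lecture's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training corpus standing in for the real dataset.
train_docs = ["cheap meds buy now", "meeting agenda attached",
              "win cash now", "project meeting tomorrow"]
train_labels = ["spam", "ham", "spam", "ham"]

# TF-IDF turns each document into a numerical vector; multinomial NB
# then fits a per-class distribution over those features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["buy cheap now"]))  # ['spam'] on this toy corpus
```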

Week Ten: Sentiment Analysis

Introduction to Sentiment Analysis

Gives some examples of sentiment in book reviews (of his own book!).

Introduces some contexts for sentiment analysis. Customer voice and product reviews, public opinion, market prediction.

Introduces the Scherer taxonomy of affective states:

  • Personality traits: stable personality dispositions and typical behaviour (persist over the long term), e.g. nervous, reckless, jealous, morose.

  • Mood: diffuse non-caused low-intensity long-duration change in subjective feeling. eg cheerful, gloomy, listless, irritable.

  • Emotion: brief, organically synchronized, evaluation of a major event. Angry, sad, joyful, ashamed, proud, elated, afraid

  • Interpersonal stances: affective stance towards another person in a specific interaction: friendly, flirtatious, distant, cold, warm, contemptuous

  • Attitudes: enduring, affectively coloured beliefs and dispositions towards objects or persons: liking, loving, hating, desiring.

The simplest form of sentiment classification is just positive/negative. With NB we estimate the priors, ignore unknown words, estimate the likelihoods, and compare the resulting scores for each class.

Shows an example of classifying a new review with Laplace smoothing.
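The smoothing step itself is simple arithmetic; a sketch with made-up counts (not the lecture's figures):

```python
# Add-one (Laplace) smoothing: P(w|c) = (count(w, c) + 1) / (count(c) + |V|),
# so unseen words get a small non-zero likelihood instead of zero.
def smoothed_likelihood(word_count, class_total, vocab_size):
    return (word_count + 1) / (class_total + vocab_size)

# Hypothetical counts: "great" occurs 3 times across positive docs,
# which contain 20 tokens in total, over a 10-word vocabulary.
p_seen = smoothed_likelihood(3, 20, 10)    # (3 + 1) / (20 + 10) ≈ 0.133
p_unseen = smoothed_likelihood(0, 20, 10)  # (0 + 1) / (20 + 10) ≈ 0.033
print(p_seen, p_unseen)
```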

For sentiment, occurrence matters more than frequency. We tend to limit counts to binary presence/absence and not bother with TF-IDF or similar.

Negation is obviously really important. A simple baseline is to prepend a pseudo-token NOT_ to every token between a negation word and the next punctuation mark.
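A minimal sketch of that baseline; the negation list and punctuation set are illustrative choices, not the lecture's exact ones:

```python
import re

NEGATIONS = {"not", "no", "never", "n't"}  # illustrative negation cues

def mark_negation(tokens):
    """Prepend NOT_ to tokens between a negation cue and punctuation."""
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,;:!?]", tok):  # punctuation closes the scope
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATIONS:
            negating = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

print(mark_negation("I did n't like this movie , but it was short".split()))
# ['I', 'did', "n't", 'NOT_like', 'NOT_this', 'NOT_movie', ',',
#  'but', 'it', 'was', 'short']
```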

Finally introduces sentiment lexicons. These are useful when training data is limited.

One way to use these is to add a feature in classification for whether a word is a member of one of the positive/negative lexicons. Could keep a running count of these.
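A sketch of that feature, using tiny stand-in word lists rather than a real published lexicon:

```python
POSITIVE = {"good", "great", "excellent", "love"}  # stand-in lexicon
NEGATIVE = {"bad", "awful", "terrible", "hate"}    # stand-in lexicon

def lexicon_features(tokens):
    # Running counts of lexicon hits become classifier features.
    return {
        "pos_count": sum(t.lower() in POSITIVE for t in tokens),
        "neg_count": sum(t.lower() in NEGATIVE for t in tokens),
    }

print(lexicon_features("a great film with a terrible ending".split()))
# {'pos_count': 1, 'neg_count': 1}
```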

Sentiment Analysis using Supervised Learning and Lexicons

The next two lectures walk through some sentiment analysis using NLTK. NLTK comes with some classifiers built in, including a naive Bayes classifier and a decision tree classifier. The lecture sets up a boolean vector for each document based on the inclusion of terms, then trains a couple of classifier models.

This is how he does feature extraction from the tokenized doc:


def document_features(document):
    document_words = set(document)
    features = {}
    # word_features is a corpus-wide list of the top N word tokens
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
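For context, a self-contained sketch of how featuresets like this feed NLTK's built-in Naive Bayes classifier; the toy corpus and word_features list here are made up, not the lecture's data:

```python
import nltk

# Toy labelled corpus standing in for the lecture's dataset.
labelled_docs = [
    ("great fun loved it".split(), "pos"),
    ("boring and awful".split(), "neg"),
    ("loved the great acting".split(), "pos"),
    ("awful plot boring cast".split(), "neg"),
]
word_features = ["great", "loved", "awful", "boring"]  # top-N tokens

def document_features(document):
    document_words = set(document)
    return {"contains({})".format(w): (w in document_words)
            for w in word_features}

# NLTK expects a list of (featureset, label) pairs.
train_set = [(document_features(d), c) for d, c in labelled_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(document_features("loved it great".split())))
```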

The lexicon lecture features the textblob library, which wraps NLTK and the pattern library to provide a simple API for text processing (see the docs).

from textblob import TextBlob

def get_blob_sentiment(sentence):
    # polarity ranges from -1.0 (most negative) to 1.0 (most positive)
    sentiment = TextBlob(sentence).sentiment
    return sentiment.polarity

Then shows the approach with NLTK:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

def get_nltk_sentiment(sentence):
    # VADER's compound score ranges from -1 (most negative) to +1 (most positive)
    sid = SentimentIntensityAnalyzer()
    ss = sid.polarity_scores(sentence)
    return ss['compound']

Lab Summaries

The Week Nine lab walks through the multinomial classification problem with no real coding, just playing with the existing code and trying different categories.

The Week Ten lab has you try to improve the performance of a sentiment classifier through feature engineering.