Alex's Notes

cm3015 Topic 05: Probabilistic Classifiers

Main Info

Description

This topic introduces ways of dealing with uncertainty in classification tasks, specifically introducing algorithms that classify using probabilities. It also describes the popular Naïve Bayes classifier, based on Bayes’ theorem.

Key Reading

There are readings from Murphy’s first edition that don’t map very well on to the new edition. The following are relevant sections of the new edition:

Other Reading

Lecture Summaries

5.101 Bayesian Classification

Probabilistic modeling lets us do ML while dealing with the concept of uncertainty. With probabilistic modeling we can output degrees of uncertainty, or probabilities, rather than binary yes/no decisions.

Generally when we talk about probabilistic modeling, we have Bayesian modeling in mind. Bayesian modeling is based on the idea that new evidence can change our decisions: as new facts arrive, our beliefs update.

The basic elements are:

  • The prior probability, the initial degree of belief in some proposition. \(P(x)\)

  • The posterior probability, the belief in some proposition after seeing some evidence. \(P(x|e)\)

    Ultimately we want to calculate the posterior probability of some event given some evidence. But this is hard to do directly: we’d need many examples of the evidence together with the outcomes.

    But we can do so indirectly using a generative model, and a much smaller dataset, using Bayes’ theorem:

    \[Posterior = \frac{Likelihood \cdot Prior}{Marginal \; Probability}\]

    Informally:

    • The prior is “what we believed about the event beforehand”

    • The likelihood is “the likelihood the evidence was generated by the event”

    • The posterior is the “probability of the event given the evidence”

    • The marginal probability ensures the posterior sums to 1 over all possible events.

    Let’s look at some of the basic probability rules that inform the theorem.

    | Rule | Formulation | Comment |
    | --- | --- | --- |
    | Inverse | \(P(\bar{A}) = 1 - P(A)\) | |
    | Conditional | \(P(B \vert A)\) | Probability of B given A |
    | Product Rule | \(P(B,A) = P(B \vert A)P(A) = P(A \vert B)P(B)\) | Symmetric |
    | Sum Rule | \(P(B) = P(B,A) + P(B,\bar{A}) = P(B \vert A)P(A) + P(B \vert \bar{A})P(\bar{A})\) | Marginalizes A |
    | | \(P(B) = \sum_A P(A,B)\) | Alternative: sum over all possible values of A |
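These rules can be checked numerically on a tiny joint distribution. The numbers below are made up for illustration; the point is that the product and sum rules hold for any valid joint.

```python
# A tiny joint distribution P(A, B) over two binary variables,
# used to verify the product and sum rules numerically.
# The probabilities are hypothetical but sum to 1.
joint = {
    (True, True): 0.2,    # P(A=T, B=T)
    (True, False): 0.1,   # P(A=T, B=F)
    (False, True): 0.3,   # P(A=F, B=T)
    (False, False): 0.4,  # P(A=F, B=F)
}

def marginal_A(a):
    # Sum rule: P(A) = sum over B of P(A, B)
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_B(b):
    # Sum rule: P(B) = sum over A of P(A, B)
    return sum(p for (_, bi), p in joint.items() if bi == b)

def conditional_B_given_A(b, a):
    # Conditional probability: P(B|A) = P(A, B) / P(A)
    return joint[(a, b)] / marginal_A(a)

# Product rule: P(B, A) = P(B|A) P(A)
a, b = True, True
assert abs(joint[(a, b)] - conditional_B_given_A(b, a) * marginal_A(a)) < 1e-12

# Sum rule in conditional form: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = (conditional_B_given_A(b, True) * marginal_A(True)
       + conditional_B_given_A(b, False) * marginal_A(False))
assert abs(p_b - marginal_B(True)) < 1e-12
print(p_b)  # 0.5
```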

    So we have:

    \[P(Y,X) = P(X,Y)\]

    \[P(Y|X) \cdot P(X) = P(X|Y) \cdot P(Y)\]

    \[P(Y|X) = \frac{P(X|Y) \cdot P(Y)}{P(X)}\]

    which is Bayes’ theorem.

    Here’s an example to see it in practice.

    You take a test for a rare disease and it’s positive. Oh no! 99% of people with the disease test positive. Only 1% of healthy people test positive. The disease occurs randomly in 1 out of 10,000 people.

    How likely is it that you have the disease?

    We’re trying to predict our likelihood of the disease given the positive test. So we have the following application of Bayes’ theorem:

    \[P(disease|pos) = \frac{P(pos|disease)P(disease)}{P(pos)}\]

    The likelihood is 0.99, the prior is 0.0001, and the marginal is approximately 0.01. So the posterior is \(\frac{0.99 \cdot 0.0001}{0.01} \approx 0.0099\): about a 1% chance of having the disease, despite the positive test.

    The only tricky bit is the marginal probability. This is the probability of a positive result given having the disease multiplied by the probability of having the disease, plus the probability of a positive result without the disease multiplied by the probability of not having the disease.
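The whole calculation can be sketched in a few lines, computing the marginal via the sum rule exactly as described above:

```python
# Bayes' theorem applied to the disease-test example from the notes.
p_disease = 1 / 10_000          # prior P(disease): 1 in 10,000
p_pos_given_disease = 0.99      # likelihood P(pos | disease)
p_pos_given_healthy = 0.01      # false-positive rate P(pos | no disease)

# Marginal via the sum rule:
# P(pos) = P(pos|disease)P(disease) + P(pos|~disease)P(~disease)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior via Bayes' theorem
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 4))  # 0.0098: under 1%, despite the positive test
```

The counterintuitive result comes from the tiny prior: false positives from the large healthy population swamp the true positives.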

    In a classifier, we might dispense with the marginal-probability normalization. Strictly speaking the output is then no longer a probability, but it still lets us combine the prediction with other models (for example, image classifiers predicting a goat or a cat) using a maximum a posteriori (MAP) decision: the model that produces the highest score wins, and we predict that label. This takes the priors into account, unlike a maximum likelihood classifier, which compares only the likelihoods.
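The difference between the two decision rules can be sketched with made-up numbers; the class names and probabilities below are purely hypothetical:

```python
# Sketch of maximum likelihood vs MAP decisions.
# Suppose a classifier scores likelihoods P(x | y) for two classes,
# and the classes have very different priors P(y). Numbers are hypothetical.
likelihood = {"goat": 0.6, "cat": 0.4}   # P(x | y)
prior = {"goat": 0.05, "cat": 0.95}      # P(y)

# Maximum likelihood ignores the prior entirely:
ml_decision = max(likelihood, key=likelihood.get)

# MAP scores are unnormalized posteriors P(x|y) P(y)
# (the marginal P(x) is the same for every class, so we can drop it):
map_scores = {y: likelihood[y] * prior[y] for y in likelihood}
map_decision = max(map_scores, key=map_scores.get)

print(ml_decision, map_decision)  # goat cat
```

Here the likelihood favours "goat", but the strong prior on "cat" flips the MAP decision.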

5.103 The Naive Bayes Classifier

A pure Bayes classifier works well for a small number of features, but becomes more problematic as the number of dimensions grows.

One issue is the dependence of the probabilities, which requires use of the chain rule.

But we can make use of an assumption: that the features are independent. Then we can evaluate them independently. This is the naivety, because the features are unlikely to actually be independent, but in practice the assumption works well.

The lecture also discusses the issue of numerical underflow when multiplying many small probabilities; the solution is to take logs, which turns the product into a sum:

\[P(Y|X_1, X_2,\dots ,X_F) \propto P(Y) \cdot \prod^F_{i=1} P(X_i|Y)\]

\[\log P(Y | X_1, X_2,\dots ,X_F) \propto \log P(Y) + \sum^F_{i=1} \log P(X_i|Y)\]

\[NB = \mathrm{argmax}_Y \left( \log P(Y) + \sum^F_{i=1} \log P(X_i | Y) \right) \]
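The decision rule above can be sketched directly in log space. The features, classes, and probability tables below are hypothetical; in the lab these would be estimated from training data.

```python
import math

# Minimal sketch of the Naive Bayes decision rule in log space,
# assuming binary word features with pre-estimated probability tables.
# All class names and probabilities are hypothetical.
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}

# P(word present | class) for three hypothetical features
likelihoods = {
    "spam": {"offer": 0.8, "meeting": 0.1, "free": 0.7},
    "ham":  {"offer": 0.1, "meeting": 0.6, "free": 0.2},
}

def nb_predict(features):
    # argmax over Y of: log P(Y) + sum_i log P(X_i | Y)
    scores = {}
    for y in log_prior:
        score = log_prior[y]
        for f in features:
            score += math.log(likelihoods[y][f])
        scores[y] = score
    return max(scores, key=scores.get)

print(nb_predict(["offer", "free"]))  # spam
print(nb_predict(["meeting"]))        # ham
```

Summing logs instead of multiplying probabilities avoids underflow when there are many features, and the argmax is unchanged because log is monotonic.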

Naive Bayes is a very simple tool, but it can frequently outperform more complex models. Spam detectors, for example, often still involve NB classification.

Lab Summaries

The lab has you implement a Naive Bayes classifier from scratch.