Alpaydin: Chapter 03: Bayesian Decision Theory
Metadata
Title: Bayesian Decision Theory
Number: 3
Core Ideas
Programming computers to make inferences from data is a cross between statistics and computer science: the framework for making inferences comes from statistics.
The process by which data is produced is not completely known. Maybe it’s deterministic, but we don’t have access to the complete process. So we model it as random and use probability theory to analyse it.
Analogy with tossing a coin. Maybe if we had complete information about the coin and the environment we could predict the exact outcome of a coin toss. But we don’t. So we can only talk about the probability of the next outcome.
The extra pieces of knowledge that we do not have access to are unobservable variables, call them z. In coin tossing, the only observable variable is the outcome of the toss, call it x. In reality \(x = f(z)\), where f is the deterministic function that defines the outcome from what we cannot observe.
We can’t model the process like that, instead we define the outcome X as a random variable drawn from a probability distribution \(P(X=x)\) that specifies the process. In the coin flip we have a Bernoulli distribution where heads could be defined as success (X=1), and tails failure (X=0).
\(P(X=1)=p_0\) and \(P(X=0)=1-p_0\).
If we have to predict the next outcome, we predict success if \(p_0 > 0.5\) and failure otherwise, to minimise the probability of error.
If we don’t know \(p_0\) and want to estimate it from a sample, we are in the realm of statistics. We have a sample containing examples drawn from the probability distribution of the observables, and the aim is to build an approximator to that distribution using the sample. If you use the numerical outcome 1 for success, the estimate of \(p_0\) is just the number of successes divided by the number of observations.
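A minimal sketch of this estimate-and-predict step; the sample below is made up for illustration:

```python
# Hypothetical sample of coin tosses: 1 = heads (success), 0 = tails.
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Estimate p0 as the fraction of successes in the sample.
p0_hat = sum(sample) / len(sample)  # 0.7

# Predict the next outcome to minimise the probability of error.
prediction = 1 if p0_hat > 0.5 else 0  # 1
```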
Classification
Imagine we’re classifying bank customers into high-risk or low-risk categories. We denote the customer’s credit risk by a Bernoulli random variable C conditioned on the observables X. Let’s say \(X_1\) is annual income and \(X_2\) is savings. C = 1 indicates a high-risk customer, C = 0 indicates low risk.
If we know \(P(C|X_1,X_2)\) we can choose the class based on new observations where \(X_1 = x_1\) and \(X_2 = x_2\). If \(P(C=1|x_1,x_2) > 0.5\) or equivalently \(> P(C=0|x_1,x_2)\) we choose 1, otherwise 0.
The difference from coin tossing is that here we condition the random variable on two other observable variables. If we say \(\mathbf{x}\) is the vector of observed variables, then we need to calculate \(P(C|\mathbf{x})\). By Bayes’ theorem this becomes:
\[P(C|\mathbf{x}) = \frac{P(C)\,p(\mathbf{x}|C)}{p(\mathbf{x})}\]
We can break this down:
The prior probability \(P(C=1)\) is the probability that a customer is high risk regardless of the value of x. It is the knowledge we have of the value of C before looking at the observables x (hence the prior probability).
The class likelihood \(p(\mathbf{x} | C)\) is the conditional probability that an event of class C has the associated observation value \(\mathbf{x}\). In the bank case \(p(x_1,x_2|C=1)\) would be the probability that a high-risk customer has \(X_1 = x_1\) and \(X_2 = x_2\). This is what our data tells us about our classes.
The evidence \(p(\mathbf{x})\) is the marginal probability that an observation x is seen, regardless of what class it falls under. That is:
\[p(\mathbf{x}) = \sum_C p(\mathbf{x},C) = p(\mathbf{x}|C=1)P(C=1) + p(\mathbf{x}|C=0)P(C=0)\]
By combining the prior and what the data tells us, we can calculate the posterior probability of each class after seeing the observation \(\mathbf{x}\). Because the evidence normalises the numerator, the posteriors sum to 1: \(P(C=0|\mathbf{x}) + P(C=1|\mathbf{x}) = 1\).
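The whole calculation for the two-class bank example can be sketched as below; the priors and likelihood values are made-up illustrative numbers, not from the chapter:

```python
# Hypothetical prior: most customers are low risk.
prior_high = 0.2            # P(C=1)
prior_low = 1 - prior_high  # P(C=0)

# Hypothetical class likelihoods p(x|C) for one observation x = (x1, x2).
lik_high = 0.05  # p(x | C=1)
lik_low = 0.40   # p(x | C=0)

# Evidence: p(x) = p(x|C=1)P(C=1) + p(x|C=0)P(C=0)
evidence = lik_high * prior_high + lik_low * prior_low

# Posteriors via Bayes' theorem.
post_high = lik_high * prior_high / evidence
post_low = lik_low * prior_low / evidence

# Choose the class with the larger posterior.
choice = 1 if post_high > post_low else 0
```

Note that `evidence` only normalises; for choosing between the two classes it is enough to compare the numerators \(p(\mathbf{x}|C)P(C)\).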
Risk
The chapter then extends the framework with the idea that different decisions may have different costs attached. We can weight those costs and choose the action that minimises expected risk. See pp. 55-57 for this.
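A hedged sketch of that extension, assuming a loss matrix where `lam[i][k]` is the cost of choosing class i when the true class is k (the loss values and posteriors below are hypothetical):

```python
# Hypothetical asymmetric losses for the bank example:
# accepting a high-risk customer (row 0, true class 1) is much costlier
# than rejecting a low-risk one (row 1, true class 0).
lam = [[0.0, 10.0],   # choose C=0 (low risk)
       [1.0,  0.0]]   # choose C=1 (high risk)

# Hypothetical posteriors for some observation x: [P(C=0|x), P(C=1|x)].
posterior = [0.8, 0.2]

# Expected risk of each choice: R(choose i | x) = sum_k lam[i][k] P(C=k|x)
risks = [sum(l * p for l, p in zip(row, posterior)) for row in lam]
best_choice = min(range(len(risks)), key=lambda i: risks[i])
```

With these numbers the posterior favours low risk, but the asymmetric loss makes rejecting (choosing C=1) the lower-risk action, which is the point of the extension.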
Association Rules
The chapter also extends the framework to learning association rules, for example in basket analysis, where we want to learn dependencies between purchases (chips and beer, etc.).
It introduces the terms used in this analysis and their probabilistic foundations, such as support, confidence and lift, along with efficient algorithms for learning these rules. See pp. 58-61 for this.
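The three measures for a rule \(X \rightarrow Y\) can be illustrated on a toy set of baskets (the data below is made up): support is \(P(X, Y)\), confidence is \(P(Y|X)\), and lift is \(P(X, Y)/\big(P(X)P(Y)\big)\).

```python
# Made-up transaction data for the rule "chips -> beer".
baskets = [
    {"chips", "beer"},
    {"chips", "beer", "soda"},
    {"chips"},
    {"beer"},
    {"soda"},
]
n = len(baskets)

def freq(*items):
    """Fraction of baskets containing all the given items."""
    return sum(1 for b in baskets if set(items) <= b) / n

support = freq("chips", "beer")       # P(chips, beer) = 2/5
confidence = support / freq("chips")  # P(beer | chips) = (2/5) / (3/5)
lift = confidence / freq("beer")      # confidence / P(beer)
```

A lift above 1 means buying chips makes buying beer more likely than its base rate; a lift of 1 would mean the two purchases are independent.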