Alpaydin: Chapter 04: Parametric Methods
Metadata
Title: Parametric Methods
Number: 4
Core Ideas
The previous chapter, Chapter 03: Bayesian Decision Theory, showed how to make decisions when the relevant probabilities are known. Now we focus on how to estimate those probabilities from a given training set.
A statistic is any value that is calculated from a given sample. In statistical inference we make a decision using the information provided by a sample.
Our first approach is parametric: we assume the sample is drawn from a distribution that obeys a known model (e.g., Gaussian). The advantage is that the model is defined up to a small number of parameters, the sufficient statistics of the distribution (for example, the mean and variance of a Gaussian). Once we estimate those parameters, the whole distribution is known.
So we estimate the parameters of the distribution from the given sample data, plug those estimates into the assumed model, and obtain an estimated distribution.
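The plug-in idea can be sketched as follows. This is an illustrative example (the sample values are made up, not from the book): assume a Gaussian model, compute its sufficient statistics from the sample, and plug them into the known density form.

```python
# Plug-in estimation sketch: assume the data come from a Gaussian,
# estimate its two parameters from the sample, then plug them into
# the assumed model to get an estimated density.
import math

sample = [2.1, 1.9, 2.4, 2.0, 1.6]  # hypothetical sample

# Sufficient statistics of the Gaussian: sample mean and variance.
m = sum(sample) / len(sample)
s2 = sum((x - m) ** 2 for x in sample) / len(sample)

def estimated_pdf(x):
    """Estimated distribution: the Gaussian form with m and s2 plugged in."""
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
```

Once `m` and `s2` are computed, `estimated_pdf` is fully specified, which is exactly what "the whole distribution is known" means here.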
To estimate the parameters of a distribution we use Maximum Likelihood Estimation. This chapter covers univariate cases, the next generalizes to multivariate cases.
Maximum Likelihood Estimation
We have an independent and identically distributed (iid) sample: \(X = \{ x^t \}^N_{t=1}\). Assume that the \(x^t\) instances are drawn from some probability density family, \(p(x|\theta )\) defined up to the parameters \(\theta\).
So we want to find the θ that makes sampling the \(x^t\) as likely as possible. Because the observations are independent, the likelihood of parameter θ given sample \(X\) is the product of the likelihoods of the individual points:
\[l(\theta |X) \equiv p(X|\theta) = \prod^N_{t=1}p(x^t|\theta)\]
In maximum likelihood estimation we want to find the θ that makes X the most likely to be drawn, i.e., that maximizes \(l(\theta|X)\). Since log is monotonically increasing, maximizing the log likelihood instead does not change the solution, and it turns the product into a sum, which is easier to differentiate and numerically more stable. So we have the following log likelihood:
\[L(\theta | X) \equiv \log l(\theta | X) = \sum^N_{t=1} \log p(x^t| \theta)\]
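The log likelihood can be computed directly as this sum. A minimal sketch for the Gaussian case (sample values are hypothetical), checking numerically that the sample mean gives a higher log likelihood than nearby candidate means:

```python
# Log likelihood L(theta|X) = sum_t log p(x^t|theta) for a Gaussian model.
import math

sample = [2.1, 1.9, 2.4, 2.0, 1.6]  # hypothetical sample

def log_likelihood(mean, var, data):
    # log of the Gaussian density, summed over the sample
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for x in data
    )

m = sum(sample) / len(sample)
s2 = sum((x - m) ** 2 for x in sample) / len(sample)

# The sample mean beats shifted candidates, consistent with it
# being the maximum likelihood estimate of the Gaussian mean.
best = log_likelihood(m, s2, sample)
```

In the chapter this maximization is done analytically by setting the derivative of \(L\) to zero; the numeric check above just illustrates the same fact.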
The chapter shows how this works with Bernoulli, Multinomial and Gaussian distributions.
For Bernoulli, the estimate is \(\hat{p} = \frac{\sum_t x^t}{N}\), the fraction of successes in the sample.
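As a quick sketch (the sample is made up for illustration), the Bernoulli estimate is just the fraction of 1s:

```python
# Bernoulli MLE sketch: with x^t in {0, 1}, the maximum likelihood
# estimate of p is the sample mean, i.e., the fraction of 1s.
sample = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical coin flips
p_hat = sum(sample) / len(sample)  # 5 successes out of 8
```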
The generalization of the Bernoulli is the multinomial distribution, where each observation is encoded by indicator variables \(x_1, \dots, x_K\) with \(x_i = 1\) if the outcome is class \(i\) and 0 otherwise. The estimate is \(\hat{p}_i = \frac{\sum_t x^t_i}{N}\), the fraction of observations belonging to class \(i\).
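The multinomial case can be sketched the same way (the indicator-encoded sample below is an invented example): for each class, count how often its indicator is 1 and divide by N.

```python
# Multinomial MLE sketch: each observation is a 1-of-K indicator
# vector; p_i_hat is the fraction of observations with x_i = 1.
K = 3
sample = [  # hypothetical sample of N = 4 observations
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
N = len(sample)
p_hat = [sum(x[i] for x in sample) / N for i in range(K)]
```

Note that the estimates sum to 1 by construction, since each observation contributes exactly one indicator.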