Alpaydin: Chapter 02: Supervised Learning
Metadata
Title: Supervised Learning
Number: 2
Core Ideas
Alpaydin introduces supervised learning through the two core cases: learning a class from examples (classification) and regression. He then moves on to discuss Model Selection and Generalization (2.7).
Summary
The chapter then summarizes the core elements of a supervised ML algorithm:
We have a sample:
\[X = \{x^t,r^t\}_{t=1}^N\]
The sample is independent and identically distributed (iid): ordering is not important, and all instances are drawn from the same joint distribution \(p(x,r)\). Here t indexes one of the N instances, \(x^t\) is the arbitrary-dimensional input, and \(r^t\) is the associated desired output. \(r^t\) might be a real value for regression, 0/1 for two-class learning, or a k-dimensional binary vector for k-class classification.
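As a concrete illustration, an iid regression sample can be generated by drawing each instance independently from the same distribution. The underlying function f and the noise level below are assumptions chosen purely for illustration, not from the text:

```python
import random

# Hypothetical "unknown" function that generates the data in noisy form.
def f(x):
    return 2.0 * x + 1.0

random.seed(0)
N = 20

# X = {(x^t, r^t)}, t = 1..N: each instance is drawn independently from the
# same joint distribution p(x, r) -- x uniform on [0, 1], r = f(x) plus
# Gaussian noise. Ordering of the pairs carries no information.
X = [(x, f(x) + random.gauss(0.0, 0.1)) for x in (random.random() for _ in range(N))]
```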
We want to build a good and useful approximation to \(r^t\) using the model \(g(x^t|\theta )\). There are three decisions to make:
The model
We need to choose the model we use in learning, denoted as \(g(x|\theta )\), where \(g(\cdot )\) is the model, x is the input, \(\theta\) are the parameters.
\(g(\cdot )\) defines the hypothesis class H. A particular value of \(\theta\) instantiates one hypothesis \(h \in H\). In class learning we might take a rectangle as our model and a set of four coordinates as \(\theta\). In linear regression the model is a linear function of the input, and the parameters are the slope and intercept learned from the data.
The model itself, also called the inductive bias, H, is usually chosen by the designer based on their knowledge of the application. The hypothesis h is chosen (its parameters are tuned) by a learning algorithm using the training set, which is sampled from \(p(x,r)\).
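The rectangle example above can be sketched directly: \(g(x|\theta)\) predicts 1 if a 2-d input falls inside an axis-aligned rectangle, and each choice of the four coordinates \(\theta\) instantiates one hypothesis from the class. A minimal sketch:

```python
# g(x | theta) for the rectangle hypothesis class from class learning:
# theta = (x1, x2, y1, y2) gives the rectangle's corners, and each
# particular theta picks out one hypothesis h in H.
def g(x, theta):
    x1, x2, y1, y2 = theta
    return 1 if (x1 <= x[0] <= x2) and (y1 <= x[1] <= y2) else 0

h = (0.0, 2.0, 0.0, 1.0)   # one particular hypothesis
g((1.0, 0.5), h)  # -> 1, inside the rectangle
g((3.0, 0.5), h)  # -> 0, outside the rectangle
```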
The loss function
The loss function \(L(\cdot )\) computes the difference between the desired output \(r^t\) and the approximation produced by \(g(x^t|\theta )\), given the current values of \(\theta\).
The loss or approximation error is the sum of losses over the individual instances. In class learning \(L(\cdot )\) checks for equality. In regression we have ordering information and might choose, for example, the square of the difference. We can state the total loss as:
\[E(\theta | X) = \sum_t L(r^t, g(x^t | \theta))\]
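The total error above can be sketched as a sum of per-instance losses, with the two loss functions the text mentions: an equality check for class learning and the squared difference for regression. The linear model and tiny sample below are hypothetical, for illustration only:

```python
# E(theta | X) = sum_t L(r^t, g(x^t | theta)), for any model g and loss L.
# X is a list of (x, r) pairs.
def total_error(X, g, theta, loss):
    return sum(loss(r, g(x, theta)) for x, r in X)

zero_one = lambda r, y: 0 if r == y else 1   # class learning: checks equality
squared  = lambda r, y: (r - y) ** 2         # regression: uses ordering information

# Hypothetical linear model g(x | theta) = theta[0]*x + theta[1] on a tiny sample.
lin = lambda x, theta: theta[0] * x + theta[1]
X = [(0.0, 1.0), (1.0, 3.0)]
total_error(X, lin, (2.0, 1.0), squared)  # perfect fit -> 0.0
```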
The optimization procedure
The optimization procedure finds the \(\theta^*\) that minimizes the total error:
\[\theta^* = \mathrm{argmin}_\theta\, E(\theta | X)\]
In polynomial regression we can solve analytically for the optimum. But we are not always so lucky; with other models and error functions, the complexity of the optimization problem becomes important. Is there a global optimum, or multiple local minima?
For this to work well we need to satisfy three conditions:
The hypothesis class of \(g(\cdot )\) should be large enough to include the unknown function that generated the data in X in noisy form.
There should be enough training data to allow us to pinpoint the correct (or good enough) hypothesis from the hypothesis class.
We should have a good optimization method that finds the correct hypothesis given the training data.
Different machine learning algorithms are defined by the different models (inductive bias, hypothesis class) they choose; the loss measures they employ; and the optimization procedures they use.