Alex's Notes

Model Selection and Generalization (Alpaydin)

Section 2.7 of Alpaydin: Chapter 02: Supervised Learning

Take the simple example of learning a boolean function from examples (where all inputs and outputs are binary). With d inputs there are \(2^d\) possible distinct input combinations, each of which the function could map to 0 or 1, so there are \(2^{2^d}\) possible boolean functions of d inputs (these are the hypotheses).

Each distinct training example removes half of the remaining hypotheses. E.g. if \(x_1 = 1\), \(x_2 = 0\), and the function should map to 0, we can eliminate every hypothesis that maps this input to 1. This is one way to think about the learning process: we start with all possible hypotheses and remove those that prove inconsistent with the training data.

If we had every single one of the \(2^d\) distinct examples we could eliminate all but one hypothesis. But typically we only have a small subset, say N examples. Then we are left with \(2^{2^d - N}\) possible functions. We have an ill-posed problem: the data alone is not sufficient to find a unique solution.
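The counting argument above can be checked directly for small d. This is a minimal sketch that enumerates all \(2^{2^d}\) boolean functions of d = 2 inputs as truth tables and eliminates the ones inconsistent with N = 2 made-up training examples; the example inputs and labels are illustrative, not from the book:

```python
from itertools import product

d = 2  # number of binary inputs

# Every boolean function of d inputs is a mapping from the 2^d input
# combinations to {0, 1}; enumerate all 2^(2^d) of them as truth tables.
inputs = list(product([0, 1], repeat=d))
hypotheses = list(product([0, 1], repeat=len(inputs)))
assert len(hypotheses) == 2 ** (2 ** d)  # 16 functions for d = 2

# Observe N distinct training examples and keep only consistent hypotheses.
examples = [((1, 0), 0), ((0, 0), 1)]  # N = 2 labelled examples (made up)
for x, y in examples:
    i = inputs.index(x)
    hypotheses = [h for h in hypotheses if h[i] == y]

# Each distinct example halves the version space: 2^(2^d - N) remain.
print(len(hypotheses))  # 4 = 2^(2^2 - 2)
```

With N = 2 of the 4 possible examples seen, 4 consistent hypotheses remain, matching \(2^{2^d - N}\).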

This occurs in classification and regression too. As we see more examples, we learn more about the underlying function, but we are still left with many consistent hypotheses. We need to make additional assumptions, beyond the data, to arrive at a unique solution with the data we have.

The set of assumptions we make to ensure learning a unique solution is possible is called the inductive bias of the learning algorithm.

In linear regression, assuming a linear function is an inductive bias. Choosing the line that minimizes squared error is another inductive bias.

In classification, assuming the categories fit the shape of a rectangle would be an inductive bias; choosing the rectangle with the largest margin is another one.
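A tiny sketch of the rectangle bias in action: take the tightest axis-aligned rectangle that encloses the positive examples (the most specific consistent hypothesis). The data points here are made up for illustration:

```python
# Tightest axis-aligned rectangle around the positive examples.
# Coordinates are invented purely to illustrate the inductive bias.
positives = [(2.0, 3.0), (4.0, 5.0), (3.0, 4.0)]

x1 = min(x for x, _ in positives)
x2 = max(x for x, _ in positives)
y1 = min(y for _, y in positives)
y2 = max(y for _, y in positives)

def h(x, y):
    """Classify as positive iff the point falls inside the rectangle."""
    return x1 <= x <= x2 and y1 <= y <= y2

print(h(3.0, 3.5))  # True: inside the tightest rectangle
print(h(5.0, 5.0))  # False: outside it
```

The bias is what makes the answer unique: of the many rectangles consistent with the data, this rule always picks the smallest one.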

Each hypothesis class has a certain capacity and can learn only certain functions. We can extend that capacity by using a hypothesis class with greater capacity, i.e. more complex hypotheses. E.g. in regression we can increase the order of the polynomial. In classification we might take the union of two rectangles. But when do we stop?

Learning is not possible without inductive bias, and the question now is how to choose the right bias. This is called model selection: choosing between possible \(H\).

Remember the aim is not to fit the training set perfectly, but to predict accurately on inputs outside it. How well a model does this is called generalization.

For best generalization, we should match the complexity of the hypothesis class H with the complexity of the function underlying the data.

Underfitting

If H is less complex than the function, we have underfitting, for example when trying to fit a line to data sampled from a third order polynomial. In such a case, as we increase the complexity, the training error decreases.

Overfitting

If we have H that is too complex, the data is not enough to constrain it and we may end up with a bad hypothesis, \(h \in H\), for example when fitting two rectangles to data sampled from one rectangle. Or if there is noise, an overcomplex hypothesis may learn not only the underlying function but also the noise in the data and may make a bad fit, for example, when fitting a sixth-order polynomial to noisy data sampled from a third-order polynomial. This is called overfitting. In such a case having more training data helps but only up to a certain point. Given a training set and H, we can find \(h \in H\) that has the minimum training error, but if H is not well chosen no matter which h we pick we won’t have good generalization.
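The polynomial examples above can be sketched numerically. This is an illustrative experiment (the underlying cubic, the noise level, and the candidate orders are all made-up choices, not from the book): fit orders 1, 3, and 6 to noisy data sampled from a third-order polynomial and compare training error with error on held-out points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy data sampled from a third-order polynomial (illustrative choice).
x = np.linspace(-1, 1, 30)
y = x**3 - 0.5 * x + rng.normal(scale=0.1, size=x.shape)

# Held-out points from the same function, to estimate generalization error.
x_val = np.linspace(-0.95, 0.95, 30)
y_val = x_val**3 - 0.5 * x_val + rng.normal(scale=0.1, size=x_val.shape)

errors = {}
for order in (1, 3, 6):  # underfit, match, overfit
    coeffs = np.polyfit(x, y, order)       # least-squares fit of that order
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    errors[order] = (train_err, val_err)
    print(f"order {order}: train {train_err:.4f}  validation {val_err:.4f}")
```

Since the hypothesis classes are nested (every line is also a degree-6 polynomial), training error can only decrease as the order grows; it is the held-out error that reveals the underfit line and the overfit sixth-order polynomial.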

The Triple Trade-Off

In all learning algorithms trained from examples there is a trade-off between three factors:

  • The complexity of the hypothesis we fit to the data, ie the capacity of the hypothesis class.

  • The amount of training data.

  • The generalization error on new examples.

    As training data increases, generalization error decreases.

    As complexity of H increases, generalization error first decreases and then increases.

Measuring Generalization: Cross-Validation

We can measure generalization ability if we have access to data outside the training set.

We use one part of our data for training, the rest is called the validation set and used to test the generalization ability.

Given a set of possible hypothesis classes \(H_i\), for each we fit the best \(h_i \in H_i\) on the training set. The hypothesis that is most accurate on the validation set is the best one (the one with the best inductive bias).

This is called cross-validation.

EG: we want to find the right order for a polynomial regression. For each order we find the coefficients on the training set, calculate their errors on the validation set, and pick the one with the least validation error as the best polynomial.
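That polynomial-order selection can be written out as a short sketch. Everything here is an illustrative assumption (the underlying cubic, the noise, the 40/20 split, the range of candidate orders); only the procedure itself is from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data from a noisy third-order polynomial.
x = rng.uniform(-1, 1, 60)
y = 2 * x**3 - x + rng.normal(scale=0.1, size=x.shape)

# Split: first 40 points for training, remaining 20 for validation.
x_tr, y_tr = x[:40], y[:40]
x_va, y_va = x[40:], y[40:]

# Fit each candidate order on the training set, score on the validation set.
val_errors = {}
for order in range(1, 8):
    coeffs = np.polyfit(x_tr, y_tr, order)
    val_errors[order] = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)

# The order with the least validation error is the selected model.
best = min(val_errors, key=val_errors.get)
print("selected order:", best)
```

Note that the coefficients are chosen by the training set alone; the validation set only arbitrates between the competing hypothesis classes.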

Note that in this process, the validation set becomes part of the training process, so we can't really use it to report the error rate of a model. We need a third set (a test set or publication set) that we reserve, with examples not used in training or validation. We also can't keep reusing the same training/validation split, since after the first use the validation set has effectively become part of the training data.

Remember the training data we use is a random sample: if we collect data again we'll have a different sample and learn a different h. If we divide the same dataset differently into train/validate/test splits we'll also get different h and different error rates. Section 20.6 of the book covers techniques like k-fold cross-validation and bootstrapping to manage the statistical significance of the experiments we run.
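A minimal sketch of the k-fold idea (covered properly in 20.6): rotate which fold is held out for validation and average the validation errors, so the estimate depends less on any single split. The data, degrees, and k = 5 are illustrative assumptions:

```python
import numpy as np

def kfold_mse(x, y, order, k=5):
    """Average validation MSE of a degree-`order` polynomial over k folds."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False                      # hold this fold out
        coeffs = np.polyfit(x[mask], y[mask], order)
        errs.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 50)
y = 2 * x**3 + rng.normal(scale=0.1, size=x.shape)  # noisy cubic (made up)

for order in (1, 3, 6):
    print(f"order {order}: k-fold MSE {kfold_mse(x, y, order):.4f}")
```

Each data point serves in validation exactly once, which uses the sample more efficiently than a single fixed split.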