Alex's Notes

Improving Generalization (Chollet)

Section 4 of Chollet: Chapter 05: Fundamentals of Machine Learning

You first need to ensure your model can overfit; see Improving Model Fit (Chollet).

Then you can focus on maximizing generalization.

Datasets

Generalization in Deep Learning originates from the latent structure in your data. Deep Learning is curve fitting, not magic.

So it’s essential that you’re working with an appropriate dataset. Spending more effort and money on data collection is a much better investment than model development.

  • Make sure you have enough data. You need a dense sampling of the input-cross-output space. Larger datasets can make impossible problems suddenly possible.

  • Minimize labeling errors. Visualize inputs, check for anomalies, proofread your labels.

  • Clean your data and deal with missing values (see ch. 6)

  • If you have many features, do feature selection
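As a quick illustration of the cleaning and feature-selection points above, a minimal sketch (the function name and variance threshold are my own, not from Chollet) that flags feature columns with missing values or near-zero variance:

```python
import numpy as np

def audit_features(X, variance_threshold=1e-8):
    """Flag columns that contain missing values or are (near-)constant."""
    X = np.asarray(X, dtype=float)
    has_nan = np.isnan(X).any(axis=0)                    # columns with missing values
    low_var = np.nanvar(X, axis=0) < variance_threshold  # uninformative columns
    return has_nan, low_var
```

Columns flagged by either check are candidates for imputation or removal before training.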

For most ML problems, feature engineering is a key ingredient.

Feature Engineering

Feature engineering is the process of using your own knowledge about the data and about the machine learning algorithm at hand… to make the algorithm work better by applying hardcoded (non-learned) transformations to the data before it goes into the model. In many cases, it isn’t reasonable to expect a machine learning model to be able to learn from completely arbitrary data. The data needs to be presented to the model in a way that will make the model’s job easier.

To take a toy example: say you have images of a clock and you want to train a model to tell the time of day from them. If you start from the raw pixels you have a difficult ML problem on your hands, and you’ll need a CNN to solve it, which will be very expensive to train.

But you already understand the problem at a high level: you know how humans read a clock face. You could write a script to extract the coordinates of the tip of each hand, or better still the angle of each hand. At that point you don’t even need machine learning at all: you can just look up the answer in a table after rounding.
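The hand-angle idea can be sketched in a few lines. Here I assume the angles (in degrees, clockwise from 12) have already been extracted from the image, and represent 12 o’clock as hour 0:

```python
def angles_to_time(hour_angle, minute_angle):
    """Convert clock-hand angles (degrees clockwise from 12) to (hour, minute)."""
    minute = round(minute_angle / 360 * 60) % 60
    # The hour hand moves 30 degrees per hour; integer division drops the
    # fractional advance contributed by the elapsed minutes.
    hour = int(hour_angle // 30) % 12
    return hour, minute
```

For example, an hour hand at 90° and a minute hand at 0° gives (3, 0): three o’clock, no model required.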

This is the essence of feature engineering: making a problem easier by expressing it in a simpler way. Make the latent manifold smoother, simpler, better organized. Doing so requires understanding the problem in depth.

Before DL, feature engineering was the most important part of the ML workflow as shallow algorithms don’t have hypothesis spaces rich enough to learn useful features by themselves.

Modern DL removes the need for most feature engineering, because NNs are capable of automatically extracting useful features from raw data. But you still need to be aware of it. Good features still allow you to solve problems more elegantly, with fewer resources (please don’t train a CNN to read a clock face). They also let you solve problems with less data.

Early Stopping

In DL we always use models that are over-parameterized: they have way more degrees of freedom than they need to fit the data manifold. This isn’t an issue because you never fully fit a model. You always interrupt training before you reach the minimum training loss.

Finding the exact point during training where you’ve reached the most generalizable fit - the exact boundary between an underfit curve and an overfit curve - is one of the most effective things you can do to improve generalization.

You could train your models for longer than needed, figure out the best number of epochs, and then train the model again for that number of epochs. But that involves redundant work. Better to save your model at the end of each epoch and interrupt training when the validation metrics have stopped improving, remembering the best known model state. This is done in Keras with the EarlyStopping callback.
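The save-and-interrupt logic that Keras’s EarlyStopping callback implements can be sketched in plain Python. Everything here (the run_epoch callable, the patience default) is an illustration of the idea, not the Keras API:

```python
def train_with_early_stopping(run_epoch, max_epochs, patience=3):
    """Run run_epoch(epoch) -> (val_loss, model_state) until the validation
    loss stops improving for `patience` epochs; return the best state seen."""
    best_loss, best_state, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        val_loss, state = run_epoch(epoch)
        if val_loss < best_loss:
            best_loss, best_state, wait = val_loss, state, 0  # new best: remember it
        else:
            wait += 1
            if wait >= patience:  # no improvement for `patience` epochs in a row
                break
    return best_state, best_loss
```

In Keras the equivalent is passing callbacks.EarlyStopping(monitor="val_loss", patience=..., restore_best_weights=True) to model.fit().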

Regularization

Regularization techniques are a set of best practices that actively impede the model’s ability to fit perfectly to the training data. It’s called regularizing the model because it makes the model more “regular”, simpler, with a smoother curve, more generic.

It should always be guided by an accurate evaluation procedure. You can only achieve generalization if you can measure it. Here are some common techniques:

Reduce the Network’s Size

A model that’s too small will not overfit. The simplest way to mitigate overfitting is to reduce the model size (ie the number of learnable parameters in the model, determined by the number of layers and the number of units per layer). If the model has limited memorization resources, it won’t be able to simply remember its training set.

To minimize its loss, the model will have to learn compressed representations with predictive power - exactly what we want it to do. There’s a compromise between too much capacity and not enough here.

There’s no magic formula for this, you have to experiment and evaluate on an array of different architectures to find the correct model size. Generally we start with relatively few layers and parameters, and increase the size of the layers or add new ones until we see diminishing returns.

You know your model is too large if it starts overfitting straight away and its validation loss curve looks choppy, with high variance. (Choppy validation metrics could also be a sign that your validation split is too small, or that your validation process is otherwise unreliable.) The more capacity the model has, the more quickly it can model the training data (lowering training loss), but the more susceptible it is to overfitting.
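When experimenting with model size, it helps to know how the parameter count grows with depth and width. For a stack of fully connected layers it can be computed directly (a sketch; layer_sizes is assumed to be [input_dim, hidden..., output]):

```python
def dense_param_count(layer_sizes):
    """Learnable parameters (weights + biases) in a stack of Dense layers."""
    return sum((fan_in + 1) * fan_out  # +1 accounts for each unit's bias
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))
```

Note that doubling the width of every hidden layer roughly quadruples the hidden-to-hidden weight count, which is why capacity grows quickly as you widen layers.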

Adding Weight Regularization

Occam’s razor applies to the models learned by NNs. Given some training data and a network architecture, multiple sets of weight values (ie multiple models) could explain the data. Simpler models are less likely to overfit than complex ones.

A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters).

So we might put constraints on the complexity of a model by forcing its weights to take only small values, making the distribution of the weight values more regular. This is weight regularization. We do this by adding to the loss function of the model a cost associated with having large weights. There are two flavours of this cost:

  • L1 regularization - The cost is proportional to the absolute value of the weight coefficients. This is the L1 norm of the weights.

  • L2 regularization - The cost is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights). Also called weight decay in the context of NNs; the two terms mean the same thing.

The penalty is only added at training time, so the model’s training loss will be higher than its test loss.
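A minimal sketch of how the two penalties enter the loss at training time (the function and its signature are illustrative, not a framework API):

```python
import numpy as np

def regularized_loss(base_loss, weights, l1=0.0, l2=0.0):
    """Training loss plus L1 and/or L2 weight penalties."""
    w = np.concatenate([np.ravel(m) for m in weights])  # all weights, flattened
    return base_loss + l1 * np.abs(w).sum() + l2 * np.square(w).sum()
```

In Keras you would instead pass e.g. kernel_regularizer=regularizers.l2(0.002) when constructing a layer, and the penalty is added to the loss for you.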

Weight regularization is more typically used for smaller DL models. Large models are so overparameterized that imposing weight constraints doesn’t have much impact; for those we prefer dropout.

Dropout

Developed by Geoff Hinton and colleagues, this is one of the most effective and widely used regularization techniques for NNs.

Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training.

The dropout rate is the fraction of the features that are zeroed out, usually between 0.2 and 0.5. At test time no units are dropped out; instead the layer’s output values are scaled down by the fraction of units that were kept (1 - rate), balancing the fact that more units are active than at training time.

So if a layer would normally output [0.2, 0.5, 1.3, 0.8, 1.1], after dropout it will have a few zero entries distributed at random, like [0, 0.5, 1.3, 0, 1.1].

In practice we often just leave the output unchanged at test time and apply both operations at training time instead (“inverted” dropout): zero out units and scale the surviving outputs up by dividing them by the keep fraction (1 - rate).
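The inverted variant can be sketched in a couple of lines of NumPy (a toy version applied to a single activation array; real frameworks do this per layer inside the training step):

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units during training and
    scale the survivors up by 1/(1 - rate), so test time needs no rescaling."""
    mask = rng.random(x.shape) >= rate  # True for units that are kept
    return x * mask / (1.0 - rate)
```

In Keras this is simply a layers.Dropout(rate) layer, which is active only during training.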

Why does this work? The inspiration came to Hinton from the random moving of bank tellers to prevent conspiracy to defraud. The core idea is that introducing noise in the output values of a layer can break up chance patterns that aren’t significant.