Chollet: Chapter 08: Intro to DL for Computer Vision
Metadata
Title: Introduction to Deep Learning for Computer Vision
Number: 8
Core Ideas
Computer Vision is the problem domain that led to the modern rise of DL, starting around 2011. This chapter introduces convolutional neural networks, also known as convnets, the type of DL model that is now used almost universally in vision applications.
Convolutional Neural Networks
The chapter first introduces convolutional neural networks.
Training a Convnet from Scratch
The chapter then walks through training a convnet from scratch.
It looks at what counts as ‘enough data’: this depends on the problem you’re solving. If your problem is complex you need more data. Convnets are data-efficient, though, and you can also look at pre-trained models.
Kaggle offers an API to access data; p. 212 walks through setting up a Kaggle dataset on Google Colab.
p. 215 then walks through building the model. It’s a slightly extended version of the demo convnet shown earlier.
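A minimal sketch of what such a convnet can look like, assuming 180×180 RGB inputs and a binary label; the layer counts and sizes here are illustrative, not the book’s exact model:

```python
# Minimal convnet sketch for binary image classification.
# Layer counts and sizes are illustrative, not the book's exact model.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(180, 180, 3))
x = layers.Rescaling(1.0 / 255)(inputs)             # scale pixels to [0, 1]
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # one unit: class probability
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The alternating Conv2D/MaxPooling2D stack is the pattern the earlier demo convnet follows; the book’s p. 215 model is a deeper version of the same idea.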
Along the way it introduces TensorFlow Dataset objects (tf.data) as an efficient way of creating data pipelines for model training.
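A small sketch of such a pipeline, using random in-memory arrays as stand-ins for real decoded images (the book builds its datasets from image files instead):

```python
import numpy as np
import tensorflow as tf

# Random arrays as stand-ins for decoded images and binary labels.
images = np.random.rand(100, 180, 180, 3).astype("float32")
labels = np.random.randint(0, 2, size=(100,))

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.shuffle(buffer_size=100)    # randomize sample order
dataset = dataset.batch(32)                   # group into batches of 32
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # overlap data prep with training

for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape)  # (32, 180, 180, 3)
```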
Data Augmentation
Overfitting is caused by having too few samples to learn from, so you can’t train a model to generalize.
Data augmentation takes the approach of generating more training data from existing training samples by augmenting the samples via a number of random transformations that yield believable-looking images. This exposes the model to more aspects of the data so it can generalize better.
In keras you can add data augmentation layers at the start of the model to do this. The book shows an example of building a sequential model that chains several random image transformations: horizontal flipping, rotation, and zooming:
from tensorflow import keras
from tensorflow.keras import layers

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),  # randomly flip images left-right
        layers.RandomRotation(0.1),       # rotate by up to ±10% of a full circle
        layers.RandomZoom(0.2),           # zoom in or out by up to 20%
    ]
)
Note that augmentation layers are inactive during inference (like dropout layers).
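This behavior can be checked directly: the layers pass inputs through unchanged when called with training=False (the inference default), and only transform them when training=True (random stand-in images, not the book’s data):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.2),
    ]
)

images = np.random.rand(4, 180, 180, 3).astype("float32")  # stand-in images
# training=False (the inference default): inputs pass through unchanged.
unchanged = data_augmentation(images, training=False)
# training=True: fresh random transformations on every call.
augmented = data_augmentation(images, training=True)
```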
Pretrained Models
A common and highly effective approach to deep learning with small datasets is to use a pretrained model. A pretrained model was previously trained on a large dataset (in images typically for classification). If the dataset is large and general enough, the spatial hierarchy of features learned by the model during training can serve as a generic model of the visual world.
For example, you might use a model trained on ImageNet (animals and everyday objects) to identify furniture items in images.
There are two ways to leverage pretrained models, feature extraction and fine-tuning.
Feature Extraction
Feature extraction consists of using the representations learned by a previously trained model to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.
The method uses the convolutional base of the model, the layers before the densely connected classifier. In the case of convnets we run the new data through the convolutional base of a pre-trained model, and then train a new classifier on top of the output.
Why reuse only the base, and not the classifier too? Reusing the classifier should be avoided. The reason is that the representations learned by the convolutional base are likely more generic and therefore more reusable. The feature maps learned by the convolutional layers are presence maps of generic features. But the representations learned by the classifier will be specific to the set of classes on which the model was trained; they contain only information about the presence probability of this or that class. They also discard the notion of space, whereas object location is still described in the convolutional feature maps.
Note that the level of generality and reusability of the representations extracted depends on the depth of the layer in the model. Layers that come earlier in the model extract local, highly generic feature maps (like edges, colours, textures). Layers that are higher up extract more abstract features (like ‘cat ear’). So if your new dataset differs a lot from the source dataset, you may be better off only using the first few layers of the model to do feature extraction, rather than the whole base.
Several pretrained models are available in keras.applications (see the docs); p. 227 walks through setting one up.
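A sketch of instantiating one such model. Here weights=None is used only to avoid downloading the ImageNet weights; in real use you would pass weights="imagenet":

```python
from tensorflow import keras

# weights=None avoids downloading ImageNet weights for this sketch;
# in real use you would pass weights="imagenet".
conv_base = keras.applications.VGG16(
    weights=None,
    include_top=False,     # drop the densely connected classifier
    input_shape=(180, 180, 3),
)
# Early layers hold generic features; later ones hold abstract features.
print(conv_base.layers[1].name)   # block1_conv1
print(conv_base.layers[-2].name)  # block5_conv3
```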
Once you’ve instantiated a pre-trained model there are a couple of options:
Run the convolutional base over the dataset, record its output to a NumPy array on disk, then use that data as input to a stand-alone, densely connected classifier. This is fast and cheap to run, as you only run the convolutional base once for every image, and the base is the expensive bit of the pipeline. But we can’t use data augmentation with this method.
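The first option might be sketched like this, with random stand-in images and weights=None keeping the sketch download-free (use weights="imagenet" in practice):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# weights=None keeps the sketch download-free; use weights="imagenet" in practice.
conv_base = keras.applications.VGG16(
    weights=None, include_top=False, input_shape=(180, 180, 3)
)

# Random stand-ins for a (small) dataset; real code would loop over all batches.
images = np.random.rand(8, 180, 180, 3).astype("float32")
features = conv_base.predict(images)  # run the expensive base once per image
# features could now be saved to disk, e.g. np.save("features.npy", features)

# Stand-alone densely connected classifier trained on the extracted features.
classifier = keras.Sequential(
    [
        keras.Input(shape=features.shape[1:]),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ]
)
classifier.compile(optimizer="rmsprop", loss="binary_crossentropy")
```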
Extend the model we have by adding Dense Layers on top, and run the whole thing end to end on the input data. Then we can use data augmentation, as every input image goes through the convolutional base every time it’s seen. But this is more expensive.
P. 229 shows the first process - fast feature extraction without data augmentation, and then training a separate classifier. The main problem is the overfitting that results from lack of data augmentation.
P. 231 shows the second process - feature extraction with data augmentation. To do this you have to first freeze the convolutional base, ie prevent its weights from being updated during training. Otherwise the representations carefully learned on the large dataset will be destroyed by the backpropagation of the dense layer weight updates. To freeze a layer in keras, set its trainable attribute to False. You can then chain together a data augmentation stage, the frozen convolutional base, and the dense classifier to create an end-to-end model with data augmentation. This is much slower, and really only practical to train on a GPU, but it’s likely to bump the performance.
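Chained together, an end-to-end sketch might look like this (again with weights=None standing in for real pretrained weights):

```python
from tensorflow import keras
from tensorflow.keras import layers

# weights=None is a download-free stand-in for weights="imagenet".
conv_base = keras.applications.VGG16(weights=None, include_top=False)
conv_base.trainable = False  # freeze: the base's weights won't be updated

data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.2),
    ]
)

inputs = keras.Input(shape=(180, 180, 3))
x = data_augmentation(inputs)   # augmentation stage
x = conv_base(x)                # frozen convolutional base
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy")
```

Only the two Dense layers contribute trainable weights here; every input image passes through the base on every epoch, which is what makes this version slow but augmentation-friendly.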
Fine-Tuning
Fine-tuning is complementary to feature extraction, and widely used. It consists of unfreezing a few of the top layers of the convolutional base, then jointly training the new part of the model (eg the classifier) and these top layers. It’s called fine-tuning because it only slightly adjusts the more abstract representations of the model to make them more relevant to the task at hand.
It’s only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained. If the classifier isn’t already trained, the error signal propagating through the network during training will be too large, and the representations learned by the layers being fine-tuned will be destroyed. So the steps to fine-tune are:
Add the custom network on top of an already-trained base.
Freeze the base network.
Train the part we added.
Unfreeze some layers in the base network.
Jointly train these layers and the part we added.
The chapter demos unfreezing the last three convolution layers. Why not more? Or the whole thing? Two main reasons. First, the early layers learn very generic features, so it’s more useful to fine-tune the later layers that encode specialized features; you see fast-diminishing returns in tuning lower layers. Second, the more parameters you train, the more you risk overfitting.
P. 236 walks through unfreezing the top layers and fine-tuning the model.
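One way to do the unfreezing step is a loop over the base’s layers. A sketch, assuming the VGG16 base from keras.applications (weights=None stands in for real pretrained weights; in practice you would then recompile the full model with a very low learning rate, eg RMSprop at 1e-5, so the updates stay small):

```python
from tensorflow import keras

# weights=None keeps the sketch download-free; use weights="imagenet" in practice.
conv_base = keras.applications.VGG16(weights=None, include_top=False)
conv_base.trainable = True

# Freeze everything up to block5_conv1, leaving the last three
# convolution layers (block5_conv1..block5_conv3) trainable.
set_trainable = False
for layer in conv_base.layers:
    if layer.name == "block5_conv1":
        set_trainable = True
    layer.trainable = set_trainable
```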