Alex's Notes

Chollet: Chapter 09: Advanced DL for Computer Vision

Metadata

Core Ideas

Chapter 8 introduced convnets in the context of binary image classification, but computer vision goes well beyond that. There are three essential computer vision tasks you should know:

  • Image Classification The goal is to assign one or more labels to an image. It can be single-label or multi-label classification. E.g. the Google Photos app has a large multi-label classification model behind it, with over 20,000 labels, trained on millions of images.

  • Image Segmentation The goal is to segment or partition an image into different areas, with each area representing a category. E.g. if you use Zoom with a virtual backdrop, an image segmentation model separates your face from the background with pixel precision.

  • Object Detection The goal is to draw rectangles (bounding boxes) around objects of interest in an image, associating each rectangle with a class. A self-driving car would use this to monitor cars, pedestrians, signs, etc.

    There are other, more niche tasks, like image similarity scoring, keypoint detection, pose estimation, and 3D mesh estimation. But most applications boil down to one of these three.

    The chapter focuses on image segmentation. Object detection is more complex, so it points you to a Keras example for more on that.

Image Segmentation

Image segmentation with deep learning is about using a model to assign a class to each pixel in an image, thus segmenting the image into different zones (such as “background”/“foreground”, or “road”, “car”, “sidewalk”).

This powers a wide range of valuable applications, from image and video editing to autonomous driving, robotics, medical imaging etc.

There are two main flavours of image segmentation:

  • Semantic segmentation, where each pixel is independently classified into a semantic category like “cat”. If there are two cats in the image the corresponding pixels are all mapped to the generic “cat” category.

  • Instance segmentation, where the goal is not only to classify image pixels by category, but to parse out individual object instances. So with an image of two cats, “cat1” and “cat2” would be two separate classes of pixels.

    The chapter walks through a semantic segmentation example using the Oxford-IIIT pets dataset.

    It introduces the idea of a segmentation mask, which is the image segmentation equivalent of a label. It’s an image the same size as the input, with a single colour channel where each integer value corresponds to the class of the corresponding pixel in the input. For example, in the case study there are three integer values: 1 (foreground), 2 (background), or 3 (contour).
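The mask-as-label idea can be made concrete with a small numpy sketch (the sizes and region layout here are invented for illustration; only the three-integer convention comes from the case study):

```python
import numpy as np

# Hypothetical example: a 200x200 input image gets a 200x200x1 target mask.
# The three integer classes follow the convention described above:
# 1 = foreground, 2 = background, 3 = contour.
height, width = 200, 200
mask = np.full((height, width, 1), 2, dtype=np.uint8)  # start as all background
mask[50:150, 50:150, 0] = 1                            # a square "foreground" region
mask[50:150, 49:50, 0] = 3                             # a sliver of contour pixels

# A valid mask contains only the class integers, nothing else.
assert set(np.unique(mask).tolist()) == {1, 2, 3}
print(mask.shape)  # (200, 200, 1) -- same spatial size as the input image
```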

    pp. 244-5 walk through the model structure. There are a couple of big differences from the ch. 8 classification architecture:

    • We no longer use MaxPooling2D layers to downsample. Instead, every other convolution layer has a stride of 2. This is because we really care about the spatial location of information in the image for segmentation tasks, and max pooling throws that away within the pooling window. We tend to prefer strides over max pooling for tasks where we care about feature location (e.g. generative models).

    • There is a whole second half of the model that is a stack of Conv2DTranspose layers, which essentially upsample the output of the preceding convolution stack. Remember, we want the output image to be the same size as the input. So the first stack of convolutions extracts the image’s features by downsampling, and the second stack learns to upsample the resulting information back to the original size.
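The downsample/upsample symmetry comes down to simple size arithmetic. A sketch (not the book's code): with "same" padding, a stride-2 convolution halves each spatial dimension (rounding up), and a stride-2 transposed convolution doubles it, so a matched stack of each restores the input resolution.

```python
import math

def conv_same_out(size: int, stride: int = 2) -> int:
    """Output size of a convolution with 'same' padding."""
    return math.ceil(size / stride)

def conv_transpose_same_out(size: int, stride: int = 2) -> int:
    """Output size of a transposed convolution with 'same' padding."""
    return size * stride

size = 200                      # e.g. a 200x200 input image
for _ in range(3):              # three stride-2 convs downsample...
    size = conv_same_out(size)
print(size)                     # 25
for _ in range(3):              # ...three stride-2 transposed convs restore it
    size = conv_transpose_same_out(size)
print(size)                     # 200 -- back to the input resolution
```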

Modern Architecture Patterns

So far we’ve been working with toy examples and model architectures to get a sense of the fundamentals. But in the real world we rely on architecture patterns to build more complex, state-of-the-art models. The chapter then introduces Convnet Architecture Patterns.
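From memory, one of the patterns the chapter covers is the residual connection: adding a block's input back to its output so information (and gradients) have a shortcut path through deep stacks. A toy numpy sketch of just the idea, with a linear layer standing in for real convolutions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w):
    """Toy residual connection: the block's input is added back to its output.

    Here the 'layer' is just a linear map plus ReLU; in a real convnet it
    would be one or more convolution layers.
    """
    return x + relu(x @ w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # a batch of 4 feature vectors
w = np.zeros((8, 8))             # a degenerate layer that outputs all zeros

# Even when the layer itself contributes nothing, the input passes through
# unchanged -- the shortcut is what keeps very deep stacks trainable.
assert np.allclose(residual_block(x, w), x)
```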

Interpreting Convnets

DL is often criticised as producing black-box models. But convnets are an exception: the representations they learn can be easily visualized.

There are three very accessible and useful methods in particular:

Visualizing intermediate convnet outputs (intermediate activations).

Useful for understanding how successive layers transform their input, and getting a first idea of the meaning of individual filters.

pp. 262-268 walk through an example.

The output of a layer is often called its activation: the output of its activation function. In a convnet, each channel of the output encodes relatively independent features, so we plot each channel separately on a 2D grid.
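The per-channel grid plotting can be sketched in numpy (a simplified stand-in for the plotting helper the book builds; the function name and sizes are invented):

```python
import numpy as np

def channels_to_grid(activation, n_cols):
    """Tile the channels of one layer activation (H, W, C) into a 2D grid.

    Each channel becomes one tile; tiles are laid out n_cols per row, which
    is how intermediate activations are usually displayed.
    """
    h, w, c = activation.shape
    n_rows = -(-c // n_cols)                      # ceil division
    grid = np.zeros((n_rows * h, n_cols * w))
    for i in range(c):
        row, col = divmod(i, n_cols)
        grid[row * h:(row + 1) * h, col * w:(col + 1) * w] = activation[:, :, i]
    return grid

act = np.random.default_rng(0).normal(size=(26, 26, 32))  # e.g. one layer's output
grid = channels_to_grid(act, n_cols=8)
print(grid.shape)  # (104, 208): 4 rows x 8 cols of 26x26 tiles
```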

The example shows a few things:

  • The first layer acts as a collection of edge detectors; its activations retain almost all of the information in the input image.

  • As you go deeper, representations become increasingly abstract and less interpretable. Deeper representations carry less information about the visual content, and more information related to the class of the image.

  • The sparsity of activations increases with the depth of the layer. In the first layer, almost all filters are activated, but in the following layers more and more filters are blank, i.e. the pattern encoded by the filter isn’t found in the input.

    This is an important general principle. A DL model acts as an information distillation pipeline, with raw data going in and being repeatedly transformed so that irrelevant information is filtered out and useful information is magnified and refined. This is analogous to how humans process information.

Visualizing convnet filters.

Useful for understanding precisely what visual pattern or concept each filter in a convnet is receptive to.

The second approach is to display the visual pattern that each filter is meant to respond to. We do this by gradient ascent in input space: applying gradient ascent to the value of the input image of a convnet so as to maximize the response of a particular filter.
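A toy numpy sketch of gradient ascent in input space. The real technique uses a convnet and autodiff; here a linear "filter response" with an analytic gradient stands in for both, so everything in this snippet is an invented stand-in, but the update loop is the actual idea:

```python
import numpy as np

rng = np.random.default_rng(0)
filt = rng.normal(size=(5, 5))   # stand-in for the filter's preferred pattern
img = rng.normal(size=(5, 5))    # start from a random input "image"

def response(img):
    """Toy filter response: how strongly the image matches the pattern."""
    return float(np.sum(img * filt))

before = response(img)
lr = 0.1
for _ in range(20):
    grad = filt                  # gradient of the linear response w.r.t. img
    img = img + lr * grad        # gradient ASCENT: step to increase the response

# The input image has been nudged toward the pattern the filter responds to.
assert response(img) > before
```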

The resulting visualizations reveal a lot about how the convnet’s filters see the world. See pp. 268-73 for a worked example and some visualized filters.

Visualizing heatmaps of class activation in an image.

Useful for understanding which parts of an image were identified as belonging to a given class.

The final approach is really useful for seeing how the classification decision was made. It generates a heatmap for each class over an image, showing e.g. which ‘cat-like’ parts were found and which ‘dog-like’ parts were found.
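The book's version of this is the Grad-CAM technique, whose core arithmetic can be sketched in numpy: weight each channel of the last conv layer's activation by the mean gradient of the class score with respect to that channel, sum the weighted channels, and keep only the positive part. (A real implementation obtains `grads` via autodiff; here both inputs are random placeholders.)

```python
import numpy as np

def gradcam_heatmap(activation, grads):
    """Core Grad-CAM arithmetic (a sketch).

    activation: last conv layer output, shape (H, W, C)
    grads:      gradient of the class score w.r.t. that activation, same shape
    """
    weights = grads.mean(axis=(0, 1))              # one importance weight per channel
    heatmap = (activation * weights).sum(axis=-1)  # channel-weighted sum
    return np.maximum(heatmap, 0)                  # keep only positive evidence

rng = np.random.default_rng(0)
act = rng.normal(size=(7, 7, 16))
grads = rng.normal(size=(7, 7, 16))
heatmap = gradcam_heatmap(act, grads)
print(heatmap.shape)  # (7, 7): one class-relevance score per spatial location
```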

See pp. 273-9 for an example with elephants!