Alex's Notes

Convnet Architecture Patterns

As presented in Chollet: Chapter 9: Advanced DL for Computer Vision

A model’s “architecture” is the sum of the choices that went into creating it: which layers to use, how to configure them, and how to arrange their connections. These choices define the hypothesis space of your model, the space of possible functions that gradient descent can search over, parameterized by the model weights.

Like feature engineering, a good hypothesis space encodes prior knowledge about the problem and solution. In order to learn effectively from data you have to make assumptions about what you’re looking for and how you’ll find it.

A good model architecture is one that makes it easier to converge to a good point in the search space, for example by reducing the size of the search space. You have to make the problem easier for gradient descent to solve. It’s more of an art than a science, with intuition coming from experience. But there are some good practices introduced here, starting with a general approach to system architecture, and then looking at specific best practices.

Modularity, Hierarchy, Reuse

At the heart of organizing any complex system is the MHR formula: modularity, hierarchy, reuse. To make a complex system simpler, you structure your complex soup into modules, organize the modules into a hierarchy, and reuse the same modules in multiple places as needed (reuse here is another word for abstraction).

Deep learning itself is the application of this formula to optimization by gradient descent. You structure the search space into modules (layers), organized into a hierarchy, which you reuse when you can.

DL model architecture is primarily about making clever use of MHR. All popular architectures are not only structured into layers, but into repeated groups of layers called blocks or modules.

Most convnets feature pyramid-like structures (feature hierarchies): the number of filters grows with layer depth while the feature maps shrink. Deeper stacks of narrow layers perform better than shallow stacks of large layers, but there is a limit to how deep you can go, caused by the vanishing gradient problem.
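The pyramid pattern can be sketched as simple arithmetic: each downsampling stage halves the spatial size while the channel depth doubles. The sizes below are illustrative, not taken from the book.

```python
def pyramid(input_size, input_channels, num_stages):
    """Return (spatial_size, channels) after each downsampling stage.

    Illustrative sketch: spatial size halves (e.g. via 2x2 max pooling
    or a stride-2 conv) while the filter count doubles at each stage.
    """
    shapes = []
    size, channels = input_size, input_channels
    for _ in range(num_stages):
        size //= 2      # feature maps shrink
        channels *= 2   # channel depth grows
        shapes.append((size, channels))
    return shapes

print(pyramid(128, 32, 3))  # [(64, 64), (32, 128), (16, 256)]
```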

Residual Connections

Residual connections are a fix for the vanishing gradient issue, which is caused by the introduction of noise into the learning process. Each successive function in the DL chain introduces some amount of noise into the gradient signal. If the function chain is too deep, the noise starts to overwhelm the meaningful gradient information, and backpropagation stops working: the model won’t train at all.

To fix this, we force each function in the chain to be nondestructive. We retain a noiseless version of the information contained in the previous input. In a residual connection we simply add the input of a layer or block of layers to its output. The residual connection acts as an information shortcut around destructive or noisy blocks, so gradient information can propagate noiselessly through a deep network.

Note we might need to project the input to the same shape as the output. P. 253 walks through a concrete example.
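The core idea can be sketched in plain NumPy: add the input back onto the block's output, projecting the input first if the shapes differ. The `block` function and random weights below are stand-ins for a real layer block, not Keras code.

```python
import numpy as np

def block(x):
    """Stand-in for a (possibly noisy/destructive) layer block:
    a fixed random channel-mixing transform followed by ReLU."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((x.shape[-1], 64))  # maps channels -> 64
    return np.maximum(x @ w, 0.0)

def residual(x):
    y = block(x)                    # output shape: (..., 64)
    if x.shape[-1] != y.shape[-1]:
        # Project the input to the output's channel count (this is the
        # role a 1x1 conv plays in a real convnet) so the two can be added.
        rng = np.random.default_rng(1)
        proj = rng.standard_normal((x.shape[-1], y.shape[-1]))
        x = x @ proj
    return y + x                    # the shortcut: input added to output

out = residual(np.ones((2, 32)))
print(out.shape)  # (2, 64)
```

The shortcut path contains no destructive transformation, so gradient information can flow through it unchanged however deep the stack gets.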

Batch Normalization

We’ve already encountered the broad concept of normalization: making the different samples seen by an ML model more similar to each other, which helps the model learn and generalize well to new data. The most common form is centering the data on zero by subtracting the mean, and giving it unit standard deviation by dividing by its SD. This assumes the data follows a normal distribution and makes sure it is centered and scaled to unit variance.

You can do this before feeding the data into a model, but what about the data output by intermediate layers? Batch normalization refers to the process of continually normalizing the data during training. It uses the mean and variance of the current batch to re-center the data after it has been fed through layers or blocks. During inference (when a large-enough batch may not be available) it uses an exponential moving average of the batch-wise statistics seen during training.
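A minimal NumPy sketch of that two-mode behavior (simplified: no learnable scale/offset parameters, which the real layer also has):

```python
import numpy as np

def batchnorm_train(x, state, momentum=0.99, eps=1e-3):
    """Normalize a batch using its own statistics, and update the
    exponential moving averages stored in `state` for inference use."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    state["mean"] = momentum * state["mean"] + (1 - momentum) * mean
    state["var"] = momentum * state["var"] + (1 - momentum) * var
    return (x - mean) / np.sqrt(var + eps)

def batchnorm_infer(x, state, eps=1e-3):
    """At inference, use the moving averages, not the current batch."""
    return (x - state["mean"]) / np.sqrt(state["var"] + eps)

state = {"mean": np.zeros(4), "var": np.ones(4)}
batch = np.random.default_rng(0).standard_normal((8, 4)) * 5 + 3
normed = batchnorm_train(batch, state)
print(normed.mean(axis=0))  # approximately 0
print(normed.std(axis=0))   # approximately 1
```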

We don’t really know why it works; it just seems to help gradient descent. But we do know that it does work: some very deep networks can only be trained with batch normalization.

BatchNormalization layers can be used after any layer. Chollet recommends placing the activation after normalization, but this is a bit controversial. Because normalization re-centers the layer’s output, batch normalization removes the need for the previous layer’s bias vector.

Separable Convolutions

Depthwise separable convolution layers perform a spatial convolution on each channel of their input independently, then mix the output channels via a pointwise (1×1) convolution. Effectively, this separates the learning of spatial features from the learning of channel-wise features.

It relies on the assumption that spatial locations in intermediate activations are highly correlated, but different channels are highly independent.

These layers require significantly fewer parameters, and involve fewer computations compared with regular convolution, and have comparable representational power. This produces models that converge faster and are less prone to overfitting. Really helpful if you’re training small models from scratch on limited data.

This is the basis of the Xception architecture, a high-performing convnet that comes packaged with Keras.