Alex's Notes

Chollet: Chapter 04: Getting Started with Neural Networks

Metadata

  • Title: Getting Started with Neural Networks: Classification and Regression

  • Number: 4

  • Book: Chollet: Deep Learning

Core Ideas

This chapter introduces three ‘real-world’ case studies covering the most common uses of neural networks: binary classification, multiclass classification, and scalar regression.

Glossary

Term: Definition
Sample or input: One data point that goes into your model.
Prediction or output: What comes out of your model.
Target: The truth. What your model should have predicted, according to an external source of truth.
Prediction error or loss value: A measure of the distance between your model’s prediction and the target.
Classes: A set of possible labels to choose from in a classification problem. For example, when classifying cat and dog pictures, “dog” and “cat” are the two classes.
Label: A specific instance of a class annotation in a classification problem. For instance, if picture #1234 is annotated as containing the class “dog,” then “dog” is a label of picture #1234.
Ground-truth or annotations: All targets for a dataset, typically collected by humans.
Binary Classification: A classification task where each input sample should be categorized into one of two exclusive categories.
Multiclass Classification: A classification task where each input sample should be categorized into one of more than two categories: for instance, classifying handwritten digits.
Multilabel Classification: A classification task where each input sample can be assigned multiple labels. For instance, a given image may contain both a cat and a dog and should be annotated both with the “cat” label and the “dog” label. The number of labels per image is usually variable.
Scalar Regression: A task where the target is a continuous scalar value. Predicting house prices is a good example: the different target prices form a continuous space.
Vector Regression: A task where the target is a set of continuous values: for example, a continuous vector. If you’re doing regression against multiple values (such as the coordinates of a bounding box in an image), then you’re doing vector regression.
Mini-batch or batch: A small set of samples (typically between 8 and 128) that are processed simultaneously by the model. The number of samples is often a power of 2, to facilitate memory allocation on GPU. When training, a mini-batch is used to compute a single gradient-descent update applied to the weights of the model.

Binary Classification

The first example is binary classification on the IMDB dataset of movie reviews, each labelled positive or negative.

The dataset comes pre-processed in Keras, with words mapped to integers from a lexicon. The reviews, then, are lists of integers, but their lengths vary, and you can’t feed variable-length inputs into a neural network. There are two ways around this:

  • Pad the lists so they all have the same length, turning them into a tensor of shape (samples, max_length). Then you can start your network with an Embedding layer capable of handling such integer tensors. (Embeddings are covered in more detail in the text chapters).

  • Multi-hot encode the lists to turn them into vectors of 0s and 1s, where 1 indicates the presence of the corresponding token from the lexicon. Then you can use a Dense layer as the first layer in the model.

    Here’s an example function that multi-hot vectorizes a sequence of integers:

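    import numpy as np

    def vectorize_sequences(sequences, dimension=10000):
        # One row per sequence, one column per word index in the lexicon.
        results = np.zeros((len(sequences), dimension))
        for i, sequence in enumerate(sequences):
            for j in sequence:
                # Mark word index j as present in sample i.
                results[i, j] = 1.
        return results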
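
The first option above (padding) can be sketched in plain Python too. This is only an illustration of the idea: pad_to_length is a made-up helper name, padding at the end with 0 is an assumption, and it does not try to reproduce the padding utility that ships with Keras.

```python
def pad_to_length(sequences, max_length, pad_value=0):
    """Truncate or right-pad each list of ints to exactly max_length items."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:max_length])
        # Fill the remainder (if any) with the padding value.
        padded.append(clipped + [pad_value] * (max_length - len(clipped)))
    return padded


# Toy word-index lists of different lengths (hypothetical data).
reviews = [[1, 14, 22], [1, 5], [1, 7, 9, 4, 2]]
batch = pad_to_length(reviews, max_length=4)
# Every row now has the same length, i.e. shape (samples, max_length) = (3, 4).
```

Unlike multi-hot encoding, this keeps word order, which is why it pairs with an Embedding layer rather than a Dense first layer.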
    

Building the model

This is one of the simplest problem setups you’ll encounter. Input data is vectors, labels are scalars. A plain stack of densely connected (Dense) layers with relu activations will work well for this kind of problem.

So we need to decide how many layers to use, and how many units to use for each layer. Formal principles for this are covered in ch. 5. For now we just use two intermediate layers with 16 units each, and a third (output) layer which will give the scalar prediction. So we define the model as:

model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])

The first argument passed to a layer is its number of units, which is the dimensionality of the representation space of that layer. Having 16 units means that the weight matrix W will have shape (input_dimension, 16). We’ll take the dot product of the input data with this matrix and so project it onto a 16-dimensional representation. Then we’ll add the bias vector to the result, and apply the activation function.
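
To make the shapes concrete, here is a rough sketch of the projection step (dot product plus bias, before the activation) for a single input vector, in plain Python. The name dense_projection and the toy numbers are made up for illustration; a real Dense layer does this on whole batches with optimized tensor ops.

```python
def dense_projection(x, W, b):
    """Return dot(x, W) + b for one input vector x, before any activation."""
    n_units = len(b)
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        for j in range(n_units)
    ]


x = [1.0, 0.0, 1.0]                          # toy input, input_dimension = 3
W = [[0.5, -1.0], [2.0, 2.0], [0.25, 0.5]]   # shape (input_dimension, units) = (3, 2)
b = [0.1, 0.0]                               # one bias per unit
z = dense_projection(x, W, b)                # one value per unit
```

With 16 units, W would have 16 columns and z would have 16 entries, whatever the input dimension.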

Intuitively, think of the dimensionality of the representation space as how much freedom you’re allowing the model to have when learning internal representations of the data. More units = more dimensions = more complex representations. But, this adds computational cost, and means your model could learn unwanted patterns that reduce its ability to generalize.

The ‘relu’ activation function zeros out negative values, while the final sigmoid activation function squashes values into the range [0, 1], so the output can be interpreted as a probability.
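
A quick sketch of the two activations as scalar functions, to see what they do to a few sample values. The input values are arbitrary; in Keras these are applied element-wise to whole tensors.

```python
import math


def relu(z):
    """Zero out negative values, pass positive values through unchanged."""
    return max(0.0, z)


def sigmoid(z):
    """Squash any real number into the interval between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


relu_outputs = [relu(z) for z in (-2.0, 0.0, 3.0)]
sigmoid_outputs = [sigmoid(z) for z in (-4.0, 0.0, 4.0)]
```

A sigmoid output of 0.5 corresponds to an input of 0, i.e. the model being undecided between the two classes.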

You need to choose a loss function. When dealing with models that output probabilities, crossentropy is usually the best choice. Here we have two classes, so we can use binary crossentropy. You could use mean squared error instead, but it’s likely best to stick with crossentropy.
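
For intuition, binary crossentropy can be written out by hand as the mean of -[y*log(p) + (1-y)*log(1-p)]. The sketch below is a plain-Python version of that formula, not the Keras implementation; the eps clipping is my own assumption to avoid log(0), and the sample predictions are invented.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability away from 0 and 1 so log() stays finite (assumption).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1]
confident_right = binary_crossentropy(labels, [0.9, 0.1, 0.8])
confident_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.2])
```

Confident wrong predictions are penalized far more heavily than confident right ones are rewarded, which is what makes this loss a good fit for probability outputs.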

Finally you need to choose an optimizer. rmsprop is a good default for virtually any problem.

So we can compile the model like this:

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])

pp. 102-4 go through the process of training the model and plotting its loss on the training and validation sets, so you can see when the model starts to overfit. In this case overfitting sets in after 4 or 5 epochs, so we retrain and stop at that point.
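
Besides eyeballing the plot, you can pick the stopping point programmatically. In Keras, model.fit returns a History object whose history dict holds per-epoch lists such as "val_loss" when validation data is supplied; the sketch below works on such a list. The helper name best_epoch is made up, and the loss values are invented purely for illustration (they are not results from the book).

```python
def best_epoch(val_losses):
    """Return the 1-based epoch whose validation loss is lowest."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Made-up numbers for illustration only, NOT results from the book.
val_loss = [0.38, 0.31, 0.29, 0.28, 0.30, 0.33, 0.37]
epochs_to_train = best_epoch(val_loss)
```

Validation loss falling and then rising again is the overfitting signature: past the minimum, the model keeps improving on the training data only.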

Multiclass Classification

The chapter then uses the Reuters dataset as a single-label multiclass classification problem.

The documents are vectorized as before. Since we now have many classes, we can encode the labels as one-hot vectors, i.e. all 0s except for a 1 at the index of the correct label. Keras has a built-in for this (from tensorflow.keras.utils import to_categorical), and it’s also easy to do yourself: there’s sample code on p. 108.
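
As a rough idea of what doing it yourself looks like, here is a plain-Python sketch. It is not the book's code or the Keras utility; the helper name and the toy labels are my own.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into lists that are all 0s except a single 1."""
    vectors = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # the position of the 1 encodes the class
        vectors.append(row)
    return vectors


# Toy example with 5 classes (hypothetical labels).
encoded = to_one_hot([3, 0, 4], num_classes=5)
```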

Model Design

So the problem’s similar to the first, but now we’re dealing with 46 classes rather than 2. This raises an issue. Each layer can only access information output from the previous layer. If a layer drops relevant information, all subsequent layers will miss it too. Using only 16 units may be too limited to discriminate between 46 different classes. So we use larger layers (64 units).

Our output layer also needs to be larger, 46 units, one for each class. We’ll also need a different activation function for the final layer. softmax will output a probability distribution over the 46 classes.
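
A minimal sketch of what softmax does to the raw outputs of the last layer, in plain Python. Subtracting the maximum first is a common numerical-stability trick and is my addition; the input scores are arbitrary.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    top = max(scores)
    # Shifting by the max does not change the result but avoids overflow.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

With 46 output units, the same function yields 46 non-negative values summing to 1, one probability per class.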

For the loss function, we need categorical_crossentropy which measures the distance between two probability distributions. So our final model looks like this:

model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(46, activation="softmax")
    ])

model.compile(optimizer="rmsprop",
	      loss="categorical_crossentropy",
	      metrics=["accuracy"])

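Evaluation works much as before, and we end up with an accuracy of about 80%. Is this good? With 46 unevenly distributed classes it’s not obvious what chance level looks like, so it helps to measure it (the book reports roughly 19% for a random classifier, so 80% looks good by comparison).

Some code to quickly test a random baseline: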

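import copy
import numpy as np

# Shuffle a copy of the true labels to simulate a random classifier.
test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
# Fraction of positions where the shuffled label matches the true one.
hits_array = np.array(test_labels) == np.array(test_labels_copy)
hits_array.mean()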

When the model is used to predict a label, it returns a vector holding a probability distribution across the 46 classes. The predicted class is the index of the highest value: np.argmax(predictions[0]) for the first prediction.

There is another way to handle the labels: instead of one-hot vectors, you can keep them as an integer tensor. The loss function then changes to sparse_categorical_crossentropy rather than categorical_crossentropy. It’s the same math with a different interface.

The chapter shows the impact of an information bottleneck by shrinking one of the layers to 4 units; accuracy drops by about 8% (absolute) as a result.
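
The "same math" claim is easy to check by hand for a single sample. The sketch below is plain Python rather than the Keras losses (which also average over a batch); the probabilities are invented.

```python
import math


def categorical_crossentropy(one_hot_target, probs):
    """-sum(t * log(p)); only the term for the true class is non-zero."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))


def sparse_categorical_crossentropy(class_index, probs):
    """Same quantity, but the target is given as an integer class index."""
    return -math.log(probs[class_index])


probs = [0.7, 0.2, 0.1]  # toy predicted distribution over 3 classes
loss_one_hot = categorical_crossentropy([0.0, 1.0, 0.0], probs)
loss_sparse = sparse_categorical_crossentropy(1, probs)
```

Both reduce to minus the log of the probability assigned to the true class, so the choice is only about how the labels are stored.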

Regression

Another common ML problem type is regression, where the goal is to predict a continuous value rather than a discrete label. For example, predicting the temperature, or the time a project will take, or a price.

The chapter walks through an example using the Boston house price dataset from the 70s.

The main data preparation step involves feature-wise normalization. The issue is that we don’t want to feed in data with wildly different ranges into our model. With feature-wise normalization we do the following to avoid that:

For each feature in the input data we subtract the mean of the feature and divide by the standard deviation, so the feature is centered around 0 and has a unit standard deviation. In Numpy this is straightforward:

mean = train_data.mean(axis=0)
train_data -= mean

std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std

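Note that the mean and standard deviation are computed on the training data only, then applied to the test data. Never use a quantity computed on the test data anywhere in your workflow, even for something as simple as normalization!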

Building the Model

In general, the less training data you have, the worse you’ll suffer from overfitting. This dataset is small, so we’ll use a small model to minimize this risk.

A typical setup for scalar regression is for the final layer to use a single unit and no activation (ie it will be linear). It means the model is free to predict values of any range.

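The loss function is often mean squared error (MSE). We can also monitor mean absolute error (MAE) to get a sense of how far off the predictions are on average, in the same units as the target (it says nothing about whether we’re over or under). This leaves the final model: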

model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1)
    ])

model.compile(optimizer="rmsprop",
	      loss="mse",
	      metrics=["mae"])
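
To see how the loss and the monitored metric differ, here is a small plain-Python sketch of both. The targets and predictions are toy values, not taken from the dataset.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in target units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only.
targets = [20.0, 30.0, 25.0]
predictions = [22.0, 27.0, 25.0]
mse_value = mse(targets, predictions)
mae_value = mae(targets, predictions)
```

Squaring makes MSE punish large misses disproportionately, which suits it as a training loss, while MAE is easier to read as "off by this much on average".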

For validation, the chapter walks through K-fold validation, which is a good idea here because the dataset is so small. It’s already covered in the module, so it’s not noted here. We struggle to get a great result on the dataset.
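
As a quick reminder of the mechanics only, the index bookkeeping behind K-fold validation can be sketched like this. The helper name is made up, the folds are contiguous with no shuffling, and any leftover samples simply stay in the training part; all of that is a simplifying assumption rather than the book's exact code.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = start + fold_size
        val_idx = list(range(start, stop))
        # Everything outside the validation slice is used for training.
        train_idx = list(range(0, start)) + list(range(stop, num_samples))
        yield train_idx, val_idx


folds = list(k_fold_indices(num_samples=10, k=4))
```

A fresh model is trained on each train split and scored on the matching validation split, and the final validation score is the average across the k folds.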