Keras API Basics (Chollet)
As presented in Chollet, Chapter 3: Intro to Keras and TensorFlow
Layers
The fundamental data structure in neural networks is the layer. A layer is a data processing module that takes one or more tensors as input and outputs one or more tensors. Some layers are stateless, but most have a state: the layer’s weights, one or more tensors learned with stochastic gradient descent, which together contain the network’s knowledge.
Think of layers as the LEGO bricks of deep learning. You clip together compatible layers to form useful data-transformation pipelines.
Different types of layer are useful for different types of data processing, and different tensor shapes. For example:
Simple vector data, stored in rank-2 tensors of shape (samples, features), is often processed by densely connected layers, also called fully connected or dense layers. In Keras this is the Dense class.
Sequence data, like text or time series, stored in rank-3 tensors of shape (samples, timesteps, features), is typically processed by recurrent layers like an LSTM layer, or 1D convolution layers like Conv1D.
Image data, stored in rank-4 tensors of shape (samples, height, width, channels), is usually processed by 2D convolution layers like Conv2D.
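To make the shape conventions above concrete, here is a small sketch that pushes toy batches through each layer type. The batch sizes, feature counts, and layer sizes are arbitrary values chosen for illustration, not anything prescribed by the book.

```python
import numpy as np
from tensorflow import keras

# Hypothetical toy batches, one per data type described above.
vectors = np.zeros((4, 16), dtype="float32")        # (samples, features)
sequences = np.zeros((4, 10, 16), dtype="float32")  # (samples, timesteps, features)
images = np.zeros((4, 28, 28, 3), dtype="float32")  # (samples, height, width, channels)

# Each layer accepts the tensor rank it was designed for.
dense_out = keras.layers.Dense(8)(vectors)
lstm_out = keras.layers.LSTM(8)(sequences)
conv1d_out = keras.layers.Conv1D(filters=8, kernel_size=3)(sequences)
conv2d_out = keras.layers.Conv2D(filters=8, kernel_size=3)(images)

for name, out in [("Dense", dense_out), ("LSTM", lstm_out),
                  ("Conv1D", conv1d_out), ("Conv2D", conv2d_out)]:
    print(name, tuple(out.shape))
```

With the default settings the LSTM returns only its last timestep, and the convolutions use no padding, so they shrink the timestep and spatial dimensions by kernel_size - 1.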
The Base Layer Class
Keras is based on the philosophy that a simple API should have a single abstraction around which everything is centered. In Keras’s case that is the Layer class. Everything in Keras is a Layer or something that interacts closely with a Layer.
A Layer is an object that encapsulates some state (weights) and some computation (the forward pass).
The weights are typically defined in a build() method rather than in the constructor.
The computation is defined in a call() method.
Here’s what a simple dense layer that implements output = activation(dot(input, W) + b) looks like:
import tensorflow as tf
from tensorflow import keras

class SimpleDense(keras.layers.Layer):
    def __init__(self, units, activation=None):
        super().__init__()
        self.units = units
        self.activation = activation

    def build(self, input_shape):
        # Weights are created here, once the input shape is known.
        input_dim = input_shape[-1]
        self.W = self.add_weight(shape=(input_dim, self.units),
                                 initializer='random_normal')
        self.b = self.add_weight(shape=(self.units,),
                                 initializer='zeros')

    def call(self, inputs):
        # Forward pass: output = activation(dot(input, W) + b)
        y = tf.matmul(inputs, self.W) + self.b
        if self.activation is not None:
            y = self.activation(y)
        return y
You can then use it by instantiating it and calling it:
inputs = tf.ones(shape=(2, 784))  # example batch: 2 samples, 784 features
my_dense = SimpleDense(units=32, activation=tf.nn.relu)
output = my_dense(inputs)
So why did we define build and call methods, then invoke the layer through its __call__ method? Because under the hood Keras creates the state on the fly.
Remember that layers need to be compatible: every layer only accepts inputs of a certain shape and returns outputs of a certain shape. If we had to specify the input shape of every layer on creation, we’d end up with verbose, brittle pipelines. This gets even worse when the logic determining the output shape is complex.
So Keras implements automatic shape inference for us. This builds the layer’s weights the first time the layer is called, with the appropriate input shape. Under the hood it is basically doing this:
def __call__(self, inputs):
    if not self.built:
        self.build(inputs.shape)
        self.built = True
    return self.call(inputs)
Now we just have to specify the output size when we instantiate a layer, and Keras takes care of the rest. The __call__ method does a lot more than this, but these are the basics.
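The lazy-build mechanism can be tried out without Keras at all. The following is a framework-free sketch using NumPy; LazyDense is a hypothetical toy class written for this note, not the real keras.layers.Layer, and it leaves out everything else the real __call__ does.

```python
import numpy as np

class LazyDense:
    """Toy stand-in for a Keras layer (hypothetical, for illustration only)."""

    def __init__(self, units):
        self.units = units
        self.built = False  # no weights exist yet

    def build(self, input_shape):
        # The input dimension is only known once real data arrives.
        input_dim = input_shape[-1]
        rng = np.random.default_rng(0)
        self.W = rng.normal(scale=0.05, size=(input_dim, self.units))
        self.b = np.zeros(self.units)

    def call(self, inputs):
        return inputs @ self.W + self.b

    def __call__(self, inputs):
        # Create the state on first use, then run the forward pass.
        if not self.built:
            self.build(inputs.shape)
            self.built = True
        return self.call(inputs)

layer = LazyDense(units=32)
built_before = layer.built          # weights not created yet
outputs = layer(np.ones((2, 784)))  # first call triggers build()
print(built_before, layer.built, layer.W.shape, outputs.shape)
```

Only the number of units is given up front; the 784 is inferred from the first batch the layer sees.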
Models
A deep learning model is a graph of layers. In Keras it’s represented by the Model class.
Sequential models are a subclass of the base Model class. They are simple ‘feed forward’ stacks of layers that map a single input to a single output. But there are a variety of network topologies that get much more involved. Some common ones include two-branch networks, multihead networks, and residual connections. See, e.g., the Transformer architecture.
You can build models by subclassing the Model class or using the functional API. The book digs into the details in ch. 7.
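As a rough preview of those options, here is the same two-layer network written three ways. The layer sizes, the input width of 3, and the SmallModel name are placeholders picked for this sketch.

```python
import numpy as np
from tensorflow import keras

# 1. Sequential: a plain stack of layers.
sequential_model = keras.Sequential([
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])

# 2. Functional API: the same stack written as an explicit graph of tensors.
inputs = keras.Input(shape=(3,))
hidden = keras.layers.Dense(16, activation="relu")(inputs)
outputs = keras.layers.Dense(1)(hidden)
functional_model = keras.Model(inputs=inputs, outputs=outputs)

# 3. Subclassing: layers created in __init__, forward pass written in call().
class SmallModel(keras.Model):
    def __init__(self):
        super().__init__()
        self.hidden = keras.layers.Dense(16, activation="relu")
        self.out = keras.layers.Dense(1)

    def call(self, inputs):
        return self.out(self.hidden(inputs))

subclassed_model = SmallModel()

# All three map a batch of shape (2, 3) to predictions of shape (2, 1).
batch = np.ones((2, 3), dtype="float32")
seq_out = sequential_model(batch)
fun_out = functional_model(batch)
sub_out = subclassed_model(batch)
```

The three models have separately initialized weights, so their outputs differ in value; only the architecture is the same.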
The topology of the model defines the hypothesis space. Remember that ML involves the search for useful representations in a predefined space of possibilities. The network topology restricts that space to a specific series of tensor operations; you then search for a good set of values for the weight tensors in these operations.
The network architecture encodes the assumptions you make about your data. If you use a single Dense layer with a pure affine transformation for a two-class classification problem, you are assuming the classes are linearly separable. The book teaches some guiding principles for architecture selection, but a lot of it is experience and intuition that builds up over time.
Compiling a Model
Once you’ve defined the architecture there are three more things to choose:
Loss function (objective function): the quantity that will be minimized during training. The measure of success for the task.
Optimizer: Determines how the network will be updated based on the loss function. Implements a flavour of SGD.
Metrics: The measures of success you want to monitor during training and validation. The optimizer won’t optimize directly for these, so they don’t need to be differentiable.
In Keras we specify these with the model’s compile method. For example:
model.compile(optimizer="rmsprop", loss="mean_squared_error", metrics=["accuracy"])
The strings here are just shortcuts that get converted to objects. You can pass instance objects directly instead. This is good if you want to use your own custom losses, metrics, optimizer or if you want to specify a parameter that is not the default. For example if you want to specify the learning rate for the optimizer. Generally you won’t need to build your own losses, metrics, or optimizers from scratch as Keras implements a huge variety of the main ones.
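As a sketch of passing objects instead of strings, the snippet below sets a non-default learning rate and plugs in a hand-written loss. The custom_mse function and the learning rate value are made up for illustration; Keras accepts any callable taking (y_true, y_pred) as a loss.

```python
import tensorflow as tf
from tensorflow import keras

def custom_mse(y_true, y_pred):
    """Hypothetical custom loss: mean squared error, one value per sample."""
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-4),  # non-default rate
    loss=custom_mse,                                         # callable, not a string
    metrics=[keras.metrics.MeanAbsoluteError()],             # metric object
)

# Quick sanity check of the loss on one sample: errors are 0 and 2.
sample_loss = custom_mse(tf.constant([[1.0, 2.0]]), tf.constant([[1.0, 4.0]]))
```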
Picking a Loss Function
The choice of loss function is crucial. Models will ruthlessly pursue minimizing the loss, regardless of side-effects.
When it comes to common problems like classification, regression, and sequence prediction, there are simple guidelines to follow.
For example, for two-class classification you use binary crossentropy, and for many-class classification you use categorical crossentropy. Only if you are working on new research problems will you have to define your own loss function.
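To see what these two losses actually measure, here is a NumPy sketch of the standard formulas. The function names mirror the Keras string names but these are hand-written helpers, not the Keras implementations, and the eps clipping value is an assumption to avoid log(0).

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over samples."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

def categorical_crossentropy(y_true_onehot, p_pred, eps=1e-7):
    """Mean of -sum_k y_k*log(p_k) over samples, with one-hot targets."""
    p = np.clip(p_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true_onehot * np.log(p), axis=-1)))

# A model that predicts 0.5 for both classes is maximally unsure.
bce = binary_crossentropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))

# Three classes; the true class (index 1) gets probability 0.5.
cce = categorical_crossentropy(np.array([[0.0, 1.0, 0.0]]),
                               np.array([[0.25, 0.5, 0.25]]))
```

Both values come out to log(2): in each case the model assigns probability 0.5 to the correct answer.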
Fitting a Model
After compilation comes fitting. The fit() method implements the training loop itself. It takes the following main arguments:
The data (inputs, targets) to train on. Either NumPy arrays or a TensorFlow Dataset object.
The number of epochs to train for: how many times the training loop should iterate over the data passed.
The batch size to use within each epoch of mini-batch gradient descent. This is the number of samples considered to compute the gradients for one update step.
A typical call would look like this:
history = model.fit(inputs, targets, epochs=5, batch_size=128)
This returns a History object with a history field. That field is a dict mapping keys such as “loss” or the metric names to lists of their per-epoch values.
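To show how epochs, batch size, and the history dict relate, here is a framework-free sketch of a mini-batch training loop for a linear model. The fit_linear helper, the synthetic data, and the learning rate are all invented for this note. One simplification: the sketch recomputes the loss on the full dataset at the end of each epoch, whereas Keras reports a running average over the epoch’s batches.

```python
import numpy as np

def fit_linear(inputs, targets, epochs, batch_size, learning_rate=0.1):
    """Hypothetical helper: mini-batch SGD on a linear model y = x @ w + b."""
    num_samples, num_features = inputs.shape
    w = np.zeros(num_features)
    b = 0.0
    history = {"loss": []}
    num_updates = 0
    for _ in range(epochs):                              # one pass over the data
        for start in range(0, num_samples, batch_size):  # one update per batch
            x = inputs[start:start + batch_size]
            y = targets[start:start + batch_size]
            error = x @ w + b - y
            # Gradients of the mean squared error on this batch.
            grad_w = 2.0 * (x.T @ error) / len(x)
            grad_b = 2.0 * error.mean()
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            num_updates += 1
        # Record one loss value per epoch, like History.history["loss"].
        history["loss"].append(float(np.mean((inputs @ w + b - targets) ** 2)))
    return history, num_updates

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 3))
targets = inputs @ np.array([1.0, -2.0, 0.5]) + 0.3
history, num_updates = fit_linear(inputs, targets, epochs=5, batch_size=128)
```

With 1000 samples and a batch size of 128, each epoch performs 8 update steps (the last batch is smaller), so 5 epochs give 40 updates and 5 entries in history["loss"].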
Reserving a validation set
You can also pass the fit method a reserved validation set of samples that will be used to calculate the validation loss during the training process.
It looks like this:
import numpy as np
from tensorflow import keras

# `inputs` and `targets` are assumed to be NumPy arrays defined earlier.
model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.1),
              loss=keras.losses.MeanSquaredError(),
              metrics=[keras.metrics.BinaryAccuracy()])

# Shuffle inputs and targets with the same permutation so they stay aligned.
indices_permutation = np.random.permutation(len(inputs))
shuffled_inputs = inputs[indices_permutation]
shuffled_targets = targets[indices_permutation]

# Hold out 30% of the samples for validation.
num_validation_samples = int(0.3 * len(inputs))
val_inputs = shuffled_inputs[:num_validation_samples]
val_targets = shuffled_targets[:num_validation_samples]
training_inputs = shuffled_inputs[num_validation_samples:]
training_targets = shuffled_targets[num_validation_samples:]

model.fit(
    training_inputs,
    training_targets,
    epochs=5,
    batch_size=16,
    validation_data=(val_inputs, val_targets)
)
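The shuffle-and-split step can be wrapped in a small reusable function. This is a sketch, not something from the book: split_train_validation, its parameters, and the toy data are hypothetical, and the seed is only there to make the split reproducible.

```python
import numpy as np

def split_train_validation(inputs, targets, validation_fraction=0.3, seed=0):
    """Hypothetical helper: shuffle inputs and targets together, then split."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(len(inputs))
    num_validation = int(validation_fraction * len(inputs))
    val_idx = permutation[:num_validation]
    train_idx = permutation[num_validation:]
    # Indexing both arrays with the same indices keeps each pair aligned.
    return (inputs[train_idx], targets[train_idx]), (inputs[val_idx], targets[val_idx])

# Toy data where each target is the sum of its input row, so alignment is checkable.
inputs = np.arange(20, dtype="float32").reshape(10, 2)
targets = inputs.sum(axis=1)
(train_x, train_y), (val_x, val_y) = split_train_validation(inputs, targets)
```

The returned pairs can be passed to fit as the training data and as validation_data respectively.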
You can also get validation loss and metrics after the fact, by calling the model’s evaluate method like this:
model.evaluate(val_inputs, val_targets, batch_size=128)
This will iterate over the data in batches and return a list of scalars. The first entry is the validation loss, and the remaining entries are the validation metrics.
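Here is a small sketch of unpacking that return value. The model, the random data, and the choice of mean absolute error as the single metric are placeholders; the model is left untrained because only the structure of the result matters here.

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer="rmsprop",
              loss="mean_squared_error",
              metrics=[keras.metrics.MeanAbsoluteError()])

# Hypothetical validation data: 256 samples with 3 features each.
val_inputs = np.random.random((256, 3)).astype("float32")
val_targets = np.random.random((256, 1)).astype("float32")

# Default return value: [loss, metric_1, metric_2, ...]
results = model.evaluate(val_inputs, val_targets, batch_size=128, verbose=0)
val_loss, val_mae = results

# With return_dict=True the same values come back keyed by name.
named_results = model.evaluate(val_inputs, val_targets, batch_size=128,
                               verbose=0, return_dict=True)
```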
Using a Model for Inference
Once a model is fit, you can use it for inference on new data with the predict method. Don’t just call the model directly on a large dataset, as this will process everything in one go and can run out of memory. Here’s a sample predict call:
predictions = model.predict(new_inputs, batch_size=128)
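Conceptually, batched prediction amounts to slicing the inputs, running each slice through the model, and stitching the results back together. The sketch below shows that idea in plain NumPy; predict_in_batches and toy_model are hypothetical stand-ins, and the real predict method does more than this.

```python
import numpy as np

def predict_in_batches(predict_fn, inputs, batch_size=128):
    """Hypothetical helper: apply predict_fn to one slice of inputs at a time."""
    outputs = []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        outputs.append(predict_fn(batch))   # only one batch is processed at once
    return np.concatenate(outputs, axis=0)

# Toy "model": a fixed linear map from 3 features to 1 output.
weights = np.array([[1.0], [2.0], [3.0]])

def toy_model(x):
    return x @ weights

new_inputs = np.arange(900, dtype="float64").reshape(300, 3)
predictions = predict_in_batches(toy_model, new_inputs, batch_size=128)
```

The result matches what a single call on all 300 samples would give, but peak memory use is bounded by the batch size.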
That’s it!