cm3015 Topic 08: Mathematical Building Blocks of Neural Networks
Main Info
Title: Mathematical Building Blocks of Neural Networks
Teachers: To come
Semester Taken: October 2021
Parent Module: cm3015 Machine Learning and Neural Networks
Description
Topic 8 develops the key mathematical concepts: tensors, transformations and stochastic gradient descent. This week’s main objective is to discover how data is packed into tensors, the fundamental data representation of neural networks. But first we get a computer to recognise handwritten digits - our first neural network.
Key Reading
Chollet: Chapter 2: Mathematical Building Blocks
Lab Summaries
There is one lab at the end of week 14, where you experiment with different hyperparameters for the MNIST dataset, specifically the learning rate and momentum of the RMSprop optimizer.
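The update rule the lab has you tune can be sketched in NumPy. This is a hedged sketch of an RMSprop-style update with momentum, not the Keras implementation itself, and the hyperparameter values are purely illustrative:

```python
import numpy as np

def rmsprop_step(w, grad, state, learning_rate=0.01, momentum=0.9,
                 rho=0.9, eps=1e-7):
    """One RMSprop-with-momentum update (illustrative sketch, not Keras's code)."""
    s, m = state
    s = rho * s + (1 - rho) * grad ** 2                       # running average of squared gradients
    m = momentum * m + learning_rate * grad / (np.sqrt(s) + eps)
    return w - m, (s, m)

# Toy check: minimise f(w) = w**2, whose gradient is 2w.
w = 5.0
state = (0.0, 0.0)
for _ in range(500):
    w, state = rmsprop_step(w, 2 * w, state)
```

Varying `learning_rate` and `momentum` here changes the convergence behaviour in the same qualitative way the lab explores with Keras.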
Lecture Summaries
Again the lectures mainly follow Chollet’s book, in this case chapter 2. Notes here just cover what they seem to add to Chollet.
The first sequence (8.1) walks through Chollet’s ‘first look’ section in chapter 2, introducing the TF/Keras API. For more on this see Chollet: Chapter 3: Intro to Keras and TensorFlow
8.2 Tensors
The second sequence (8.2) walks through Chollet’s description of tensors, again in ch. 2. It uses NumPy arrays though rather than TF tensors.
8.205 is a useful summary of key tensor attributes.
It summarizes the table:
| Array | Tensor | Maths |
|---|---|---|
| 0D | rank 0 | scalar |
| 1D | rank 1 | vector |
| 2D | rank 2 | matrix |
| 3D | rank 3 | - |
Tensor rank is the number of axes.
Tensor shape is the number of elements along each axis (the length of each axis).
Note that tensor shape is a Python tuple. This explains why the shape of a vector (rank 1 tensor) is, e.g., `(3,)` with the trailing comma: without the comma the expression wouldn’t be a tuple, it would just be the value 3.
Tensor type is the data type of the contained data.
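The three attributes can be read straight off a NumPy array (a minimal sketch; the small array here is just a stand-in shaped like a batch of MNIST images):

```python
import numpy as np

x = np.zeros((6, 28, 28))   # a small batch of 28 x 28 greyscale images
rank = x.ndim               # number of axes: 3
shape = x.shape             # length of each axis: (6, 28, 28)
dtype = x.dtype             # data type of the contained data: float64

v = np.array([1, 2, 3])
vshape = v.shape            # a rank 1 tensor's shape is the one-element tuple (3,)
```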
8.3 Tensor Operations
The third sequence 8.3 reviews Tensor Operations (Chollet). Again it uses NumPy rather than TF.
8.305 on broadcasting is useful. It states the NumPy broadcasting rule, which gives more detail than Chollet:
Tensor shapes are compared right to left.
Dimensions are compatible if they are equal or one of them is 1.
The axis with dimension 1 is copied to match the other.
For example, suppose you want to scale the colours of a 256 x 256 RGB image by a vector of 3 scale factors:
image = 256 x 256 x 3
scale = 3
result = 256 x 256 x 3
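The image-scaling case above can be checked directly in NumPy (the scale factors are just illustrative):

```python
import numpy as np

image = np.random.rand(256, 256, 3)   # a 256 x 256 RGB image
scale = np.array([0.5, 1.0, 2.0])     # one scale factor per colour channel
# Shapes are compared right to left: 3 vs 3 are equal, and scale's missing
# axes are treated as length 1 and copied across the 256 x 256 grid.
result = image * scale
```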
In the following case the first dimension is not compatible, so broadcasting will throw an error:
x = 5 x 1 x 3
y = 4 x 2 x 3
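A quick sketch of the failing case: comparing right to left, 3 vs 3 match and 1 vs 2 is fine (the 1 is copied), but 5 vs 4 is neither equal nor 1, so NumPy raises a `ValueError`:

```python
import numpy as np

x = np.ones((5, 1, 3))
y = np.ones((4, 2, 3))
try:
    x + y
    compatible = True
except ValueError:
    compatible = False   # the first dimensions, 5 and 4, cannot be broadcast
```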
8.307 walks through an example of the tensor dot operation, which Chollet doesn’t do:
Take the operation `relu(dot(w, x) + b)` where `w` is a rank 2 tensor of shape `(512, 784)`, `x` is a rank 1 tensor of shape `(784,)` and `b` is a rank 1 tensor of shape `(512,)`.
Then `dot(w, x)` is going to produce a rank 1 tensor of shape `(512,)`. Element `i` of `dot(w, x)` will be: \[(w \cdot x)_i = \sum^{783}_{j=0} w_{ij} x_j, \qquad i = 0, 1, \dots, 511\]
As `dot(w, x)` is the same shape as `b`, they are added element-wise. Then `relu` is applied element-wise to the result of that. This gives an output of the layer of a rank 1 tensor of shape `(512,)`, the composition of elementary tensor operations.
Video 8.311 includes a definition of an affine transformation as being a linear transformation followed by a translation (ie \(ax + b\)).
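The layer computation walked through in 8.307 can be written directly in NumPy (a sketch with random weights, just to show the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 784))   # rank 2 weights, shape (512, 784)
x = rng.standard_normal(784)          # rank 1 input, shape (784,)
b = np.zeros(512)                     # rank 1 bias, shape (512,)

z = w @ x                     # dot(w, x): contracts the 784 axis, shape (512,)
out = np.maximum(z + b, 0.0)  # add b element-wise, then relu element-wise
```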
Adding a non-linear activation function introduces distortions in the space (allowing more complex functions to be learned).
8.4 Gradient Descent
8.4 reviews Gradient-Based Optimization (Chollet).
8.405 goes into a bit more detail on differential calculus than the textbook.
8.407 points out you often see the gradient tensor written as \(\nabla f\) (nabla f). It gives some more mathematical notation which Chollet avoids. It goes into a bit more detail on momentum use in optimizers.
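The momentum idea touched on in 8.407 can be sketched as a velocity that accumulates past gradients. This is a toy sketch on \(f(w) = w^2\), whose gradient is \(\nabla f = 2w\); the hyperparameter values are illustrative only:

```python
learning_rate = 0.1
momentum = 0.9

w, velocity = 5.0, 0.0
for _ in range(100):
    grad = 2 * w                                     # nabla f for f(w) = w**2
    velocity = momentum * velocity - learning_rate * grad
    w = w + velocity                                 # past gradients keep contributing
```

With `momentum = 0.0` this reduces to plain gradient descent; a larger momentum lets the update carry speed through flat regions but can overshoot the minimum.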