cm3015 Topic 10: Fundamentals of Machine Learning
Main Info
Title: Fundamentals of Machine Learning
Teachers: Tim Blackwell
Semester Taken: October 2021
Parent Module: cm3015 Machine Learning and Neural Networks
Description
We end our tour of deep neural networks with practical advice: data preprocessing, spotting and dealing with under- and over-fitting, and the universal machine learning workflow.
Key Reading
Chollet: Chapter 5: Fundamentals of Machine Learning
Chollet: Chapter 6: Universal Workflow
Lecture Summaries
Week 19 lectures cover Chollet: Chapter 5: Fundamentals of Machine Learning
10.2 covers validation and evaluation.
Simple hold-out validation reserves a small portion of the training data to validate the model during training; once tuning is complete, that portion is folded back in to train the final model. A separate test set is kept for final evaluation. With small datasets this method is vulnerable to statistical variation, so check how scores vary across different splits.
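A minimal sketch of the hold-out split (the helper name and 20% fraction are my own choices, not from the book):

```python
import numpy as np

def holdout_split(data, labels, val_fraction=0.2, seed=0):
    """Shuffle, then reserve a fraction of the training data for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_val = int(len(data) * val_fraction)
    return (data[idx[n_val:]], labels[idx[n_val:]],   # training split
            data[idx[:n_val]], labels[idx[:n_val]])   # validation split

x = np.arange(300, dtype=float).reshape(100, 3)  # toy data: 100 samples, 3 features
y = np.arange(100)
x_tr, y_tr, x_val, y_val = holdout_split(x, y)
print(len(x_tr), len(x_val))  # 80 20
```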
K-fold validation. Split the data into K equal partitions. For each fold, reserve one partition for validation and train on the remaining K-1. Repeat K times and average the scores.
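The K-fold procedure can be sketched as index generation (my own helper, not library code; in practice `sklearn.model_selection.KFold` does the same job):

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx): each partition serves as validation once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

fold_sizes = []
for train_idx, val_idx in k_fold_indices(10, k=5):
    # train on train_idx, evaluate on val_idx; here we just record fold sizes
    fold_sizes.append(len(val_idx))
print(fold_sizes)  # [2, 2, 2, 2, 2]
```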
10.3 covers data pre-processing.
All inputs must be floating point tensors. Gradient descent works best if data is homogeneous and constrained. Aim for inputs in [0,1] or [-1,1], and initialize weights/biases with small values to avoid dominance.
All features should take values in roughly the same range. Inhomogeneity can trigger large gradient updates and hamper convergence.
Normalization to a mean of 0 and a standard deviation of 1 can help, but is not always necessary.
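A quick sketch of feature-wise normalization. One detail worth making explicit: the statistics come from the training data only, and the same transform is reused on the test data.

```python
import numpy as np

x_train = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 600.0]])
x_test = np.array([[2.0, 300.0]])

mean = x_train.mean(axis=0)      # per-feature mean
std = x_train.std(axis=0)        # per-feature standard deviation
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std   # reuse training statistics on test data

print(x_train.mean(axis=0))  # ~[0, 0]
print(x_train.std(axis=0))   # ~[1, 1]
```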
Missing data: perhaps try filling missing values with 0 (provided 0 is not already meaningful); the model will learn to ignore it. If the test set has missing data but the training set doesn't, duplicate some training records and randomly zero out some of their values to reflect the missing data.
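A sketch of that augmentation trick: copy training records and zero random entries so the network sees the missing-value marker (0) during training. The function name and the 10% fraction are my own choices.

```python
import numpy as np

def add_simulated_missing(x_train, missing_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x_aug = x_train.copy()
    mask = rng.random(x_aug.shape) < missing_fraction
    x_aug[mask] = 0.0                        # 0 plays the "missing" role
    return np.concatenate([x_train, x_aug])  # originals + corrupted copies

x = np.ones((50, 4))
x_big = add_simulated_missing(x)
print(x_big.shape)  # (100, 4)
```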
10.4 covers overfitting/underfitting.
Looks at three regularization techniques:
Reducing network capacity (the number of trainable parameters). A low-capacity model will underfit; a high-capacity model can learn a perfect mapping of the training data that fails to generalize. We want sufficient but limited capacity in order to generalize. Begin with a small model, then increase capacity until validation performance flattens.
Adding weight regularization. We prefer simple models: parameters are "regular" if they occupy a small interval close to 0. Large weight parameters amplify input noise, so the network will attempt to fit the noise; a model with smaller weights is more robust. Regularization is the attempt to prevent weights from getting too big. There are a couple of approaches.
In L1 regularization we add a cost proportional to the absolute value of the weight parameters to the loss function; in L2 regularization the added cost is proportional to the square of the weight parameters.
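A minimal sketch of how the two penalties enter the loss. The regularization strength `lam` is a hyperparameter; the value here is arbitrary.

```python
import numpy as np

def l1_penalty(weights, lam=0.001):
    return lam * np.abs(weights).sum()     # cost proportional to |w|

def l2_penalty(weights, lam=0.001):
    return lam * np.square(weights).sum()  # cost proportional to w^2

w = np.array([0.5, -2.0, 0.1])
data_loss = 1.0                            # stand-in for the base loss value
total_loss = data_loss + l2_penalty(w)     # the penalty is added to the loss
print(total_loss)  # 1.00426
```

In Keras the same idea is expressed per layer via `kernel_regularizer=regularizers.l2(0.001)`.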
Dropout. Common and effective. A fraction (the dropout rate) of the output elements from a layer is zeroed during training; the rate is usually in [0.2, 0.5]. At test time no elements are removed, so the layer's output needs to be scaled down by the same rate to compensate.
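The scaling can equivalently go the other way: the "inverted dropout" variant below (the one most libraries implement) rescales the surviving activations up at training time, so the final model needs no change at all. This sketch uses my own helper name.

```python
import numpy as np

def dropout(layer_output, rate=0.5, seed=0):
    """Training-time inverted dropout: zero a fraction `rate` of the
    activations and scale the survivors up by 1/(1-rate)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(layer_output.shape) >= rate  # keep with prob 1-rate
    return layer_output * mask / (1.0 - rate)      # rescale survivors

a = np.ones((4, 8))
out = dropout(a, rate=0.5)
# entries are 0.0 (dropped) or 2.0 (kept and rescaled)
```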
Week 20 lectures cover Chollet: Chapter 6: Universal Workflow. They don't add anything to the book. The lectures describe 7 steps rather than 3 (an older version of the workflow?), but it's essentially the same. Deployment isn't covered.