Model Development (Chollet)
As presented in Chollet: Chapter 6: Universal Workflow
Preprocessing
First prepare the data. Preprocessing includes vectorization, normalization, and handling missing values.
Many preprocessing steps are domain specific (eg text, or image data). Those are covered in later chapters.
Vectorization. Inputs and targets for NNs are typically tensors of floating-point numbers. You first need to turn your raw data (text documents, images) into tensors through data vectorization.
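As a minimal sketch of vectorization (a toy bag-of-words example in plain Python, with a hypothetical two-document corpus, not Chollet's code):

```python
# Toy bag-of-words vectorization: turn raw text documents into
# fixed-size float count vectors.
docs = ["the cat sat", "the dog sat down"]

# Build a vocabulary mapping each unique token to an index.
vocab = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}

def vectorize(doc):
    """Count occurrences of each vocabulary token in the document."""
    vec = [0.0] * len(vocab)
    for tok in doc.split():
        vec[index[tok]] += 1.0
    return vec

vectors = [vectorize(d) for d in docs]
```

Image data works analogously: pixel arrays are already numeric and just need to be cast to float tensors.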
Normalization. It isn’t safe to feed NNs data that takes large values or data that is heterogeneous (e.g. data where one feature is in the range [0, 1] and another in [100, 200]). To make learning easier, your data should take small values and be homogeneous. Stricter normalization makes each feature independently have a mean of 0 and an SD of 1. This isn’t always necessary.
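A sketch of that stricter feature-wise standardization (synthetic data assumed):

```python
import numpy as np

# Standardize each feature to mean 0 and standard deviation 1.
rng = np.random.default_rng(0)
x = rng.uniform(100.0, 200.0, size=(1000, 3))  # heterogeneous, large values

mean = x.mean(axis=0)   # per-feature mean
std = x.std(axis=0)     # per-feature standard deviation
x_norm = (x - mean) / std
```

Note that mean and std should be computed on the training data only, then reused to normalize validation and test data.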
Missing Values. If the feature is categorical, it’s safe to create a new category that means ‘missing data’. If numerical, avoid imputing an arbitrary value like 0, because it could create a discontinuity; instead, consider replacing it with the average or median value of the feature in the dataset. If the training set is complete but there will be missing values in production, you may need to artificially generate training examples with missing values.
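A minimal sketch of median imputation for a numeric feature (toy values assumed):

```python
import numpy as np

# Replace missing (NaN) entries with the feature's median,
# rather than an arbitrary value like 0.
feature = np.array([1.0, 2.0, np.nan, 4.0, 100.0])

median = np.nanmedian(feature)                    # median, ignoring NaNs
filled = np.where(np.isnan(feature), median, feature)
```

The median is often preferable to the mean here because it is robust to outliers (like the 100.0 above).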
Evaluation Protocol
Pick an evaluation protocol, e.g. holdout validation, k-fold validation, nested k-fold, etc.
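A sketch of plain k-fold validation splitting (synthetic indices, not a full training loop): each sample serves as validation data exactly once across the k folds.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k folds."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(10, 5))
```

You would train a fresh model on each train split and average the k validation scores.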
Baseline
Your initial goal is to achieve statistical power: develop a small model that can beat a simple baseline. Focus on:
Feature engineering: filter out uninformative features and develop new features you think would be helpful.
Architecture priors: what type of model architecture will you use? RNN, Transformer, CNN? Is DL even the right approach?
A good-enough training config: choose a decent loss function, batch size, and learning rate.
Here are some good final layers and loss functions for typical problems:
| Problem type | Last-layer activation | Loss function |
| --- | --- | --- |
| Binary classification | sigmoid | binary_crossentropy |
| Multiclass, single-label classification | softmax | categorical_crossentropy |
| Multiclass, multi-label classification | sigmoid | binary_crossentropy |
| Regression to arbitrary values | None | mse |
| Regression to values in [0, 1] | sigmoid | mse or binary_crossentropy |
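As a numeric illustration of the sigmoid + binary_crossentropy pairing from the table (pure NumPy, not an actual Keras model; the logits and labels are made up):

```python
import numpy as np

def sigmoid(z):
    """Squash raw outputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

logits = np.array([2.0, -1.0, 0.5])   # hypothetical raw last-layer outputs
y_true = np.array([1.0, 0.0, 1.0])
probs = sigmoid(logits)               # probabilities in (0, 1)
loss = binary_crossentropy(y_true, probs)
```

Confidently correct predictions drive the loss toward 0, while confidently wrong ones are penalized heavily, which is why this pairing suits binary classification.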
Research prior art and templates for your problem, especially feature engineering and model architectures. If you can’t beat a baseline after trying many plausible models, it may be that you can’t predict your outputs based on your inputs.
Scale Up
Once you’ve established statistical power, the question becomes: is your model powerful enough? E.g. logistic regression has statistical power on MNIST but wouldn’t solve the problem well in general.
You have to develop a model that overfits. You can keep adding layers, make the layers bigger, train for more epochs.
Monitor training loss and validation loss, and watch for degradation of performance on the validation data. Once it appears, you’re overfitting.
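A sketch of spotting that degradation point from the loss curves (the loss values here are hypothetical):

```python
# Training loss keeps falling, but validation loss turns upward:
# the epoch with minimum validation loss marks where overfitting begins.
train_loss = [0.90, 0.60, 0.40, 0.28, 0.20, 0.15]
val_loss   = [0.92, 0.65, 0.50, 0.45, 0.48, 0.55]

best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])
overfitting_after = best_epoch < len(val_loss) - 1  # val loss rose afterwards
```

In practice an early-stopping callback automates this, halting training once validation loss stops improving.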
Regularize and Tune
Once you’ve got statistical power and are overfitting, your goal becomes to maximize generalization performance. This will take the most time. You repeatedly tweak the model, train it, evaluate on the validation set, and tweak again. Try the following:
Try different architectures, add or remove layers.
Add dropout.
If your model is small, add L1 or L2 regularization.
Try different hyperparameters (eg number of units per layer, learning rate) to find the optimal config.
Optionally, iterate on data curation or feature engineering. Collect and annotate more data, develop better features, remove uninformative features.
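A minimal sketch of the dropout idea from the list above, as inverted dropout at training time in NumPy (a rate of 0.5 and all-ones activations are assumed for illustration):

```python
import numpy as np

# Inverted dropout: randomly zero activations during training and
# rescale the survivors so the expected activation is unchanged.
rng = np.random.default_rng(42)
rate = 0.5
activations = np.ones((4, 8))

mask = rng.random(activations.shape) >= rate   # keep ~50% of units
dropped = activations * mask / (1.0 - rate)    # rescale kept activations
```

At inference time, dropout is simply disabled; the rescaling during training is what keeps the two regimes consistent.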
You can automate a lot of this with software, eg KerasTuner.
Be wary of information leakage: constant, systematic evaluation on the validation set gradually leaks information about it into the model.
Once you’ve got a good model config you can train the final model on all the data available and evaluate it one last time on the test set.
If your performance tanks on the final test set, consider switching to a more reliable evaluation protocol.