Alex's Notes

Task Definition (Chollet)

As presented in Chollet, Chapter 6: The Universal Workflow of Machine Learning

You have to develop a deep understanding of the context of the problem: why is your customer trying to solve it? What value will they derive from the solution? How will the model be used, and how will it fit into the customer's business processes? What data is available, or could be collected? What kind of ML task could map to the business problem?

Frame the Problem

What will the input data be, and what are you trying to predict? You can only learn to predict something if there is training data, so data availability is usually the limiting factor. You'll often need to collect and label new data.

What type of ML task are you facing? Binary classification? Scalar or vector regression? Multiclass or multilabel classification? Ranking? Do you even need ML, or would good old statistics do? E.g. music recommendation might best be done through collaborative filtering (matrix factorization) rather than DL.
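A minimal sketch of what matrix-factorization collaborative filtering looks like, on a made-up toy ratings matrix (the data, the factor dimension `k`, and the learning-rate/regularization values are all illustrative assumptions, not from the chapter): factor the observed user-item ratings into low-rank user and item embeddings, then use their dot products to fill in the unrated cells.

```python
import numpy as np

# Hypothetical toy ratings matrix: rows = users, cols = items, 0 = unrated.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

rng = np.random.default_rng(0)
k = 2                                              # latent factor dimension
U = rng.normal(scale=0.1, size=(R.shape[0], k))    # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))    # item factors

observed = np.argwhere(R > 0)                      # only fit the rated cells
lr, reg = 0.01, 0.01
for _ in range(2000):                              # SGD over observed ratings
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

pred = U @ V.T                                     # predictions for every cell,
                                                   # including the unrated ones
```

No neural network involved; for sparse explicit-feedback data like this, a factorization model is often the stronger baseline.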

What systems are already in place to solve the problem? People? Hand-coded rules?

What are your constraints? E.g. is your email app end-to-end encrypted, so the data can't leave the device? Are there latency constraints?

Once you’ve done this you should have a sense of your inputs and targets. You’re hypothesizing that the inputs can predict the targets, and that the data will be available. Not all problems are solvable by ML: if you’re trying to predict future stock prices from price history, you’ll fail, because there isn’t enough information in price history to predict future prices.

Collect a Dataset

This is the most arduous, time-consuming, and costly part of most ML projects.

A model’s ability to generalize depends entirely on the properties of the data it’s trained on. Data is worth the investment.

Your data annotation process determines the quality of your targets. Consider whether you should annotate yourself, use crowdsourcing, or use a data-labeling service.

Consider constraints such as whether the labelers need to be subject-matter experts: annotating CT scans requires a medical degree; labeling dogs and cats, not so much. Productive data-annotation software can really help.

It’s vital that training data be representative of the production data. If possible collect data directly from the environment where your model will be used.

Be aware of concept drift. This occurs when the properties of production data change over time, causing model accuracy to decay. This is particularly an issue in adversarial contexts like fraud detection, where fraud patterns change almost daily. Dealing with fast concept drift requires constant data collection, annotation, and retraining.
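One simple way to watch for drift is to compare the distribution of a feature at training time against its distribution in production. Here's a hedged sketch using a hand-rolled two-sample Kolmogorov-Smirnov statistic (the synthetic "drifted" data and the 0.1 alarm threshold are assumptions for illustration, not values from the chapter):

```python
import numpy as np

def ks_statistic(a, b):
    """Max distance between the two empirical CDFs (two-sample KS statistic)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(42)
train_feat = rng.normal(loc=0.0, size=2000)   # feature at training time
prod_feat = rng.normal(loc=0.5, size=2000)    # same feature in production, shifted

drifted = ks_statistic(train_feat, prod_feat) > 0.1   # crude alarm threshold
```

In practice you'd run a check like this per feature on a schedule, and treat an alarm as a signal to collect fresh data and retrain.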

Understand your Data

It’s a mistake to treat your data like a black box. Explore and visualize it to gain insight into what makes it predictive.

If your data is images or text, look at some samples.

If numerical, plot the histogram of feature values to get a feel for the distribution and range.
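For a quick look at a numerical feature's distribution, `numpy.histogram` plus a text bar chart is enough (the log-normal "feature" here is made-up data standing in for whatever skewed column you're inspecting):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # hypothetical skewed feature

counts, edges = np.histogram(values, bins=10)
for count, left, right in zip(counts, edges, edges[1:]):
    # Crude ASCII histogram: bar length proportional to bin count.
    print(f"[{left:7.2f}, {right:7.2f}) {'#' * (50 * count // counts.max())}")
```

A skewed shape like this one is exactly the kind of thing you'd want to know before choosing a normalization scheme.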

If location, plot it on a map.

Are there any missing values?

If classification, print the number of instances in each class. You might need to account for any imbalances.
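The missing-value and class-count checks above are a couple of lines in pandas. The dataframe here is hypothetical, just to show the shape of the check:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing values and an imbalanced label column.
df = pd.DataFrame({
    "age":    [34, 51, np.nan, 29, 44, 38],
    "income": [40_000, 72_000, 55_000, np.nan, 61_000, 48_000],
    "label":  ["no", "no", "no", "no", "yes", "no"],
})

print(df.isna().sum())              # missing values per column
print(df["label"].value_counts())   # instances per class -> reveals imbalance
```

A 5-to-1 class ratio like this one is your cue to think about resampling, class weights, or an imbalance-aware metric.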

Check for target leakage: the presence of features in your data that provide information about the targets but won’t be available in production. E.g. you’re training on medical text records to predict cancer, and a record includes “this person has been diagnosed with cancer”. Will all the features you’re training on be available in production?
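One crude leakage screen (my own sketch, not a method from the chapter) is to check each feature's correlation with the target: a feature that predicts the target almost perfectly is usually too good to be true. The data and the 0.95 flag threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
y = rng.integers(0, 2, size=n)                 # binary target
honest = rng.normal(size=n) + 0.3 * y          # weakly predictive feature
leaky = y + rng.normal(scale=0.01, size=n)     # near-copy of the target (leak!)

flags = {}
for name, feat in [("honest", honest), ("leaky", leaky)]:
    r = np.corrcoef(feat, y)[0, 1]
    flags[name] = abs(r) > 0.95                # suspiciously predictive?
    print(f"{name}: corr with target = {r:.3f}")
```

Correlation only catches linear, single-feature leaks; the real test is still asking, for every feature, whether it will exist at prediction time.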

Choose a Success Measure

Define what you mean by success. Accuracy? Precision and recall? Customer retention? It should directly align with higher-level goals, such as the business success of your customer.

For balanced classification a common measure is the area under the receiver operating characteristic curve, abbreviated ROC AUC. For imbalanced problems, use a weighted form of it, or precision and recall.
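ROC AUC has a nice probabilistic reading: the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal sketch computing it directly from that definition (O(n²) pairwise comparison, so for illustration only; in practice you'd reach for a library function like scikit-learn's `roc_auc_score`):

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC = P(score of a random positive > score of a random negative),
    with tied scores counting half. Pairwise version, O(n^2)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # -> 0.75
```

Note the metric only uses the ranking of the scores, which is why it's a reasonable default for balanced binary classification regardless of how the scores are calibrated.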