Model Deployment (Chollet)
As presented in Chollet: Chapter 06: Universal Workflow
Stakeholder Engagement
The actual system you deliver is only half the picture; the other half is setting appropriate expectations at the outset. People often have unrealistic expectations of AI systems. Consider showing stakeholders the model's failure modes (what incorrectly classified examples look like).
Avoid abstractions like “the model has 98% accuracy” (which people mentally round up to 100%). Talk instead about false negative and false positive rates. For example: “The fraud detection model will have a 5% false negative rate and a 2.5% false positive rate. Every day, an average of 200 valid transactions will be flagged as fraudulent and sent for manual review, an average of 14 fraudulent transactions will be missed, and an average of 266 fraudulent transactions will be correctly caught.”
Discuss key parameters with stakeholders, e.g. the probability threshold at which a transaction should be flagged, and the trade-offs in that decision.
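The arithmetic behind an example like this is worth working through explicitly with stakeholders. A quick back-of-envelope sketch; the daily transaction volumes here are assumed figures, chosen to be consistent with the rates quoted above:

```python
# Back-of-envelope arithmetic for the fraud example. The daily volumes
# are assumptions, picked to match the quoted rates and counts.
legit_per_day = 8_000   # assumed legitimate transactions per day
fraud_per_day = 280     # assumed fraudulent transactions per day
false_positive_rate = 0.025
false_negative_rate = 0.05

flagged_valid = round(legit_per_day * false_positive_rate)  # manual-review queue
missed_fraud = round(fraud_per_day * false_negative_rate)   # fraud that slips through
caught_fraud = fraud_per_day - missed_fraud                 # fraud correctly caught

print(flagged_valid, missed_fraud, caught_fraud)  # 200 14 266
```

Raising or lowering the flagging threshold moves the two error rates in opposite directions, which is exactly the trade-off to put in front of stakeholders.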
Shipping a Model
You might need to export the model to an environment other than Python. If the model will only be used for inference, you can often optimize it to be smaller and faster in production. There are several deployment options:
REST API
Probably the most common option: you install TensorFlow on a server and query its predictions through a REST API. You could build this with your own web framework, or use TF's dedicated library for shipping models as APIs: TensorFlow Serving.
Use the API option when the consuming app will have reliable internet access (careful with mobile devices and airplane mode or low-connectivity areas), and when latency requirements aren't strict (the round trip takes around 500 ms on average). Be careful with sensitive data too: it will need to be visible on the server even if you encrypt it in transit.
Google offers a hosted service, now called Vertex AI, which lets you upload a model and get a deployed endpoint.
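As a concrete sketch, querying a model served by TensorFlow Serving over REST can look like this. The default port 8501 and the `/v1/models/<name>:predict` JSON endpoint follow TF Serving's documented REST API; the model name `fraud` and the feature values are made-up placeholders:

```python
import json
import urllib.request

def predict(instances, host="localhost:8501", model="fraud"):
    """Send a batch of inputs to a TF Serving REST endpoint.

    `model` ("fraud") is a hypothetical model name; TF Serving expects
    a JSON body of the form {"instances": [...]}.
    """
    payload = json.dumps({"instances": instances}).encode("utf-8")
    request = urllib.request.Request(
        f"http://{host}/v1/models/{model}:predict",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["predictions"]

# With a server running, predict([[132.5, 0, 1, 0.82]]) would return one
# prediction per instance, e.g. a fraud probability to compare against
# your chosen threshold.
```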
On Device Deployment
Sometimes you need to deploy directly on a device, e.g. on a camera that recognizes faces. Use this when you have strict latency needs, or when the device will have low connectivity.
Your model needs to be small enough to run in memory on your target device. You can use TF's Model Optimization Toolkit to shrink it.
You might need to trade off some performance for runtime efficiency, so accuracy might not be the most important metric here.
Also use on-device deployment when the data is sensitive and can't be shipped back to a server.
The go-to option for deployment on mobile or embedded devices (Raspberry Pi, etc.) is TensorFlow Lite.
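A minimal conversion sketch using the `tf.lite.TFLiteConverter` API; the tiny model here is just a stand-in for a real trained one:

```python
import tensorflow as tf

# Stand-in model; substitute your trained Keras model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable default size optimizations
tflite_model = converter.convert()  # a bytes blob you ship to the device

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```

On the device, the `.tflite` file is loaded by the TF Lite interpreter rather than full TensorFlow.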
Browser Deployment
Use this when you want to offload compute to the user's machine, when the input data needs to stay on the user's device, when your app has strict latency requirements, or when you need the app to work without connectivity.
You can only do this if your model is small enough not to hog the user's resources, and if the model itself isn't sensitive, since its weights will be shipped to the user.
You can use TensorFlow.js to import a Keras model into JavaScript.
Model Optimization
This is especially important when deploying to smartphones or embedded devices, or when you have low-latency requirements. Optimize before importing into TF.js or TF Lite. Two popular optimization techniques are:
Weight pruning: you can often reduce the number of parameters in a model's layers by keeping only the most significant ones. This costs a little in performance metrics but often substantially reduces the model's memory and compute footprint, and you can choose the trade-off you want.
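A toy illustration of the idea, using magnitude-based pruning on a flat list of weights (the real TF pruning API works per layer, during or after training):

```python
def prune(weights, keep_fraction):
    """Zero out all but the largest-magnitude weights (toy magnitude pruning)."""
    n_keep = max(1, int(len(weights) * keep_fraction))
    # The smallest magnitude we are still willing to keep:
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

pruned = prune([0.9, -0.05, 0.4, 0.01, -0.7, 0.02], keep_fraction=0.5)
print(pruned)  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The zeros can then be stored and computed sparsely, which is where the memory and compute savings come from.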
Weight quantization: you can quantize weights from single-precision floats (float32) to 8-bit signed integers (int8) to get an inference-only model that's a quarter the size but stays near the accuracy of the original.
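The core idea, sketched on a single weight vector (the real toolkit does this per layer, with calibration):

```python
# Symmetric int8 quantization of one weight vector: store one float
# scale plus int8 values instead of float32 values (roughly 4x smaller).
weights = [0.81, -0.54, 0.02, -1.27, 0.33]

scale = max(abs(w) for w in weights) / 127       # map the largest weight to +/-127
quantized = [round(w / scale) for w in weights]  # int8 values in [-128, 127]
dequantized = [q * scale for q in quantized]     # what inference actually sees

max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
assert max_error <= scale / 2 + 1e-12  # error bounded by half a quantization step
```

The accuracy stays close to the original precisely because the reconstruction error per weight is bounded by half a quantization step.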
TF has a toolkit for doing this (the TensorFlow Model Optimization Toolkit) that plays nicely with Keras.
Monitoring and Maintenance
Once you’ve deployed, you need to keep monitoring the model’s behaviour: its performance on new data, its interaction with the rest of the app, and its eventual impact on business metrics.
Is user engagement up or down? Consider randomized A/B testing to isolate the impact of the model.
If possible, do manual audits on production data. If not, try surveys or another way of gathering feedback.
As soon as the model is shipped, start preparing its replacement: watch for changes in production data, and keep collecting and annotating new data.