Best practices when deploying an MXNet model

Hello all. I am trying to deploy my MXNet-based application to the web, and I want it to handle as many web requests as possible. Each request runs two Inception v3 models to classify an image. Web servers typically use multiple threads to handle several requests at once, but when I load test my application, I notice that requests take a long time to complete.

What are some good ways to increase performance when deploying an MXNet model to the web? Should I create a pool of Module objects to handle classification? Should I use one Module object for all requests? Which objects should I cache instead of re-creating them on each request?


Hi @qheaden,

Are you aware of the MXNet Model Server project (MMS)? I see they have a section on production deployments.

Production Deployments
When launched directly, MMS uses a standalone Flask server. This is handy for testing and development. But for production deployments, we recommend using Gunicorn which should provide lower latency, higher throughput, and more efficient use of memory.

This project includes Dockerfiles to build containers recommended for production deployments. These containers demonstrate how to set up a production stack consisting of nginx, gunicorn, and MMS. The basic usage can be found on the Docker readme.

Another technique that can help you obtain high throughput is to batch up requests and process a whole batch at once rather than individual requests one after another. You could process a batch when the batch size reaches a certain number or after a certain amount of time, whichever comes first. You will have the extra complexity of keeping track of which sample belongs to which request, though.
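To make the batching idea concrete, here is a minimal sketch using only the Python standard library. It is not taken from MMS or MXNet: `MicroBatcher`, `predict_batch`, `max_batch_size` and `max_wait_s` are all hypothetical names, and `predict_batch` stands in for whatever function runs your model on a list of inputs and returns one result per input, in the same order. Each request thread gets a `Future` back, which is how the samples are tracked.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Collects single requests and hands them to predict_batch in groups."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.01):
        # predict_batch is a hypothetical callable: list of inputs -> list of outputs.
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, sample):
        """Called from a request thread; returns a Future for this sample."""
        future = Future()
        self._queue.put((sample, future))
        return future

    def _run(self):
        while True:
            # Block until the first sample arrives, then start the timer.
            batch = [self._queue.get()]
            deadline = time.monotonic() + self._max_wait_s
            # Keep collecting until the batch is full or the time is up.
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            samples = [sample for sample, _ in batch]
            futures = [future for _, future in batch]
            try:
                results = self._predict_batch(samples)
            except Exception as exc:
                # Report the failure to every request in this batch.
                for future in futures:
                    future.set_exception(exc)
                continue
            # Results are assumed to come back in the same order as the inputs.
            for future, result in zip(futures, results):
                future.set_result(result)
```

A request handler would then call something like `batcher.submit(image).result()` and block until its batch has been processed. Because a single worker thread calls the model, this also avoids sharing one model object across request threads; how `predict_batch` wraps your Module objects is left out here.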


Hi @qheaden,

This might also be useful if you are not already tuning with these methods.