Huge performance decrease from quantization

I used the code from https://github.com/apache/incubator-mxnet/pull/13715, and I'm seeing a huge performance decrease after quantizing my model.

Tested on Windows 10 with CUDA 10 and cuDNN 7 on a Titan X (Pascal), using the pre-release pip build mxnet-cu100.

I think we need to do more testing on quantization, or maybe I just misunderstood the documentation.

BTW, it can be reproduced with the mxnet Python package, and you might need to run it multiple times to get a stable measurement.
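
For context, a minimal sketch of the kind of quantization call involved, roughly along the lines of the official example. The checkpoint names, context, and calibration settings here are placeholders, not the exact ones from my model:

```python
import logging
import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# Load the trained fp32 model (prefix and epoch are placeholders).
sym, arg_params, aux_params = mx.model.load_checkpoint('model-fp32', 0)

# Quantize to int8. calib_mode='none' skips calibration for brevity;
# the official example uses 'naive' or 'entropy' with a calibration dataset.
qsym, qarg_params, aux_params = quantize_model(
    sym=sym, arg_params=arg_params, aux_params=aux_params,
    ctx=mx.gpu(0),
    excluded_sym_names=None,
    calib_mode='none',
    quantized_dtype='int8',
    logger=logging)

# Save the quantized model for later benchmarking.
mx.model.save_checkpoint('model-int8', 0, qsym, qarg_params, aux_params)
```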

Hi @kice,

Many thanks for raising this issue. Could you provide a few more details about how you added quantisation? And to confirm, you're seeing inference time for a single sample double when you add quantisation? What changed from when you had a 2x speedup with quantisation? Or does it have very high variance?

Cheers,

Thom

I did the quantisation following the official example: https://github.com/apache/incubator-mxnet/tree/master/example/quantization.

Yes, I got a 2x speedup on a single run, but that might just be because on the first run none of the resources were loaded yet, while the quantized run had everything ready to go.

In a more recent test, the int8 quantized model took twice the run time of the fp32 model. If you need a model for testing, I can upload one for comparison, including the original fp32 model and the int8 quantized version.
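
To rule out the first-run warm-up effect, here is a rough timing sketch, assuming models saved as above; checkpoint prefixes, input shape, and context are placeholders. A few warm-up iterations are discarded so one-off loading costs don't skew the average:

```python
import time
import mxnet as mx

def bench(prefix, shape=(1, 3, 224, 224), ctx=mx.gpu(0), warmup=10, runs=100):
    """Average forward latency of a saved model, excluding warm-up runs."""
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, 0)
    mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
    mod.bind(for_training=False, data_shapes=[('data', shape)])
    mod.set_params(arg_params, aux_params)
    batch = mx.io.DataBatch([mx.nd.random.uniform(shape=shape, ctx=ctx)], [])

    # Warm-up iterations: not timed.
    for _ in range(warmup):
        mod.forward(batch, is_train=False)
        mod.get_outputs()[0].wait_to_read()

    start = time.time()
    for _ in range(runs):
        mod.forward(batch, is_train=False)
        mod.get_outputs()[0].wait_to_read()
    return (time.time() - start) / runs

print('fp32:', bench('model-fp32'))
print('int8:', bench('model-int8'))
```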

Can you please share a quantized model for testing?
For reasons unknown to me, the quantized MobileNet model takes 4 times longer for prediction than the standard model.

Most likely your hardware does not support INT8 computation. You need at least a Skylake CPU.
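
If you are unsure whether your MXNet build can even use int8 kernels on CPU, something like the following can help check. This assumes a recent MXNet release that ships mxnet.runtime.Features; int8 inference on CPU relies on the MKL-DNN backend:

```python
import mxnet as mx
from mxnet.runtime import Features

# Print which backends this MXNet build was compiled with.
features = Features()
print('MXNet version :', mx.__version__)
print('MKLDNN enabled:', features.is_enabled('MKLDNN'))
print('CUDA enabled  :', features.is_enabled('CUDA'))
print('CUDNN enabled :', features.is_enabled('CUDNN'))
```

Whether the CPU itself has fast int8 instructions (e.g. AVX-512/VNNI) still depends on the processor generation, which this check does not cover.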

There is some data on this in the blog post.