Accelerating FP16 Inference on Volta

I know there have previously been some compile flags required to get FP16 acceleration on the Pascal-generation chips. Is this still the case with Volta? I’ve recently been testing inference with an FP16 model and I’m not seeing any speedup relative to the same model with FP32 parameters.

I’ve set USE_CUDA / USE_CUDNN = 1. I haven’t modified the gpu archs / sm flags in the Makefile. I’m building from the tip of master, commit 9f97dac76e43b2ca0acb09a4ff96d416e9edea60.
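As a point of comparison, something like the following sketch (arbitrary shapes and repeat count, plain NDArray ops rather than the actual model) should show whether raw FP16 GEMM is faster than FP32 at all on a given build:

```python
import time
import mxnet as mx

ctx = mx.gpu(0)

for dtype in ('float32', 'float16'):
    # Large square GEMM; big enough that compute, not launch overhead, dominates.
    a = mx.nd.ones((4096, 4096), ctx=ctx, dtype=dtype)
    b = mx.nd.ones((4096, 4096), ctx=ctx, dtype=dtype)
    mx.nd.dot(a, b).wait_to_read()   # warm-up
    start = time.time()
    for _ in range(20):
        c = mx.nd.dot(a, b)
    mx.nd.waitall()                  # block until all queued GPU work is done
    print(dtype, time.time() - start)
```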


Check out http://docs.nvidia.com/deeplearning/sdk/pdf/Training-Mixed-Precision-User-Guide.pdf


No special flags are needed when compiling MXNet. However, there is a flag that is needed during training: --dtype float16
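Roughly speaking, that flag casts the network parameters and the input data to float16. A minimal sketch of the same idea using the Gluon API (the Dense layer, shapes, and context below are only placeholders for a real model):

```python
import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)

# Placeholder network standing in for a real model.
net = gluon.nn.Dense(10)
net.initialize(ctx=ctx)
net.cast('float16')          # cast the parameters to FP16

# The input must be cast to FP16 as well so the dtypes match.
x = mx.nd.zeros((32, 512), ctx=ctx, dtype='float16')
y = net(x)
print(y.dtype)               # numpy.float16
```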

It would be helpful if you could give more information about:

  • how the model is trained
  • how the model is used

Specifically, see Section 5.3.1, Running FP16 Training on MXNet, for the compile-time flags and for how to verify that MXNet is training in FP16.

See also:
[1] http://on-demand.gputechconf.com/gtc/2017/presentation/s7218-training-with-mixed-precision-boris-ginsburg.pdf
[2] https://github.com/apache/incubator-mxnet/issues/7996

Exactly what I’m after, many thanks.

In my case I’m training NMT models, and I believe the --dtype flag applies to some of the computer vision examples. This brings up a good point, though: if we expose that flag in Sockeye, we’ll try to keep it consistent with the computer vision examples.

My build steps are documented here: https://github.com/awslabs/sockeye/tree/master/tutorials/wmt
but really I’m just after any compile flags that are required. Those compile flags are covered in the docs linked by dom.

But how do you run FP16 inference on a batch of data with the C++ API?
