Marginal performance improvement with Titan V (Volta) + CUDA 9 + CUDNN 7

Raised the issue here: https://github.com/apache/incubator-mxnet/issues/9087

We’ve received a couple of NVIDIA Titan V (Volta) cards and have been benchmarking with and without half precision (dtype = float16); we’re only seeing a marginal performance difference with half precision enabled. We also tried a Titan X (Pascal), although we didn’t expect half precision to work well on the Pascal architecture.

This was tested with release 1.0.0

Running on a machine with CUDA 9.0 + CUDNN 7.0.5

To reproduce, run one epoch of the ResNet CIFAR-10 training script:

time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0

For the Titan V (Volta) we’re getting:

~2700 samples/sec with half precision on, and ~2900 samples/sec with it off, which I believe should be the opposite, if anything.

We’re also not seeing a large speedup going from the Titan X (Pascal) to the Titan V (Volta).

For the Titan X (Pascal) we’re getting:

~2600 samples/sec with half precision on, and ~2228 samples/sec with it off.

The relative improvement from half precision is actually much better on the Titan X (Pascal).

I got similar results, and found an explanation of this behavior, quoted here:

To enable it, you need to set the datatype parameter to CUDNN_DATA_HALF when calling
cudnnSetConvolutionNdDescriptor or cudnnSetConvolution2dDescriptor_v5.
Of course, the input and output tensors also need to be of datatype CUDNN_DATA_HALF.
If you call cudnnSetConvolutionNdDescriptor with datatype CUDNN_DATA_FLOAT but the tensors are of
type CUDNN_DATA_HALF, then the inputs are converted from fp16 → fp32, the math is done in fp32, and the output is converted back to fp16.
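
To make the distinction concrete, here is a rough sketch (my own code, not MXNet’s) of a cuDNN 7 convolution set up for true fp16 math, using the cuDNN 7 name for the 2d descriptor call; the descriptor handles, shapes, and the 3x3/pad-1 geometry are just placeholder assumptions, and error checking is omitted:

#include <cudnn.h>

// Sketch only: declare fp16 I/O tensors and request fp16 compute, so cuDNN
// can select real half-precision (and, on Volta, Tensor Core) kernels.
void configure_fp16_conv(cudnnTensorDescriptor_t in_desc,
                         cudnnTensorDescriptor_t out_desc,
                         cudnnConvolutionDescriptor_t conv_desc,
                         int n, int c, int h, int w, int k) {
  // Input and output tensors are declared as CUDNN_DATA_HALF.
  cudnnSetTensor4dDescriptor(in_desc,  CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, n, c, h, w);
  cudnnSetTensor4dDescriptor(out_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF, n, k, h, w);

  // The last argument is the compute type: CUDNN_DATA_HALF asks for true fp16
  // math, while CUDNN_DATA_FLOAT gives pseudo-fp16 (fp16 I/O, fp32 math).
  cudnnSetConvolution2dDescriptor(conv_desc,
                                  /*pad_h=*/1, /*pad_w=*/1,
                                  /*stride_h=*/1, /*stride_w=*/1,
                                  /*dilation_h=*/1, /*dilation_w=*/1,
                                  CUDNN_CROSS_CORRELATION,
                                  CUDNN_DATA_HALF);

  // In cuDNN 7, Tensor Core kernels on Volta additionally require this opt-in.
  cudnnSetConvolutionMathType(conv_desc, CUDNN_TENSOR_OP_MATH);
}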

Additionally, I think the problem exists in the following lines of
mxnet/src/operator/convolution.cu (line 63):

// On fp16-I/O instances, use fp32 compute (i.e. pseudo-fp16).              
int compute_type = (dtype == mshadow::kFloat16) ? mshadow::kFloat32 : dtype;

If compute_type is changed to dtype directly, I think the performance problem in FP16 can be resolved.
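
For reference, here is the change I have in mind against the lines quoted above; this is only a sketch and untested:

// Proposed change (sketch, untested): use the I/O dtype as the compute type,
// so fp16 tensors run true fp16 math instead of always falling back to
// pseudo-fp16 (fp16 I/O with fp32 accumulation).
int compute_type = dtype;

// A safer variant might keep fp32 compute on GPUs without fast fp16 math,
// for example by checking the device's compute capability before choosing fp16.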

Thanks liangfu.

Are you suggesting to just do a straight assignment of dtype to compute_type, without the check? Sounds logical. Good catch. I can give it a try and will report back.

Elie