Is there a flag to enable tensor cores? No speedup from FP16

On my RTX 2080 Ti, dot products are no faster with FP16 than with FP32 (and the FP16 case is about 4 times slower than the equivalent PyTorch code). Is there some flag or environment variable that I'm missing?

import mxnet as mx
import numpy as np
import time

n = 2**14

ctx = mx.gpu(0)
dtype = np.float16

with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)

tic = time.time()
for _ in range(100):
    mx.nd.dot(a, b, out=c)
    res = float(c[0, 0].asscalar())  # "use" the result: reading one element forces the async dot to finish
print(time.time() - tic)

(This prints about 60 seconds for either dtype.)
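For what it's worth, here is a restructured version of the same benchmark, sketched under the assumption that it helps to warm up once and synchronize with mx.nd.waitall() only at the end, so the measurement isn't dominated by first-call setup cost (same shapes and names as above):

import time
import mxnet as mx
import numpy as np

n = 2**14
ctx = mx.gpu(0)

for dtype in (np.float32, np.float16):
    a = mx.nd.zeros((n, n), dtype=dtype, ctx=ctx)
    b = mx.nd.zeros((n, n), dtype=dtype, ctx=ctx)
    c = mx.nd.zeros((n, n), dtype=dtype, ctx=ctx)

    mx.nd.dot(a, b, out=c)   # warm-up: pays cuBLAS handle creation / first-call cost
    mx.nd.waitall()          # block until all queued GPU work has finished

    tic = time.time()
    for _ in range(100):
        mx.nd.dot(a, b, out=c)
    mx.nd.waitall()          # synchronize once at the end instead of per iteration
    print(dtype, time.time() - tic, "seconds")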

I tried these environment variables (after looking at the source code), but they make no difference to the speed. cuDNN reports version 7600 (i.e. 7.6.0), apparently.

CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout MXNET_CUDA_ALLOW_TENSOR_CORE=1 MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION=1
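(If you prefer setting these from inside Python rather than in the shell, it is probably safest to export them before importing mxnet, since some of them may be read as the library initializes. A minimal sketch:)

import os

# Set before importing mxnet, in case any of these are read at library start-up.
os.environ["MXNET_CUDA_ALLOW_TENSOR_CORE"] = "1"
os.environ["MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION"] = "1"
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "stdout"

import mxnet as mx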

I tried your code on my Volta: float16 gave me 8.9 s while float32 gave me 61.7 s. You might need to profile it.
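If you want to see which kernels the dot call actually launches, one way is MXNet's built-in profiler. A minimal sketch (the output filename is just a placeholder):

import mxnet as mx
import numpy as np

mx.profiler.set_config(profile_all=True, filename="dot_profile.json")

n = 2**14
ctx = mx.gpu(0)
a = mx.nd.zeros((n, n), dtype=np.float16, ctx=ctx)
b = mx.nd.zeros((n, n), dtype=np.float16, ctx=ctx)
c = mx.nd.zeros((n, n), dtype=np.float16, ctx=ctx)

mx.nd.dot(a, b, out=c)        # warm-up outside the profiled region
mx.nd.waitall()

mx.profiler.set_state("run")
for _ in range(10):
    mx.nd.dot(a, b, out=c)
mx.nd.waitall()
mx.profiler.set_state("stop")
mx.profiler.dump()            # writes dot_profile.json, viewable in chrome://tracing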
On a 2080 Ti with MXNet 1.6:

  • FP16: 9.2 s
  • FP32: 56.56 s
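A rough sanity check on those timings (back-of-the-envelope only; the peak figures in the comment are the commonly quoted 2080 Ti specs, not measurements):

n = 2**14
iters = 100
flops = 2 * n**3 * iters                  # one n x n matmul costs roughly 2*n^3 FLOPs

for label, seconds in [("FP16", 9.2), ("FP32", 56.56)]:
    print(label, round(flops / seconds / 1e12, 1), "TFLOPS")

# FP16 comes out around 96 TFLOPS, far above the ~27 TFLOPS a 2080 Ti can reach
# in FP16 without tensor cores, so at that speed the tensor cores are in use.
# FP32 comes out around 16 TFLOPS, i.e. in the ballpark of the card's plain FP32 rate.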

How did you install MXNet? Via pip, or did you build from source?

Build from source. Or you can try the NVIDIA NGC MXNet container.
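If you are unsure what your currently installed build enables, recent MXNet versions expose the compile-time feature list. This doesn't show tensor-core usage directly, but it does confirm whether a given pip wheel or source build has CUDA and cuDNN compiled in:

import mxnet as mx
from mxnet.runtime import Features

features = Features()
print(mx.__version__)
print("CUDA:", features.is_enabled("CUDA"), "CUDNN:", features.is_enabled("CUDNN"))
# print(features) lists every compile-time feature of the installed build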