Is there a flag to enable tensor cores? No speedup from FP16

On my RTX 2080 Ti, large matrix products (`mx.nd.dot`) are no faster in FP16 than in FP32, and the FP16 case is about 4 times slower than the equivalent PyTorch call. Is there some flag or environment variable that I'm missing?

import mxnet as mx
import numpy as np
import time

n = 2**14

ctx = mx.gpu(0)
dtype = np.float16

with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)

tic = time.time()
for _ in range(100):
    mx.nd.dot(a, b, out=c)
    res = float(c[0, 0].asscalar())  # "use" the result (blocks until c is computed)
print(time.time() - tic)

(This prints roughly 60 seconds for either dtype.)
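One thing worth noting: MXNet enqueues operators asynchronously, so a timing loop can mislead unless the clock stops only after the device is idle. The `asscalar()` call above does force a sync each iteration, but a cleaner pattern is to synchronize explicitly once, after the loop. A minimal sketch of such a harness, written against a generic `sync` callback so it has no GPU dependency (in MXNet you would pass `mx.nd.waitall`):

```python
import time

def benchmark(fn, sync, iters=100, warmup=5):
    """Time `iters` calls to `fn`, synchronizing before starting and
    after finishing the measured loop.

    `sync` should block until all queued device work is done
    (e.g. mx.nd.waitall in MXNet); here it is a plain callback so the
    harness itself is framework-agnostic.
    """
    for _ in range(warmup):  # warm-up runs absorb lazy init / autotuning
        fn()
    sync()
    tic = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # don't stop the clock until the device is idle
    return time.perf_counter() - tic

# Usage with MXNet (assuming a, b, c defined as in the post):
#   elapsed = benchmark(lambda: mx.nd.dot(a, b, out=c), mx.nd.waitall)
```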

I tried the environment variables I found in the source code, but they make no difference in terms of speed. cuDNN is version 7600, apparently.
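For what it's worth, a sketch of how such switches would be applied, assuming the variables in question are the tensor-core ones listed in MXNet's environment-variable documentation (`MXNET_CUDA_ALLOW_TENSOR_CORE` and `MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION`); they must be set before the MXNet engine initializes:

```python
import os

# Assumption: these are the tensor-core switches from MXNet's documented
# environment-variable list; set them before `import mxnet` so the
# engine picks them up.
os.environ["MXNET_CUDA_ALLOW_TENSOR_CORE"] = "1"
os.environ["MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION"] = "1"
# import mxnet as mx  # import only after the environment is set
```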


Tried your code on my Volta: float16 gave me 8.9s while float32 gave me 61.7s. You might need to profile it.
On 2080Ti with MXNet 1.6:

  • FP16: 9.2s
  • FP32: 56.56s
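As a sanity check, those times convert to achieved throughput: a 16384×16384 matrix product costs 2n³ FLOPs, and the loop runs 100 of them:

```python
n, iters = 2**14, 100
flops = 2 * n**3 * iters  # 2n^3 FLOPs per matrix product, 100 products

tflops_fp16 = flops / 9.2 / 1e12    # ~95.6 TFLOPS
tflops_fp32 = flops / 56.56 / 1e12  # ~15.6 TFLOPS
print(tflops_fp16, tflops_fp32)
```

The FP16 figure is roughly in line with published tensor-core peak numbers for the 2080 Ti, which suggests tensor cores really are engaged in that build, while the roughly 6× gap over FP32 matches the expected benefit.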

How did you install MXNet? By pip, or built from source?

Built from source. Or you can try the NVIDIA NGC MXNet container.