Is there a flag to enable tensor cores? No speedup from FP16

On my RTX 2080 Ti, dot products are no faster with FP16 than with FP32 (and the FP16 case is about 4 times slower than the equivalent PyTorch code). Is there some flag or environment variable that I'm missing?

import mxnet as mx
import numpy as np
import time

n = 2**14

ctx = mx.gpu(0)
dtype = np.float16

with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)

tic = time.time()
for _ in range(100):
    mx.nd.dot(a, b, out=c)
    res = float(c[0, 0].asscalar())  # "use" the result: reading one element forces the async dot to finish
print(time.time() - tic)

(This prints about 60 seconds for either dtype.)
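For what it's worth, here is a restructured version of the same benchmark, sketched under the assumption that it helps to warm up once and synchronize with mx.nd.waitall() only at the end, so the measurement isn't dominated by first-call setup cost (same shapes and names as above):

import time
import mxnet as mx
import numpy as np

n = 2**14
ctx = mx.gpu(0)

for dtype in (np.float32, np.float16):
    a = mx.nd.zeros((n, n), dtype=dtype, ctx=ctx)
    b = mx.nd.zeros((n, n), dtype=dtype, ctx=ctx)
    c = mx.nd.zeros((n, n), dtype=dtype, ctx=ctx)

    mx.nd.dot(a, b, out=c)   # warm-up: pays cuBLAS handle creation / first-call cost
    mx.nd.waitall()          # block until all queued GPU work has finished

    tic = time.time()
    for _ in range(100):
        mx.nd.dot(a, b, out=c)
    mx.nd.waitall()          # synchronize once at the end instead of per iteration
    print(dtype, time.time() - tic, "seconds")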

I tried these environment variables (after looking at the source code), but they make no difference to the speed. cuDNN reports version 7600 (i.e. 7.6.0), apparently.

CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout MXNET_CUDA_ALLOW_TENSOR_CORE=1 MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION=1
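(If you prefer setting these from inside Python rather than in the shell, it is probably safest to export them before importing mxnet, since some of them may be read as the library initializes. A minimal sketch:)

import os

# Set before importing mxnet, in case any of these are read at library start-up.
os.environ["MXNET_CUDA_ALLOW_TENSOR_CORE"] = "1"
os.environ["MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION"] = "1"
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "stdout"

import mxnet as mx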

I tried your code on my Volta: float16 gave me 8.9 s while float32 gave me 61.7 s. You might need to profile it.
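If you want to see which kernels the dot call actually launches, one way is MXNet's built-in profiler. A minimal sketch (the output filename is just a placeholder):

import mxnet as mx
import numpy as np

mx.profiler.set_config(profile_all=True, filename="dot_profile.json")

n = 2**14
ctx = mx.gpu(0)
a = mx.nd.zeros((n, n), dtype=np.float16, ctx=ctx)
b = mx.nd.zeros((n, n), dtype=np.float16, ctx=ctx)
c = mx.nd.zeros((n, n), dtype=np.float16, ctx=ctx)

mx.nd.dot(a, b, out=c)        # warm-up outside the profiled region
mx.nd.waitall()

mx.profiler.set_state("run")
for _ in range(10):
    mx.nd.dot(a, b, out=c)
mx.nd.waitall()
mx.profiler.set_state("stop")
mx.profiler.dump()            # writes dot_profile.json, viewable in chrome://tracing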
On a 2080 Ti with MXNet 1.6:

  • FP16: 9.2 s
  • FP32: 56.56 s
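A rough sanity check on those timings (back-of-the-envelope only; the peak figures in the comment are the commonly quoted 2080 Ti specs, not measurements):

n = 2**14
iters = 100
flops = 2 * n**3 * iters                  # one n x n matmul costs roughly 2*n^3 FLOPs

for label, seconds in [("FP16", 9.2), ("FP32", 56.56)]:
    print(label, round(flops / seconds / 1e12, 1), "TFLOPS")

# FP16 comes out around 96 TFLOPS, far above the ~27 TFLOPS a 2080 Ti can reach
# in FP16 without tensor cores, so at that speed the tensor cores are in use.
# FP32 comes out around 16 TFLOPS, i.e. in the ballpark of the card's plain FP32 rate.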

How did you install MXNet? Via pip, or did you build from source?

Build from source. Or you can try the NVIDIA NGC MXNet container.
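If you are unsure what your currently installed build enables, recent MXNet versions expose the compile-time feature list. This doesn't show tensor-core usage directly, but it does confirm whether a given pip wheel or source build has CUDA and cuDNN compiled in:

import mxnet as mx
from mxnet.runtime import Features

features = Features()
print(mx.__version__)
print("CUDA:", features.is_enabled("CUDA"), "CUDNN:", features.is_enabled("CUDNN"))
# print(features) lists every compile-time feature of the installed build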