NDArray "cold start" on GPU?

I’m running some quick tests to compare the speed of NDArray vs numpy, and of GPU vs CPU. Everything runs in the mxnet_36 kernel of a SageMaker p3.8xl notebook.

My test data is the following:

import numpy as np
import mxnet as mx
from mxnet import nd

# Numpy
a = np.random.rand(10**3, 10**5)
b = np.random.rand(10**5, 10**4)
c = np.random.rand(10**3, 10**4)

# Numpy to NDArray
A = mx.nd.array(a)
B = mx.nd.array(b)
C = mx.nd.array(c)

The numpy version runs in 5.7s:

y = np.tanh(np.dot(a, b) + c)

The NDArray version on CPU runs in 2.6s:

Y = nd.tanh(nd.dot(A, B) + C)

The copy to GPU takes… 39s! Intuitively that sounds a bit long, no?

A_gpu = A.as_in_context(mx.gpu(0))
B_gpu = B.as_in_context(mx.gpu(0))
C_gpu = C.as_in_context(mx.gpu(0))
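To judge whether 39s is plausible for the copy, it helps to know how much data is involved. The sketch below only does the arithmetic from the shapes above; the `size_gb` helper is hypothetical, and both float64 (numpy's default for `np.random.rand`) and float32 are shown because the dtype actually held by the NDArrays is an assumption that should be checked with `A.dtype`.

```python
import math

# Shapes taken from the test data above.
shapes = {"a": (10**3, 10**5), "b": (10**5, 10**4), "c": (10**3, 10**4)}

# Bytes per element for the two dtypes considered (assumption: one of these applies).
itemsize = {"float64": 8, "float32": 4}

def size_gb(shape, dtype):
    """Size of a dense array in gigabytes (1 GB = 1e9 bytes)."""
    return math.prod(shape) * itemsize[dtype] / 1e9

for dtype in itemsize:
    per_array = {name: size_gb(shape, dtype) for name, shape in shapes.items()}
    total = sum(per_array.values())
    print(dtype, per_array, "total GB:", round(total, 2))

# One multiply and one add per inner-product term of the
# (1e3 x 1e5) @ (1e5 x 1e4) product.
flops = 2 * 10**3 * 10**5 * 10**4
print("approx. floating-point operations for the dot product:", flops)
```

The total comes to about 8.9 GB in float64 or about 4.4 GB in float32, dominated by `b`.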

The matrix multiplication + addition on GPU takes… 18s!

Y_gpu = nd.tanh(nd.dot(A_gpu, B_gpu) + C_gpu)

So the GPU is 7 times slower than the CPU for a matrix-multiply operation, which is allegedly the strength of the V100… When I re-run it, subsequent iterations take around 150ms. What is wrong with the first run? Why is there this “cold start”?
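A small harness can separate the first call from the later ones. MXNet executes operations asynchronously, so a timing that does not wait for the result may only measure how long it takes to queue the work. The `time_runs` helper below is a hypothetical sketch, not part of MXNet; it takes a `sync` callback so that a blocking call such as `mx.nd.waitall` can be passed in. A pure-Python stand-in workload is used so the sketch runs without a GPU.

```python
import time

def time_runs(fn, sync=lambda: None, repeats=5):
    """Return (first_call_seconds, [later_call_seconds, ...]).

    `sync` is called after `fn` so that an asynchronous backend has
    finished its work before the clock is read.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        sync()
        timings.append(time.perf_counter() - start)
    return timings[0], timings[1:]

# Stand-in workload so the sketch runs anywhere. With MXNet one would
# pass something like:
#   fn=lambda: nd.tanh(nd.dot(A_gpu, B_gpu) + C_gpu)
#   sync=mx.nd.waitall
first, rest = time_runs(lambda: sum(i * i for i in range(10000)))
print(f"first call: {first:.6f}s")
print("later calls:", [round(t, 6) for t in rest])
```

Reporting the first call separately from the rest makes one-off start-up cost visible, instead of mixing it into an average.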

Hi, I think this is because, by default, MXNet runs an optimization pass on the first run to find the best operations for your algorithm. The standard message I see every time I run my code is:

[14:12:47] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

There are a number of optimization-related environment variables you can play with to see the differences in performance.
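As a minimal sketch, the variable named in the log message can be set from Python. Setting it before `import mxnet` is a cautious assumption rather than a documented requirement. Note also that the message refers to convolution autotuning, so it may not explain the timings of a plain `nd.dot`.

```python
import os

# Disable cuDNN convolution autotuning, as suggested by the log message.
# Set before importing mxnet (assumption: safest ordering).
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# import mxnet as mx  # import only after the variable is set

print(os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"])
```

Comparing first-run timings with the variable set to `0` and to its default would show whether autotuning contributes to the slow start at all.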

This is almost surely because the CUDA initialization is not being cached. See my post here about how to avoid the slow start-up by caching the code and keeping it in your Docker image.
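The linked post is not reproduced here, so the following is not necessarily its method. One possible reading of "caching the code" is the CUDA driver's JIT compilation cache, which is controlled by the `CUDA_CACHE_PATH` and `CUDA_CACHE_MAXSIZE` environment variables. The sketch below points the cache at a persistent directory. The directory name is hypothetical, and whether this removes the delay described above is an assumption to verify by timing.

```python
import os
import tempfile

# Hypothetical cache location; in a container this would be a path that
# survives restarts (for example, one baked into the image or mounted).
cache_dir = os.path.join(tempfile.gettempdir(), "cuda_jit_cache")
os.makedirs(cache_dir, exist_ok=True)

# Tell the CUDA driver where to keep JIT-compiled kernels and allow a
# larger cache (value in bytes; 2 GiB here is an arbitrary choice).
os.environ["CUDA_CACHE_PATH"] = cache_dir
os.environ["CUDA_CACHE_MAXSIZE"] = str(2 * 1024**3)

# import mxnet as mx  # import only after the variables are set

print("CUDA JIT cache directory:", os.environ["CUDA_CACHE_PATH"])
```

If the cache is effective, the first run after a restart should be closer to the 150ms seen on later iterations.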