Very slow GPU initialization in nvidia-docker, but not on the host

I have a machine running nvidia-docker with MXNet inside. When the first GPU command is issued, such as mx.nd.ones(3, ctx=mx.gpu()), it takes a very long time to get going — about 2 minutes. However, outside the docker container, on the host, the first round takes just a couple of seconds. (Both the host and the docker image are using MXNet 1.2.1.)
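For anyone who wants to reproduce the measurement, here's a minimal sketch of how I'm timing the first GPU call. The helper itself is generic; the commented-out usage assumes MXNet with GPU support is installed, and uses wait_to_read() to force MXNet's asynchronous op to actually complete before the timer stops.

```python
import time

def time_first_call(fn):
    """Return the wall-clock seconds taken by a single call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Usage (assumes mxnet with GPU support):
# import mxnet as mx
# elapsed = time_first_call(lambda: mx.nd.ones((3,), ctx=mx.gpu()).wait_to_read())
# print("first GPU call took %.1fs" % elapsed)
```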

To make matters more confusing, I also have a desktop machine, and when I run mxnet in its nvidia docker, it’s fast.

These should be the same docker images. So why is the first mxnet GPU call so slow only on one machine’s nvidia docker when that same machine is also fast outside of nvidia docker?

Are there some settings I can play with to avoid this issue?

Are other non-GPU commands fast?

Yeah, CPU is instant.

To be clear, this just regards the GPU initialization. Once the very slow first GPU command goes through, it too is instant (well, instant for tiny compute commands like making an ndarray of ones :stuck_out_tongue: )

What’s the difference between your machine and your desktop machine? I feel this might be an nvidia-docker issue.

I’m also thinking it’s some interaction with nvidia-docker, since the host machine works perfectly fine with the same version of MXNet. It’s only when run inside the docker container that the first command is slow.

However, I’m not sure what is involved in that initialization process to determine what might be the issue.

To answer your question, the desktop is a tower and the computer I’m having an issue with is a small form factor Alienware. Both run Ubuntu 16.04 and use CUDA 8. The tower has a 1080 and the Alienware has a 960.

The drivers on the tower are slightly older (370-something). I’ve tried the Alienware on both 386 and 396 (I believe those were the versions). This type of setup is shared across other employees at my company, and some of their desktops have later drivers than mine. All the desktops are fine, but all of the Alienwares we have (four of them) show this slowness in nvidia-docker.

It’s possible things were not set up the same on the Alienware computers compared to the desktops, but it’s not clear. And without knowing what goes on in the GPU initialization phase, it’s hard to pin down what the meaningful difference might be.

For anyone interested, I’m pretty sure I found the issue and have an idea of how to fix it. The core issue is NVIDIA’s fat binaries and JIT compilation: if the MXNet binary doesn’t embed compiled code for your GPU’s architecture, the driver has to JIT-compile it from PTX on first use.

If you export the following variable (as suggested here):

export CUDA_CACHE_MAXSIZE=2147483647

Then the first GPU call will still be slow while the driver JIT-compiles the code, but the result will now be cached, so any subsequent process that uses the GPU will take only a second or two on its first GPU call by reusing the cached code.

Of course, in a docker container it’s a bit trickier, because the next time you start the container that cache will be gone. But that should be resolvable by mapping a directory from your host machine to ~/.nv inside the container. With the directory mapped in, data saved there persists on your host and is always pushed back in, so a new container start doesn’t lose the cache and stays fast.
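Concretely, something like the following should work. This is a sketch, not a tested command line: it assumes the container process runs as root (so its cache lives in /root/.nv), and "my-mxnet-image" is a placeholder for your actual image name:

```shell
# Create the cache directory on the host so it survives container restarts
mkdir -p "$HOME/.nv"

# Map it into the container and keep the larger cache limit
if command -v nvidia-docker >/dev/null 2>&1; then
  nvidia-docker run \
    -v "$HOME/.nv:/root/.nv" \
    -e CUDA_CACHE_MAXSIZE=2147483647 \
    my-mxnet-image
fi
```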


@jmacglashan did this solve the issue?

It did indeed fix it :slight_smile: Once I mapped the cache directory into docker to retain the cache and increased the size, it’s been fine.