Why is the CPU load so heavy during training? How can I reduce it?

I am training a model similar to ResNet-50 on a server with 8 Tesla V100 GPUs and a CPU with 72 virtual cores. I find it really strange that MXNet uses about 2500% CPU during training with a single GPU. This causes a big problem for me.

When I run 8 Docker containers on this server (each container runs a program training on one GPU), the training speed is extremely low. If I run only 1 container, I can achieve a training speed of 380 FPS, but with 8 containers the speed drops to about 10 FPS. I have never had a similar problem with PyTorch or TensorFlow, where the CPU load is always ~500% per process.

So why is the CPU load so heavy, and what can I do to reduce it? I never imagined that I couldn't run 8 GPU training processes on a server with a 72-core CPU…

Hi @qjfytz,

I’m assuming you’re training everything on the GPU here, rather than on the CPU, where optimised libraries like MKL intentionally use many cores to get the most out of the hardware.

Usually when CPU load is high during GPU training, the CPU is working on data loading and pre-processing. You could try limiting the number of workers in your DataLoader. Also make sure the kvstore of your training/optimizer is set to 'device', otherwise you might be adding load to your CPU for weight updates.

Another method might be to limit the resources allocated to each Docker container, but this is outside the scope of MXNet. After a quick Google search I came across this StackOverflow post that might be of use: https://stackoverflow.com/questions/26841846/how-to-allocate-50-cpu-resource-to-docker-container. It looks like you might be able to use --cpuset and/or --cpu-shares to split CPUs between tasks.

Thanks @thomelane! It’s very nice of you to give me so much useful advice.

As you mentioned, I need to set the number of workers of the DataLoader and set the kvstore to ‘device’. In my program, I manually define a DataIter, a subclass of mx.io.DataIter, and define a next() method to iterate over my dataset. But I don’t know how to set the number of workers for this self-defined data iterator. Could you give me some advice? As for the second suggestion, I have already set the ‘kvstore’ parameter to ‘device’ when calling model.fit().

I still think the problem is that too many CPU threads are open. The thread count is 147 when I use just one GPU, and 162 when I use four. That’s really a big number.

Unfortunately I’m not as familiar with mx.io.DataIter, but I believe it uses multiple threads rather than multiple processes. Some DataIters (such as ImageRecordIter) have a preprocess_threads option that you could limit, but I don’t know which DataIter you’re extending, so I can’t say whether this is relevant for you.

We still need to identify the problem first though, before speculating on a fix.

One idea to isolate the issue (given we think it could be data loading) would be to remove the neural network code entirely (no forward or backward pass), just loop through the dataset for a few epochs, and benchmark that.
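A bare loading loop like the following gives such a throughput number with no network in the way (a generic sketch; `data_iter` stands in for your own iterator):

```python
import time

def benchmark_loader(data_iter, epochs=3, batch_size=32):
    """Loop over the data with no forward/backward pass; return samples/sec."""
    start = time.time()
    n_batches = 0
    for _ in range(epochs):
        # with an mx.io.DataIter you would call data_iter.reset() here
        for batch in data_iter:
            n_batches += 1  # just consume the batch, do nothing else
    elapsed = time.time() - start
    return n_batches * batch_size / max(elapsed, 1e-9)

# Usage with a stand-in iterable; substitute your real data iterator:
fps = benchmark_loader([object()] * 10, epochs=3, batch_size=32)
print('%.0f samples/sec' % fps)
```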

You can also try using the MXNet Profiler, which should break out the different threads (if the threads are spawned by the backend process, rather than the frontend process). See this tutorial for details of how to do this.

And lastly, there’s a page that lists the environment variables that can be used with MXNet, many of which relate to the number of threads used for different tasks.


I managed to solve this problem by manually setting the environment variable OMP_NUM_THREADS to 4 * num_GPUs_used. The thread count dropped by about 90 and everything works well now. Maybe this variable is related to the data loading process, since I find I cannot set it to too small a value. It’s still a little strange: I thought this variable only affected performance when training on the CPU.

Thanks for your patience and helpful answers. They helped me a lot in finding the final solution!