Question about memory usage when using multiple GPUs

Hello,

This is more of a question about Gluon's behavior when using multiple GPUs. I tried running one of the sample codes (http://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html) from the Gluon notebooks on a p2.16xlarge box. Here's the output of nvidia-smi with GPU_COUNT set to 8:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     60550    C   python3                                       391MiB  |
|    1     60550    C   python3                                       262MiB  |
|    2     60550    C   python3                                       262MiB  |
|    3     60550    C   python3                                       261MiB  |
|    4     60550    C   python3                                       261MiB  |
|    5     60550    C   python3                                       261MiB  |
|    6     60550    C   python3                                       261MiB  |
|    7     60550    C   python3                                       261MiB  |
+-----------------------------------------------------------------------------+

I don't expect the memory utilization to be exactly equal across all the GPUs, but the difference grows if I increase GPU_COUNT to 16. I know the batch size is also multiplied by GPU_COUNT, but the initial batch is stored on the CPU. Are there any additional changes that need to be made to the code to ensure a more even memory utilization?

I am actually trying to run a simple FC network to compute knowledge base embeddings using multiple GPUs. On the same box, the maximum GPU count I can set is 4; anything above that fails with an OOM error.

Thanks,
Rahul

gpu(0) is used by default by the trainer to aggregate the gradients from all the devices and to perform the parameter updates. That's why you see a higher memory footprint on that GPU.
The default kvstore used when instantiating a gluon.Trainer is 'device', which corresponds to what I just described. The alternative is 'local', which causes both the aggregation and the updates to happen on the CPU, freeing some of that GPU memory at the expense of copying data from GPU to CPU (slower than copying between GPUs).
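To make that concrete, here is a minimal sketch of how the kvstore argument is passed when creating the Trainer. The toy Dense net and the hyperparameters are just placeholders, not the model from the notebook:

import mxnet as mx
from mxnet import gluon

# Toy network for illustration only; substitute your own model.
net = gluon.nn.Dense(10)
net.initialize(ctx=[mx.gpu(i) for i in range(8)])

# Default: kvstore='device' -> gradients are aggregated and parameters updated
# on gpu(0), which is why that GPU shows a larger memory footprint.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1},
                        kvstore='device')

# Alternative: kvstore='local' -> aggregation and updates happen on the CPU,
# freeing GPU memory at the cost of GPU<->CPU copies.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1},
                        kvstore='local')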

With the current model I am working on, I see gpu(0) getting ~2.5x more memory allocated than the rest, probably because I use RMSProp, which requires storing a running average of the previous squared gradients. I assume you are using sgd or nesterov, which keep less optimizer state, so in your case gpu(0)'s footprint is only slightly larger than that of the other GPUs.
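As a rough illustration (reusing the toy net from the sketch above), the extra state is determined by the optimizer you pass to the Trainer:

# RMSProp keeps a running average of squared gradients as extra per-parameter
# state, which adds to the footprint on gpu(0) when kvstore='device'.
trainer = gluon.Trainer(net.collect_params(), 'rmsprop',
                        {'learning_rate': 0.001})

# Plain SGD keeps no extra per-parameter state (momentum/Nesterov variants keep
# one buffer), so the overhead on gpu(0) is smaller.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1})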

There is some more info here: https://mxnet.incubator.apache.org/api/python/kvstore/kvstore.html#mxnet.kvstore.create

Thanks for the explanation @bejjani!