Multi-system multi-GPU distributed training slower than single-system multi-GPU

If I run single-system multi-GPU training (kvstore = device) on 4 V100s with 16 GB RAM each, I can get up to
Speed: 40168.93 samples/sec

If I use the same code in a multi-system multi-GPU setting, with kvstore set to dist_sync or dist_sync_device, the speed drops to
Speed: 4854.95 samples/sec
(2 machines, each with 4 V100s with 16 GB RAM)
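For reference, the kvstore mode is the only thing that changes between the two runs; a minimal sketch of the choices in MXNet 1.x (the distributed modes only work when the process is started under the parameter-server launcher):

```python
import mxnet as mx

# Where gradient aggregation happens is controlled by the kvstore type.
kv = mx.kvstore.create("device")             # single machine, aggregate on GPUs

# kv = mx.kvstore.create("dist_sync")        # multi-machine, aggregate on
#                                            # the parameter servers (CPU)
# kv = mx.kvstore.create("dist_sync_device") # multi-machine, aggregate on each
#                                            # worker's GPUs, then push to servers
```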

With gradient compression as discussed in
I get 6339.16 samples/sec, which is slightly better but still far worse than what I get on a single machine.
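For context, 2-bit gradient compression in MXNet 1.x is enabled on the kvstore itself; a minimal sketch (the threshold value here is an assumption, and the script must be started via the distributed launcher so the DMLC_* environment variables are set):

```python
import mxnet as mx

# Enable 2-bit gradient compression on a distributed kvstore.
# Gradient elements with magnitude below `threshold` are quantized to 0;
# larger ones to +/- threshold, with the residual carried to the next step.
kv = mx.kvstore.create("dist_sync")
kv.set_gradient_compression({"type": "2bit", "threshold": 0.5})
```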

I am following the steps discussed here

I tried the following setups

setup 1
1 scheduler running on a separate machine
1 server and 1 worker on 1 GPU machine
1 server and 1 worker on 1 GPU machine

setup 2

1 scheduler running on a separate machine
1 server on 1 CPU machine
1 server on 1 CPU machine
1 worker on 1 GPU machine
1 worker on 1 GPU machine
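For context, both setups were started with MXNet's parameter-server launcher; a hypothetical invocation for setup 1 (the hosts file, script name, and flags are assumptions — the scheduler runs on whichever machine executes this command):

```shell
# hosts: one line per machine that should run a server + worker pair.
# -n = number of workers, -s = number of servers, launched over ssh.
python tools/launch.py -n 2 -s 2 --launcher ssh -H hosts \
    python train.py --kv-store dist_sync
```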

mxnet version: 1.1.0
dist_async doesn’t seem to work with 1.1.0. I will try a higher version. Before that, I need to understand why there is such a significant drop in performance in distributed mode.


  • Can you check the GPU utilization while training is running? If GPU utilization is low, it could mean data loading is the bottleneck. Is your data on network storage that could be slow?
  • For distributed training to be fast, the compute/IO ratio must be high. If you are using a small model, this ratio could be low. What model are you using?
  • What bandwidth do you have for network communication between your machines? You can use iperf to test whether the network actually supports the bandwidth you need.
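A quick way to measure the point-to-point bandwidth between the two machines with iperf (the IP address is a placeholder for the other machine's private address):

```shell
# On machine A, start an iperf server:
iperf -s

# On machine B, measure throughput to machine A
# using 4 parallel streams for 30 seconds:
iperf -c 172.31.0.12 -P 4 -t 30
```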

Thanks, indu, for the response.

  • These are run on AWS instances. I’m not sure about the bandwidth, but I believe it should be fast enough. I will check with iperf.
  • I am now training a 2-layer LSTM RNN; it has a large number of parameters and a large vocabulary.
  • Data is copied to local disk before being accessed; network storage is not used. But it needs further optimization. GPU utilization is low: I see some GPUs at 26% and some at 5%. I will check it further.

The questions I have are:

  • On a single system with multiple GPUs I get 40k samples per second, which is pretty decent performance. With data-reading or batching optimization I might do even better.
  • However, on multi-system multi-GPU it drops to around 6k samples per second. Network bandwidth could be an issue; I’m not sure how much effect it has. I can expect some drop in performance due to network overhead, but not to the extent I am observing. I am running profiling, which might give more information on this drop in performance.

Do you know how many batches per second you are processing on a single machine? batches_per_sec, along with the combined size of all parameters, will give you an idea of how much data needs to be sent and received per second to sustain the same 40k samples/sec. You can then check the available bandwidth with iperf. Don’t be surprised if you were expecting 20 Gbps but only get 6 Gbps - I’ve seen this. If you are using EC2, try to place the machines in one cluster placement group for low latency and high throughput.
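As a back-of-the-envelope sketch of that calculation (the batch size and parameter count below are illustrative assumptions, not numbers from this thread):

```python
# Rough estimate of the network traffic dist_sync needs to sustain
# a target throughput. All inputs are illustrative assumptions.
samples_per_sec = 40_000        # single-machine throughput we want to match
batch_size = 512                # assumed samples per step across all GPUs
param_bytes = 50_000_000 * 4    # assumed 50M float32 parameters

batches_per_sec = samples_per_sec / batch_size
# Per step a worker pushes gradients and pulls updated weights, so
# roughly 2x the parameter size crosses the network per batch.
bytes_per_sec = batches_per_sec * 2 * param_bytes
gbits_per_sec = bytes_per_sec * 8 / 1e9
print(f"~{gbits_per_sec:.0f} Gbit/s needed")
```

With these assumed numbers the required bandwidth is far above what a typical EC2 NIC delivers, which is why measuring with iperf first matters.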

5% to 26% GPU utilization is pretty low. Try to narrow down where the bottleneck is: disable parameter updates completely and see what GPU utilization you get; run the network with random data generated in memory instead of reading from disk (your drive is probably EBS, which is network-dependent) and see if that improves utilization.
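The random-data check above could be sketched like this in Gluon (the shapes, hidden size, and loss are assumptions standing in for the real model; it assumes a GPU is present):

```python
import time
import mxnet as mx
from mxnet import gluon, autograd

# Hypothetical micro-benchmark: feed random in-memory batches so any
# throughput gap vs. the real pipeline points at data loading / EBS.
ctx = mx.gpu(0)
net = gluon.rnn.LSTM(hidden_size=650, num_layers=2)
net.initialize(ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), "sgd")
loss_fn = gluon.loss.L2Loss()

# Layout is (seq_len, batch, features); sizes are assumptions.
batch = mx.nd.random.uniform(shape=(35, 512, 650), ctx=ctx)
target = mx.nd.random.uniform(shape=(35, 512, 650), ctx=ctx)

mx.nd.waitall()                      # make sure timing excludes init
start = time.time()
for _ in range(50):
    with autograd.record():
        loss = loss_fn(net(batch), target)
    loss.backward()
    trainer.step(512)
mx.nd.waitall()                      # wait for async execution to finish
print(f"{50 * 512 / (time.time() - start):.0f} samples/sec")
```

If this runs much faster than the real pipeline at similar GPU utilization, the input pipeline, not the network, is the first thing to fix.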

I will take your suggestion and check further.

Regarding GPU utilization: on a single machine with 4 GPUs, utilization is close to 80% (kvstore=device). With a distributed setup (kvstore=dist_sync) and one server and one worker, each on a different machine, utilization drops below 25%. With two workers and two servers it drops below 20%, and I guess it averages around 10%.