Multi-system multi-GPU distributed training slower than single-system multi-GPU

If I run single-system multi-GPU training on V100s (4 V100s with 16 GB RAM each, kvstore = device), I get up to
Speed: 40168.93 samples/sec

If I use the same code in a multi-system multi-GPU setting, with kvstore set to dist_sync or dist_sync_device, the speed drops to
Speed: 4854.95 samples/sec
(2 machines, each with 4 V100s with 16 GB RAM)

With gradient compression, as discussed in https://mxnet.incubator.apache.org/faq/gradient_compression.html,
I get 6339.16 samples/sec, which is slightly better but still much worse than what I get on a single system.

I am following the steps discussed here: https://mxnet.incubator.apache.org/versions/master/faq/distributed_training.html
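For reference, a minimal sketch of how the kvstore and gradient compression are selected in my runs (not my exact script; the model, shapes, and hyperparameters below are placeholders):

```python
import mxnet as mx
from mxnet import gluon

# Placeholder model; the real one is much larger.
net = gluon.nn.Dense(10)
net.initialize(ctx=[mx.gpu(i) for i in range(4)])

# Single machine: kvstore='device'. Distributed (launched via tools/launch.py):
# kvstore='dist_sync' or 'dist_sync_device'. compression_params enables the
# 2-bit gradient compression described in the FAQ linked above.
trainer = gluon.Trainer(
    net.collect_params(), 'sgd', {'learning_rate': 0.01},
    kvstore='dist_sync',
    compression_params={'type': '2bit', 'threshold': 0.5},
)
```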

I tried the following setups.

Setup 1
1 scheduler running on a separate machine
1 server and 1 worker on the first GPU machine
1 server and 1 worker on the second GPU machine

Setup 2

1 scheduler running on a separate machine
1 server on the first CPU machine
1 server on the second CPU machine
1 worker on the first GPU machine
1 worker on the second GPU machine

MXNet version: 1.1.0
dist_async doesn’t seem to work with 1.1.0; I will try a higher version. Before that, I need to understand why there is such a significant drop in performance when distributed mode is used.

Thanks,

  • Can you check the GPU consumption while training is running? If GPU consumption is low, it could mean that loading data is the bottleneck. Is your data on some network storage that could be slow?
  • For distributed training to be fast, the compute/IO ratio must be high. If you are using a small model, this ratio could be low. What model are you using?
  • What bandwidth do you have for network communication between your machines? Can you use iperf to test whether the network can actually support the bandwidth you need?

Thanks, indu, for the response.

  • These are run on AWS instances. I am not sure about the bandwidth, but I believe it should be fast enough; I will check with iperf.
  • I am now trying a 2-layer RNN LSTM, and it has a large enough number of parameters along with a large vocabulary.
  • Data is copied to local storage before it is accessed; network storage is not used. But I need to optimize it further. GPU consumption is low: I see some GPUs at around 26% usage and some at 5%. I will check it further.

The questions I have are:

  • On a single system with multiple GPUs I get 40k samples per second. That is pretty decent performance, and with data-reading or batching optimization I might get even better performance.
  • However, on multi-system multi-GPU it drops to close to 6k samples per second. Network bandwidth could be an issue, but I am not sure how much of an effect it has. I can expect some drop in performance due to network overhead, but it should not be to the extent I am observing. I am running profiling, which might give some more information on this drop in performance.

Do you know how many batches per second you are processing on a single machine? batches_per_sec together with the combined size of all parameters will give you an idea of how much data needs to be sent and received per second to sustain the same 40k samples/sec. You can then check the available bandwidth with iperf. Don’t be surprised if you were expecting 20 Gbps but are only getting 6 Gbps; I’ve seen this happen. If you are using EC2, try to place the machines in one cluster placement group for low latency and high throughput.
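For example, a rough back-of-the-envelope estimate; the parameter count and batch size below are made-up placeholders, so plug in your own numbers:

```python
# Rough estimate of the traffic needed to sustain a target throughput with
# kvstore=dist_sync, where each worker pushes gradients and pulls weights.
param_bytes = 50e6 * 4            # e.g. 50M float32 parameters (placeholder)
batch_size = 256                  # per-worker batch size (placeholder)
target_samples_per_sec = 40000    # the single-machine number you reported

batches_per_sec = target_samples_per_sec / batch_size
bytes_per_sec = batches_per_sec * param_bytes * 2   # send gradients + receive weights

print("~%.0f Gbit/s needed" % (bytes_per_sec * 8 / 1e9))  # compare with iperf
```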

5% to 26% GPU utilization is pretty low. Try to find out where the bottleneck is. Try disabling parameter updates completely and see what GPU utilization you get. Try running the network with random data generated in memory instead of reading from disk (your drive is probably EBS, which is network dependent) and see if that improves GPU utilization. Keep narrowing it down until you find the bottleneck.
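For the random in-memory data test, something along these lines could work (a minimal sketch assuming a Gluon training loop; the model, shapes, and batch size are placeholders, so substitute your LSTM and real dimensions):

```python
import numpy as np
import mxnet as mx
from mxnet import autograd, gluon

ctx = [mx.gpu(i) for i in range(4)]
net = gluon.nn.Dense(10)                      # placeholder; use your LSTM here
net.initialize(ctx=ctx)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        kvstore='device')     # or 'dist_sync' when launched distributed
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

# Random data generated once in memory: no disk, EBS, or network storage is
# touched, so if GPU utilization jumps, the input pipeline is the bottleneck.
data = mx.nd.random.uniform(shape=(1024, 512))
label = mx.nd.array(np.random.randint(0, 10, size=(1024,)))

for _ in range(100):
    d_parts = gluon.utils.split_and_load(data, ctx)
    l_parts = gluon.utils.split_and_load(label, ctx)
    with autograd.record():
        losses = [loss_fn(net(d), l) for d, l in zip(d_parts, l_parts)]
    for l in losses:
        l.backward()
    trainer.step(data.shape[0])
mx.nd.waitall()
```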

I will take your suggestions and check further.

Regarding GPU utilization: on a single machine with 4 GPUs the utilization is close to 80% (kvstore=device). If I use a distributed setup with kvstore=dist_sync and one server and one worker, each on a different machine, the utilization drops to less than 25%. With two workers and two servers the utilization drops below 20%, and I guess it averages around 10%.

Any help from your side would be appreciated; I look forward to your reply. Thanks a lot.
I used TF 2.5 to do distributed training. The device configuration is 2 machines with 2 GPUs each, i.e. 2m4GPU. The 2 machines are connected by optical fiber, and the bandwidth supports 10 Gbit/s = 1250 MB/s. The machines, GPU type, and memory are all the same across the 2 machines.

Let’s compare the following 2 test cases:

1m2GPU:
I used the TF distributed strategy MirroredStrategy to do the 1-machine 2-GPU training. The training time is 1522 seconds; the training task is an ALBERT-base 12-layer text classification task. GPU memory usage is 96.88% and utilization is 95.78%.

2m4GPU:
I used the TF distributed strategy MultiWorkerMirroredStrategy to do the 2-machine 4-GPU training. The training time is 1013 seconds for the same ALBERT-base 12-layer text classification task. GPU memory usage is 82.99% and utilization is 71.89%.
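For reference, the strategy setup for the two cases looks roughly like this (a sketch rather than my full script; the TF_CONFIG host names, ports, and model are placeholders):

```python
import json
import os

import tensorflow as tf

# Each machine sets TF_CONFIG before launching; host names/ports are placeholders
# and "index" is 1 on the second machine. For the 1m2GPU case I simply use
# tf.distribute.MirroredStrategy() instead and skip TF_CONFIG.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; the real job is an ALBERT-base 12-layer text classifier.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
```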

Comparing 1m2GPU with 2m4GPU, I found that multi-machine multi-GPU training saves only (1522 - 1013) / 1522 = 33.4% of the time, which is less than the 50% I expected. I also monitored the network bandwidth between the 2 machines and found about 152 MB/s used on average.
CPU usage is 170% and memory usage is 2.25%, so neither of these is the bottleneck.
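Spelling out the arithmetic with the numbers above:

```python
t_1m2gpu = 1522.0   # seconds, MirroredStrategy, 1 machine x 2 GPUs
t_2m4gpu = 1013.0   # seconds, MultiWorkerMirroredStrategy, 2 machines x 4 GPUs

speedup = t_1m2gpu / t_2m4gpu                    # ~1.50x (ideal: 2x)
time_saved = (t_1m2gpu - t_2m4gpu) / t_1m2gpu    # ~33.4% (ideal: 50%)
scaling_efficiency = speedup / 2                 # ~75%
```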
Is there any method I can use to improve and accelerate multi-machine multi-GPU distributed training?

Another question: during multi-machine multi-GPU distributed training, the GPU memory usage is lower than in single-machine multi-GPU training. What is the root cause of this? GPU utilization also becomes lower.

I also checked some methods to accelerate it, such as the data input pipeline (e.g. prefetch / map_and_batch / num_parallel_batches / shuffle / repeat), but the pipeline doesn’t seem to bring any obvious acceleration.
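The input pipeline I tried looks roughly like this (a sketch rather than the exact code; the file name, parser, and batch size are placeholders):

```python
import tensorflow as tf

def parse_example(record):
    # Placeholder parser; the real one decodes the ALBERT classification features.
    features = {
        "input_ids": tf.io.FixedLenFeature([128], tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(record, features)

dataset = (
    tf.data.TFRecordDataset(["train.tfrecord"])   # placeholder file name
    .shuffle(10000)
    .repeat()
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
```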
Do you have any good suggestions to accelerate this?
Thanks a lot.