Understanding MXNet multi-GPU performance

So, I’m trying to understand how distributed training in MXNet really works (in order to optimize my code):

  • I understand that Mu Li’s paper is a good reference, but I couldn’t find an answer to the following question:

I’m training two networks that differ by a single hyperparameter (a regularization coefficient). The first one is training on the first 8 GPUs, and the second one on all 16 GPUs (the launch setup is sketched below, after the gpustat output). The samples/sec are steady within each run, but the second network is clearly slower:

First network:

INFO:root:Epoch[37] Batch [30256]	Speed: 197771.84 samples/sec	QLSumMetric=729.808672
INFO:root:Epoch[37] Batch [30744]	Speed: 206046.96 samples/sec	QLSumMetric=729.015916
INFO:root:Epoch[37] Batch [31232]	Speed: 178224.72 samples/sec	QLSumMetric=727.661732
INFO:root:Epoch[37] Batch [31720]	Speed: 167553.20 samples/sec	QLSumMetric=724.259217
INFO:root:Epoch[37] Batch [32208]	Speed: 208376.83 samples/sec	QLSumMetric=732.663074
INFO:root:Epoch[37] Batch [32696]	Speed: 206057.11 samples/sec	QLSumMetric=716.422353
INFO:root:Epoch[37] Batch [33184]	Speed: 187960.58 samples/sec	QLSumMetric=733.493551
INFO:root:Epoch[37] Batch [33672]	Speed: 208166.71 samples/sec	QLSumMetric=733.055588
INFO:root:Epoch[37] Batch [34160]	Speed: 192618.61 samples/sec	QLSumMetric=723.640843

Second network:

INFO:root:Epoch[13] Batch [11224]	Speed: 101530.47 samples/sec	QLSumMetric=736.330289
INFO:root:Epoch[13] Batch [11712]	Speed: 105602.55 samples/sec	QLSumMetric=734.239894
INFO:root:Epoch[13] Batch [12200]	Speed: 104586.31 samples/sec	QLSumMetric=744.775742
INFO:root:Epoch[13] Batch [12688]	Speed: 107612.01 samples/sec	QLSumMetric=743.912667
INFO:root:Epoch[13] Batch [13176]	Speed: 106278.79 samples/sec	QLSumMetric=738.141423
INFO:root:Epoch[13] Batch [13664]	Speed: 105420.35 samples/sec	QLSumMetric=736.574530
INFO:root:Epoch[13] Batch [14152]	Speed: 101914.44 samples/sec	QLSumMetric=741.377023
INFO:root:Epoch[13] Batch [14640]	Speed: 106754.43 samples/sec	QLSumMetric=744.095754
INFO:root:Epoch[13] Batch [15128]	Speed: 106558.59 samples/sec	QLSumMetric=744.974015
INFO:root:Epoch[13] Batch [15616]	Speed: 104182.89 samples/sec	QLSumMetric=736.113241

The second network runs at about half the speed (the speeds themselves are fine for my purposes; I just want to understand why). gpustat output is below:

[0] Tesla K80        | 65'C,  52 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[1] Tesla K80        | 56'C,  52 % |   321 / 11439 MB | ubuntu(161M) ubuntu(153M)
[2] Tesla K80        | 76'C,  54 % |   328 / 11439 MB | ubuntu(168M) ubuntu(153M)
[3] Tesla K80        | 63'C,  56 % |   330 / 11439 MB | ubuntu(170M) ubuntu(153M)
[4] Tesla K80        | 67'C,  55 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[5] Tesla K80        | 54'C,  55 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[6] Tesla K80        | 69'C,  49 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[7] Tesla K80        | 58'C,  51 % |   326 / 11439 MB | ubuntu(157M) ubuntu(162M)
[8] Tesla K80        | 56'C,  33 % |   176 / 11439 MB | ubuntu(172M)
[9] Tesla K80        | 47'C,  37 % |   179 / 11439 MB | ubuntu(175M)
[10] Tesla K80        | 63'C,  15 % |   157 / 11439 MB | ubuntu(153M)
[11] Tesla K80        | 52'C,  16 % |   157 / 11439 MB | ubuntu(153M)
[12] Tesla K80        | 59'C,  16 % |   159 / 11439 MB | ubuntu(153M)
[13] Tesla K80        | 49'C,  16 % |   157 / 11439 MB | ubuntu(153M)
[14] Tesla K80        | 64'C,  15 % |   157 / 11439 MB | ubuntu(153M)
[15] Tesla K80        | 54'C,  16 % |   157 / 11439 MB | ubuntu(153M)
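
For reference, the two jobs differ only in the device list passed to the Module. A minimal sketch (not the exact training script; the symbol here is just a placeholder):

import mxnet as mx

# Minimal sketch of how the two jobs are launched; only the context list
# differs.  The symbol below is a placeholder, not the real network.
data = mx.sym.Variable('data')
net  = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(data, num_hidden=512), name='softmax')

ctx_first  = [mx.gpu(i) for i in range(8)]    # first network:  GPUs 0-7
ctx_second = [mx.gpu(i) for i in range(16)]   # second network: GPUs 0-15

mod = mx.mod.Module(symbol=net, context=ctx_first,
                    data_names=['data'], label_names=['softmax_label'])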

What’s the batch size of the first and second network, and what’s your model?

200K samples per second suggests your network is way too small to benefit from training on multiple GPUs.

You need a network at least as big as AlexNet to benefit from multi-GPU training.

Both networks have a batch size of 2048; it’s a 215-input x 512 x 512 x 570 MLP.
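
In MXNet symbol terms it is roughly this (a sketch only; the activation and the output layer behind QLSumMetric are placeholders):

import mxnet as mx

# Rough sketch of the 215 x 512 x 512 x 570 MLP.  ReLU and the regression
# output are placeholders; the real activation/loss is not shown in this thread.
data = mx.sym.Variable('data')                                     # shape (batch, 215)
h1   = mx.sym.Activation(mx.sym.FullyConnected(data, num_hidden=512), act_type='relu')
h2   = mx.sym.Activation(mx.sym.FullyConnected(h1, num_hidden=512), act_type='relu')
out  = mx.sym.FullyConnected(h2, num_hidden=570)                   # 570 outputs
net  = mx.sym.LinearRegressionOutput(out, name='out')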

  1. If the 8-GPU run uses a batch size of 2048, then with 16 GPUs you should use 4096 to compare performance fairly (see the sketch below).
  2. As @piiswrong said, your model is too small to test multi-GPU performance.
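
To expand on point 1: with MXNet’s data-parallel Module, the batch size you give the iterator is the global batch and is sliced evenly across the context list, so at a fixed 2048 the per-GPU slice halves when you go from 8 to 16 GPUs:

# Global batch is split across devices by the data-parallel Module.
global_batch = 2048
print(global_batch // 8)    # 256 samples per GPU on the 8-GPU run
print(global_batch // 16)   # 128 samples per GPU on the 16-GPU run
# On an already tiny model, halving the per-GPU slice means even less work per
# kernel launch, which is why 4096 is the fairer comparison.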

Just rephrasing this: if you have such a high samples/sec throughput, you’re really pushing the limits of the Python frontend. Try the same setup with a problem where the GPUs have some real work to do (rather than just shuffling data to and from the GPU). Also note that on the P2.8xlarge and the P2.16xlarge the PCI Express bus behaves a bit differently, since in the latter case all 16 GPUs share one CPU. This might also have an influence, but the main issue is that your problem is ‘too simple’.
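
A minimal sketch of what I mean by giving the devices real work, assuming 16 visible GPUs (an explicit synchronisation is needed because MXNet queues kernels asynchronously):

import time
import mxnet as mx

# Heavy matrix multiplies on every device; waitall() blocks until the GPU
# work is actually finished, so the timing is not dominated by the frontend.
ctx = [mx.gpu(i) for i in range(16)]
mats = [mx.nd.random.uniform(shape=(4096, 4096), ctx=c) for c in ctx]
mx.nd.waitall()                            # finish initialisation first

start = time.time()
prods = [mx.nd.dot(m, m) for m in mats]    # kernels are queued asynchronously
mx.nd.waitall()                            # wait for every device to finish
print('elapsed: %.3f s' % (time.time() - start))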

Understood. Though I did see better performance with multiple GPUs than with a single one, which caps out at 110-115K samples/second (vs. 200K for 8 GPUs).
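
A rough back-of-the-envelope on those numbers (using the midpoint of the single-GPU range):

single_gpu = 112500.0   # ~110-115K samples/sec on one GPU (midpoint)
eight_gpu  = 200000.0   # ~200K samples/sec on 8 GPUs
print(eight_gpu / (8 * single_gpu))   # ~0.22, i.e. roughly 22% scaling efficiency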

Thanks for the responses and help! This forum is a great idea!

Hi @piiswrong, I am interested in this point: “You need a network at least as big as AlexNet to benefit from multi-GPU training.” What metrics would you use (from both the GPU and the model) to judge, even qualitatively, whether multi-GPU training is worthwhile for a given model?