How to speed up training of a neural network model with MXNet?


I want to train a neural network model with MXNet. It has two hidden layers, one with 1024 nodes and the other with 512 nodes. There are 250k input nodes and 200k output nodes. It’s a fully connected network, and I use the following pseudocode for training:

net = mx.sym.load(model_file)
ctx = [mx.gpu(i) for i in range(8)]
model = mx.mod.Module(
#load train input data
#load train output data
for each epoch:
    for each batch:
        # prepare the train input/output data for the current batch
        train_iter = mx.io.NDArrayIter(train_in, train_out, batch_size)
        for batch in train_iter:
            model.forward(batch, is_train=True)

The job runs on an 8-GPU host. There are about 2M training samples and the batch size is 256. A single epoch takes more than 3 hours. Profiling shows that nearly half of the time is spent preparing train_iter for the current batch, while the other half goes to the model forward/backward/update.
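To put these figures in per-batch terms, here is a rough back-of-the-envelope calculation. It assumes exactly 2,000,000 samples and exactly 3 hours per epoch (the post says "about" and "more than"), and an even split between data preparation and compute, so the results are approximate lower bounds rather than measurements.

```python
import math

num_samples = 2_000_000     # "about 2M" in the post; assumed exact here
batch_size = 256
epoch_seconds = 3 * 3600    # "more than 3 hours"; treated as a lower bound

batches_per_epoch = math.ceil(num_samples / batch_size)
seconds_per_batch = epoch_seconds / batches_per_epoch
# "nearly half" of the time goes to data preparation; assume exactly half.
data_prep_per_batch = seconds_per_batch / 2

print(batches_per_epoch)
print(round(seconds_per_batch, 2))
print(round(data_prep_per_batch, 2))
```

Under these assumptions each batch costs at least about 1.4 seconds, of which roughly 0.7 seconds is data preparation alone.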

Besides running this on multiple hosts, are there any other suggestions for speeding up training on a single host? Increase the batch size? Compile MXNet with NNPACK? Use KVStore with the device setup? I would really appreciate any advice.

Is that the actual code you are running?

Almost. For each batch, the actual logic is:

get the next 256 training input samples and store them in train_batch_input;
get the next 256 training output samples and store them in train_batch_output;
tr_iter = mx.io.NDArrayIter(train_batch_input, train_batch_output, 256)
go on with the model forward/backward/update

Using a PrefetchingIter would significantly speed up data loading.

@zhreshold Could you share some details?

It looks like creating the NDArrayIter with a context of mx.gpu() could help improve the speed. But is it possible to apply this technique to multiple GPUs?

Your model is fairly large (~1.5 GB). With 8 GPUs doing data-parallel training, my suspicion is that you are saturating the PCIe bandwidth during optimization. What batch size are you running? You can measure how much bandwidth is being used using

The batch size is 256 currently.
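The model size estimate mentioned above can be checked from the layer sizes given in the original post. This is a minimal sketch that assumes float32 parameters and one bias vector per fully connected layer; neither detail is stated in the thread.

```python
# Layer sizes from the original post: 250k inputs, 1024 and 512 hidden units, 200k outputs.
layer_sizes = [250_000, 1024, 512, 200_000]
bytes_per_param = 4  # assumption: float32 parameters

# Weights plus biases for each fully connected layer.
num_params = sum(
    fan_in * fan_out + fan_out
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
)
model_bytes = num_params * bytes_per_param

print(num_params)
print(round(model_bytes / 1e9, 2))  # size in decimal gigabytes
```

This gives about 359 million parameters, or roughly 1.44 GB. In data-parallel training each GPU produces a gradient of the same size for every batch, so the volume of data to synchronize per step is on the order of the model size times the number of GPUs.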

Wrap a PrefetchingIter around your current iterator.
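The prefetching idea suggested in this thread can be sketched without MXNet: a background thread prepares the next batches while the training loop consumes the current one. The code below is a plain-Python illustration of the technique, not the implementation of MXNet's PrefetchingIter. The names `PrefetchIterator` and `make_batches` are hypothetical, and error handling in the worker thread is omitted.

```python
import queue
import threading


class PrefetchIterator:
    """Wrap an iterable and prepare up to `depth` items ahead in a background thread."""

    _END = object()  # sentinel that marks the end of the source

    def __init__(self, source, depth=2):
        self._queue = queue.Queue(maxsize=depth)
        worker = threading.Thread(target=self._fill, args=(source,), daemon=True)
        worker.start()

    def _fill(self, source):
        for item in source:
            self._queue.put(item)  # blocks while the buffer is full
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the next item is ready
        if item is self._END:
            self._queue.put(self._END)  # keep the iterator exhausted on later calls
            raise StopIteration
        return item


def make_batches(num_batches, batch_size):
    """Hypothetical stand-in for the expensive per-batch data preparation."""
    for i in range(num_batches):
        yield list(range(i * batch_size, (i + 1) * batch_size))


# The consumer sees the same batches in the same order, but preparation
# of the next batch overlaps with whatever the loop body is doing.
batches = list(PrefetchIterator(make_batches(4, 3)))
print(batches)
```

In a real training loop the body of the `for` loop would run forward/backward/update on each batch. How much time is actually saved depends on whether the preparation step can run concurrently with the compute, for example when it is dominated by I/O or by library calls that release the Python GIL.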

@zhreshold I will give it a shot, thanks. Meanwhile, is there an easy way to do model parallelism for this simple model? I know there is an LSTM tutorial, but it’s not easy to adapt that to my Module-based case.

Do you mind sharing your raw profiler output after using mx.gpu() for NDArrayIter? What is the format of the data set you’re using?

I’d like to share data collected by the MXNet profiler for the training job. As mentioned before, the input/output dimensions are 250K/200K with two hidden layers. The batch size is 128 and the input is quite sparse. The optimizer is AdaMax.

There were three experiments: 1) run the training job for 500 batches on a single GPU; 2) run it on 2 GPUs for 500 batches with kvstore set to “device”; 3) run it on 4 GPUs for 500 batches. For all of these jobs, a PrefetchingIter and a customized iterator were used to feed batches into the module.

The single-GPU job ran for about 263 sec, the 2-GPU experiment took 297 sec, and the 4-GPU job ran for 458 sec. In the tracing UI, the time spent on the different tasks for the single-GPU job is listed below:

WaitForVar 25.475 ms
CopyCPU2GPU 69.975 ms
SyncCopyCPU2GPU 38,354.100 ms
CopyGPU2GPU 36,198.746 ms
SyncCopyGPU2CPU 54,909.424 ms
DeleteVariable 5.367 ms
[FullyConnected,Activation] 24,724.460 ms
_zeros 20.251 ms
[_backward_Activation] 1,496.502 ms
_backward_FullyConnected 15,115.844 ms
adam_update 75,968.027 ms
sum 1,341.661 ms
Totals 248,229.832 ms

Communication (CPU2GPU/GPU2GPU) takes up about 50% of the total time. For the 2-GPU experiment, the times for the different tasks are:

WaitForVar 42.113 ms
CopyGPU2GPU 130,333.624 ms
SyncCopyCPU2GPU 41,525.572 ms
DeleteVariable 7.273 ms
CopyCPU2GPU 207.737 ms
SyncCopyGPU2CPU 54,367.060 ms
[_backward_Activation] 7,313.474 ms
_backward_FullyConnected 11,671.674 ms
_zeros 889.516 ms
adam_update 34,903.490 ms
[FullyConnected,Activation] 38,184.385 ms
ElementwiseSum 11,007.014 ms
sum 6,335.426 ms
Totals 336,788.358 ms

Communication took about 67% of the total time. For the 4-GPU job, the times for the different tasks are:

WaitForVar 56.862 ms
KVStoreReduce 135,786.403 ms
DeleteVariable 16.684 ms
[_backward_Activation] 70,779.204 ms
_backward_FullyConnected 49,550.718 ms
adam_update 229,746.453 ms
[FullyConnected,Activation] 74,044.214 ms
sum 848.431 ms
SyncCopyGPU2CPU 42,517.441 ms
SyncCopyCPU2GPU 50,147.506 ms
CopyGPU2CPU 515,830.024 ms
CopyCPU2GPU 665,246.646 ms
CopyGPU2GPU 18,461.143 ms
Totals 1,853,031.729 ms

Communication costs even more here, about 70% of the total time.
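The communication shares quoted above can be recomputed from the profiler tables. The sketch below counts only the rows whose names contain "Copy" as communication; that grouping is an assumption about how the percentages were derived, chosen because it reproduces the quoted figures.

```python
# Copy-related operator timings (ms) taken from the three profiler tables above.
copy_ms = {
    "1 GPU": [69.975, 38354.100, 36198.746, 54909.424],
    "2 GPUs": [130333.624, 41525.572, 207.737, 54367.060],
    "4 GPUs": [42517.441, 50147.506, 515830.024, 665246.646, 18461.143],
}
# "Totals" rows from the same tables.
total_ms = {"1 GPU": 248229.832, "2 GPUs": 336788.358, "4 GPUs": 1853031.729}

copy_share = {name: sum(copy_ms[name]) / total_ms[name] for name in copy_ms}
for name, share in copy_share.items():
    print(f"{name}: {share:.1%}")
```

This yields roughly 52%, 67%, and 70%, in line with the figures in the post. If KVStoreReduce (135,786.403 ms) is also counted as communication in the 4-GPU run, the share rises to about 77%.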

I ran into the same issue. It seems that with 8 GPUs, the cost of CPU/GPU communication becomes the bottleneck for speeding up. Does anyone have a solution? Can we increase the number of parameter servers (as in distributed training)? Or can we get more CPUs involved?
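To quantify the slowdown discussed in this thread, the wall-clock times reported for the 500-batch experiments can be converted into throughput. This sketch assumes the batch size of 128 is the total batch size across all GPUs in every experiment, which the post does not state explicitly.

```python
batch_size = 128    # assumed to be the global batch size in all three runs
num_batches = 500
wall_seconds = {1: 263, 2: 297, 4: 458}  # number of GPUs -> reported run time (s)

samples = batch_size * num_batches
# Samples processed per second for each GPU count.
throughput = {gpus: samples / secs for gpus, secs in wall_seconds.items()}
# Throughput relative to the single-GPU run (1.0 means no change).
relative = {gpus: throughput[gpus] / throughput[1] for gpus in throughput}

for gpus in sorted(throughput):
    print(gpus, round(throughput[gpus], 1), round(relative[gpus], 2))
```

Under this assumption, the 2-GPU run reaches about 89% of the single-GPU throughput and the 4-GPU run about 57%, so adding GPUs made training slower in these experiments rather than faster.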