How to speed up training of a neural network model with MXNet?

mg0880gm · October 11, 2017, 10:06pm

Hi,

I want to train a neural network model with MXNet. It has two hidden layers, one with 1024 nodes and the other with 512 nodes. The input layer has 250k nodes and the output layer has 200k. It’s a fully connected network, and I use the following pseudocode for training:

import mxnet as mx

net = mx.sym.load(model_file)
ctx = [mx.gpu(i) for i in range(8)]
model = mx.mod.Module(
    symbol=net,
    context=ctx,
    data_names=['data'],
    label_names=['label'],
)
# load train input data
# load train output data
for each epoch:
    for each batch:
        # prepare the train input/output data for the current batch
        train_iter = mx.io.NDArrayIter(train_in, train_out, batch_size)
        for batch in train_iter:
            model.forward(batch, is_train=True)
            model.backward()
            model.update()

The job runs on an 8-GPU host. There are about 2M training samples and the batch size is 256. A single epoch takes more than 3 hours. Profiling shows that nearly half of the time is spent preparing train_iter for the current batch, while the other half goes to the model forward/backward/update.

Besides running this on multiple hosts, are there any other suggestions for speeding up training on a single host? Increase the batch size? Compile MXNet with NNPACK? Set the KVStore to device? I'd really appreciate any advice.
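
For scale, the figures in this post imply a per-batch cost of well over a second. The following is a back-of-the-envelope sketch only: it assumes exactly 2,000,000 samples and exactly 3 hours per epoch, whereas the post says "about 2M" and "more than 3 hours", so the result is a rough lower bound.

```python
import math

num_samples = 2_000_000    # assumption: "about 2M" read as exactly 2,000,000
batch_size = 256           # from the post
epoch_seconds = 3 * 3600   # assumption: "more than 3 hours" read as a 3-hour lower bound

# Number of forward/backward/update steps in one epoch.
batches_per_epoch = math.ceil(num_samples / batch_size)

# Average wall-clock time per batch, and the part attributed to data preparation
# (the post reports "nearly half" of the time going to building train_iter).
seconds_per_batch = epoch_seconds / batches_per_epoch
data_prep_seconds = seconds_per_batch / 2

print(f"batches per epoch: {batches_per_epoch}")
print(f"seconds per batch: {seconds_per_batch:.2f}")
print(f"of which data preparation: {data_prep_seconds:.2f}")
```

Under these assumptions, each batch of 256 samples takes roughly 1.4 seconds, with about 0.7 seconds of that going to iterator preparation alone.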

madjam · October 11, 2017, 10:15pm

Is that the actual code you are running?

mg0880gm · October 11, 2017, 10:27pm

Almost. For each batch, the actual logic is:

get the next 256 train input samples and store them in train_batch_input;
get the next 256 train output samples and store them in train_batch_output;
tr_iter = mx.io.NDArrayIter(train_batch_input, train_batch_output, 256)
go on with the model forward/backward/update

zhreshold · October 12, 2017, 7:11am

Using a PrefetchingIter would significantly speed up the data loading.

mg0880gm · October 12, 2017, 8:04am

@zhreshold Could you share some details?

mg0880gm · October 12, 2017, 8:05am

It looks like creating the NDArrayIter with a context of mx.gpu() could help improve the speed. But is it possible to apply this technique to multiple GPUs?

madjam · October 12, 2017, 6:13pm

Your model is fairly large (~1.5 GB). With 8 GPUs doing data-parallel training, my suspicion is that you are saturating the PCIe bandwidth during optimization. What batch size are you running? You can measure how much bandwidth is being used with https://github.com/apache/incubator-mxnet/tree/master/tools/bandwidth
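
The size estimate can be checked from the layer widths given in the first post. This is a rough sketch that assumes float32 parameters, reads "250k" and "200k" as 250,000 and 200,000, and assumes every fully connected layer has a bias; none of these details are confirmed in the thread.

```python
# Layer widths from the original post: input, two hidden layers, output.
layer_sizes = [250_000, 1024, 512, 200_000]
bytes_per_param = 4  # assumption: parameters are stored as float32

# Each fully connected layer has an (n_in x n_out) weight matrix and n_out biases.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:])
total_params = weights + biases

size_gb = total_params * bytes_per_param / 1e9

print(f"parameters: {total_params:,}")
print(f"parameter size: {size_gb:.2f} GB")
```

This gives about 359 million parameters, or roughly 1.44 GB, which is in line with the ~1.5 GB figure. In data-parallel training, gradients of the same total size have to be aggregated across devices on every batch, which is why the bandwidth question matters.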

mg0880gm · October 12, 2017, 7:33pm

The batch size is 256 currently.

zhreshold · October 12, 2017, 8:56pm

Wrap an mx.io.PrefetchingIter around your current iterator.
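
The idea behind a prefetching iterator is to prepare the next batch in the background while the current one is being consumed. The following is a generic standard-library sketch of that idea, not MXNet's implementation; `prefetch` and `slow_batches` are hypothetical names used only for illustration. In MXNet itself, the wrapping suggested here would be along the lines of `mx.io.PrefetchingIter(base_iter)`, where `base_iter` stands for an existing data iterator.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from `batches` while a background thread prepares the next ones.

    `depth` bounds how many prepared batches may wait in the queue.
    Illustrative sketch only: errors raised in the worker are not propagated.
    """
    q = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the underlying iterable

    def worker():
        for item in batches:
            q.put(item)  # blocks while the queue is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item


def slow_batches(n):
    # Hypothetical stand-in for expensive per-batch data preparation.
    for i in range(n):
        yield [i] * 4


# The consumer sees the same batches in the same order; only the timing changes.
result = list(prefetch(slow_batches(3)))
print(result)
```

Prefetching hides data-preparation time behind computation but does not reduce it, so it helps most when preparation and training take comparable time per batch.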

mg0880gm · October 12, 2017, 9:45pm

@zhreshold I will give it a shot, thanks. Meanwhile, is there an easy way to do model parallelism for this simple model? I know there is an LSTM tutorial, but it’s not easy to adapt it to my Module-based case.

eric-haibin-lin · October 13, 2017, 4:53pm

Do you mind sharing your raw profiler output after using mx.gpu() for NDArrayIter? What is the format of the data set you’re using?

mg0880gm · October 18, 2017, 6:11am

I’d like to share data collected by the MXNet profiler for the training job. As mentioned before, the input/output dimensions are 250K/200K with two hidden layers. The batch size is 128 and the input is quite sparse. The optimizer is AdaMax.

There were three experiments: 1) a single-GPU run for 500 batches; 2) a 2-GPU run for 500 batches with kvstore set to “device”; 3) a 4-GPU run for 500 batches. For these jobs, PrefetchingIter and a customized iterator are used to feed batches into the module.

The single-GPU run took about 263 sec, the 2-GPU run 297 sec, and the 4-GPU run 458 sec. The tracing UI shows the time spent on different tasks for the single-GPU job as listed below:

WaitForVar 25.475 ms
CopyCPU2GPU 69.975 ms
SyncCopyCPU2GPU 38,354.100 ms
CopyGPU2GPU 36,198.746 ms
SyncCopyGPU2CPU 54,909.424 ms
DeleteVariable 5.367 ms
[FullyConnected,Activation] 24,724.460 ms
_zeros 20.251 ms
[_backward_Activation] 1,496.502 ms
_backward_FullyConnected 15,115.844 ms
adam_update 75,968.027 ms
sum 1,341.661 ms
Totals 248,229.832 ms

Communication (CPU2GPU/GPU2GPU) accounts for about 50% of the total time. For the 2-GPU experiment, the times for the different tasks are:

WaitForVar 42.113 ms
CopyGPU2GPU 130,333.624 ms
SyncCopyCPU2GPU 41,525.572 ms
DeleteVariable 7.273 ms
CopyCPU2GPU 207.737 ms
SyncCopyGPU2CPU 54,367.060 ms
[_backward_Activation] 7,313.474 ms
_backward_FullyConnected 11,671.674 ms
_zeros 889.516 ms
adam_update 34,903.490 ms
[FullyConnected,Activation] 38,184.385 ms
ElementwiseSum 11,007.014 ms
sum 6,335.426 ms
Totals 336,788.358 ms

Communication took about 67% of the total time. For the 4-GPU job, the times for the different tasks are:

WaitForVar 56.862 ms
KVStoreReduce 135,786.403 ms
DeleteVariable 16.684 ms
[_backward_Activation] 70,779.204 ms
_backward_FullyConnected 49,550.718 ms
adam_update 229,746.453 ms
[FullyConnected,Activation] 74,044.214 ms
sum 848.431 ms
SyncCopyGPU2CPU 42,517.441 ms
SyncCopyCPU2GPU 50,147.506 ms
CopyGPU2CPU 515,830.024 ms
CopyCPU2GPU 665,246.646 ms
CopyGPU2GPU 18,461.143 ms
Totals 1,853,031.729 ms

Communication costs even more here, about 70% of the total time.
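
The communication percentages and the overall throughput can be recomputed from the numbers in this post. This sketch assumes "communication" means only the operations with "Copy" in their name (KVStoreReduce is left out), which is a guess at how the percentages were derived. It uses the batch size of 128 and the 500 batches per run stated in the post.

```python
# Copy-operation timings (ms) and totals transcribed from the three profiler tables.
# Assumption: only *Copy* operations count as communication.
profiles = {
    "1 GPU": {
        "copies": [69.975, 38354.100, 36198.746, 54909.424],
        "total": 248229.832,
    },
    "2 GPUs": {
        "copies": [130333.624, 41525.572, 207.737, 54367.060],
        "total": 336788.358,
    },
    "4 GPUs": {
        "copies": [42517.441, 50147.506, 515830.024, 665246.646, 18461.143],
        "total": 1853031.729,
    },
}

# Share of profiled time spent in copy operations, in percent.
shares = {
    name: 100 * sum(p["copies"]) / p["total"] for name, p in profiles.items()
}

# Wall-clock seconds per run and the resulting samples per second.
wall_seconds = {"1 GPU": 263, "2 GPUs": 297, "4 GPUs": 458}
samples_per_run = 500 * 128  # 500 batches of 128 samples
throughput = {name: samples_per_run / s for name, s in wall_seconds.items()}

for name in profiles:
    print(f"{name}: copies {shares[name]:.1f}% of profiled time, "
          f"{throughput[name]:.1f} samples/sec")
```

Under that assumption the copy share comes out at roughly 52%, 67%, and 70%, matching the figures quoted in the post, while throughput falls from about 243 to about 140 samples per second as GPUs are added. Counting KVStoreReduce as communication would raise the 4-GPU share to about 77%.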

GoodJoey · August 10, 2018, 6:41am

I met the same issue. It seems that when using 8 GPUs, the cost of CPU/GPU communication becomes the barrier to speeding up. Does anyone have a solution? Can we increase the number of parameter servers (like in distributed training)? Or can we get more CPUs involved?
