Distributed Gluon HybridBlock is much much slower than Symbol

dmadeka · December 19, 2017, 1:28am

This is an ongoing thread with a few people on MXNet, but distributed Gluon HybridBlock is far slower than keras-mxnet. Benchmarking an MLP, we see a 4x slowdown with Gluon compared with Keras:

An entire epoch with Keras takes about 60s using 8 GPUs:

48641254/48641254 [==============================] - 62s - loss: 0.4282 - val_loss: 0.3570 
Epoch 2/10 
48641254/48641254 [==============================] - 61s - loss: 0.4074 - val_loss: 0.3546 
Epoch 3/10 
48641254/48641254 [==============================] - 61s - loss: 0.4058 - val_loss: 0.3537 
Epoch 4/10 
48641254/48641254 [==============================] - 61s - loss: 0.4048 - val_loss: 0.3533

For Gluon, 1000 batches take about 224s using 8 GPUs and proper hybridization:

Epoch [0]: Interval [0/6000] Train-QLMeanMetric:  Speed: 53.17s
Epoch [0]: Interval [1000/6000] Train-QLMeanMetric:  Speed: 224.43s
Epoch [0]: Interval [2000/6000] Train-QLMeanMetric:  Speed: 227.38s

Before we get into code, is there any reference implementation for a distributed HybridBlock other than this? We’ve tried a bunch of things, including inheriting the loss from gluon.loss and overriding the hybrid_forward function, but none seem to work.

Is there a source or documentation on how to debug the massive slowdown and find the bottleneck?
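
One generic way to start narrowing this down is to time each stage of the training loop separately (data loading, forward/backward, parameter update) and see which one dominates. The sketch below is a minimal, framework-agnostic helper using only the Python standard library; `StageTimer` and the stage names are hypothetical and not part of MXNet. Because MXNet executes operations asynchronously, wall-clock timings only mean something if you block until pending work has finished, so the helper accepts an optional `sync` callable (for example `mx.nd.waitall`) that runs before the clock is read.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self, sync=None):
        # `sync` is called before the clock is read at the end of a stage.
        # With MXNet one would pass mx.nd.waitall so that asynchronous
        # work is finished; by default it does nothing.
        self.sync = sync if sync is not None else (lambda: None)
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.sync()
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (name, seconds, fraction_of_total) rows, slowest first."""
        total = sum(self.totals.values())
        rows = [
            (name, secs, secs / total if total > 0 else 0.0)
            for name, secs in self.totals.items()
        ]
        return sorted(rows, key=lambda row: row[1], reverse=True)


# Usage inside a training loop (stage names are only illustrative):
#
#   timer = StageTimer(sync=mx.nd.waitall)
#   with timer.stage("data"):
#       ...  # fetch and split the batch across devices
#   with timer.stage("forward_backward"):
#       ...  # forward pass, loss, backward pass
#   with timer.stage("update"):
#       ...  # optimizer step
#   for name, secs, frac in timer.report():
#       print(f"{name}: {secs:.2f}s ({frac:.0%})")
```

If the data stage dominates, the model and hybridization are probably not the problem; if forward/backward dominates, the comparison with Keras is worth repeating with identical batch sizes and device counts before digging further.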

cbarber · December 20, 2017, 8:46pm

Did you call hybridize?

dmadeka · December 20, 2017, 9:21pm

net.hybridize()
loss.hybridize()
