This is part of an ongoing thread with a few people on MXNet: distributed Gluon HybridBlock is way slower than keras-mxnet. Benchmarking an MLP, we see a 4x slowdown with Gluon compared to Keras:

A full epoch with Keras takes about 60s using 8 GPUs:

```
48641254/48641254 [==============================] - 62s - loss: 0.4282 - val_loss: 0.3570
Epoch 2/10
48641254/48641254 [==============================] - 61s - loss: 0.4074 - val_loss: 0.3546
Epoch 3/10
48641254/48641254 [==============================] - 61s - loss: 0.4058 - val_loss: 0.3537
Epoch 4/10
48641254/48641254 [==============================] - 61s - loss: 0.4048 - val_loss: 0.3533
```

For Gluon, 1000 batches take about 224s using 8 GPUs and proper hybridization:

```
Epoch [0]: Interval [0/6000] Train-QLMeanMetric: Speed: 53.17s
Epoch [0]: Interval [1000/6000] Train-QLMeanMetric: Speed: 224.43s
Epoch [0]: Interval [2000/6000] Train-QLMeanMetric: Speed: 227.38s
```

Before we get into code, is there any reference implementation for distributed HybridBlock other than this? We've tried a bunch of things, but none seem to work, including inheriting the loss from `gluon.loss` and overriding the `hybrid_forward` function.

Is there a source or documentation on how to debug the massive slowdown and find the bottleneck?