Hybrid training speed is 20% slower than pytorch

Hi, everyone. I have recently been trying to switch from PyTorch to MXNet because of the hybrid feature, so I wrote a benchmark on CIFAR-10: pytorch-mxnet-benchmarks.

However, the result is strange: MXNet with hybridize is slower than PyTorch. I checked the data loaders, and MXNet’s is slightly faster. I also tested inference time by generating random input tensors, and MXNet is about 2x faster than PyTorch.

Which part of my code slows down the training speed?

Hi @SunDoge,

Very strange that you’re seeing a reduction in performance after hybridization! Would have expected a speed up. Good case for using the MXNet profiler here. I’ll try profiling your code and see if there’s anything obvious.

So it looks like the issue is related to the DataLoader.

After profiling the code I spotted large intermittent gaps in the processing of the batches. And according to the profiler, the backend (C++) wasn’t doing anything during those gaps, which indicates commands aren’t being queued fast enough by the frontend (Python). Usual cause of this is slow data loading or processing.
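For anyone who wants a quick check before reaching for the full profiler, here is a minimal sketch of the same idea: split one epoch into time spent waiting for the next batch and time spent in the training step. The names `time_training_loop`, `loader` and `train_step` are hypothetical and not taken from the benchmark code. Since MXNet executes asynchronously, this assumes `train_step` blocks until the work is done (for example by calling `mx.nd.waitall()` or `loss.asnumpy()`), otherwise the compute time will be under-reported.

```python
import time


def time_training_loop(loader, train_step):
    """Split one pass over `loader` into data-wait time and compute time."""
    data_wait = 0.0
    compute = 0.0
    batches = 0
    iterator = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is the frontend waiting on data
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # assumed to block until the step has finished
        t2 = time.perf_counter()
        data_wait += t1 - t0
        compute += t2 - t1
        batches += 1
    return {"batches": batches, "data_wait_s": data_wait, "compute_s": compute}
```

If `data_wait_s` is a large share of the total, the data pipeline is the likely bottleneck rather than the network.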

I was able to speed up training significantly by increasing num_workers to 8. And we get back to the usual situation where hybridization improves training speed!

Still, there’s a very strange bug occurring with num_workers=2 which could explain what was happening before. I think it’s related to https://github.com/apache/incubator-mxnet/issues/13126 and https://github.com/apache/incubator-mxnet/pull/13318. Adding hybridization made the network faster, but this put extra strain on the dataloader, which led to a multiprocessing clash somewhere along the way, thus making it slower than without hybridization.

My results of running your code on AWS EC2 p3.2xlarge (time of 1st epoch):

| Network        | num_workers | time  |
|----------------|-------------|-------|
| Non-Hybridized | 0           | 19.59 |
| Hybridized     | 0           | 18.26 |
| Non-Hybridized | 2           | 9.76  |
| Hybridized     | 2           | 13.92 |
| Non-Hybridized | 8           | 8.90  |
| Hybridized     | 8           | 7.25  |

So overall you should be able to get around a 2x speedup with num_workers=8 compared to num_workers=2 for the hybridized network.
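To find a good num_workers on your own machine, a small sweep along these lines might help. This is only a sketch: `sweep_num_workers` and `make_loader` are hypothetical names, and it times a bare pass over the loader (no training), so it isolates data loading rather than reproducing the epoch times in the table.

```python
import time


def sweep_num_workers(make_loader, worker_counts):
    """Time one full pass over a loader for each worker count.

    `make_loader` is a caller-supplied factory: it takes a worker count
    and returns an iterable of batches.
    """
    results = {}
    for n in worker_counts:
        loader = make_loader(n)
        start = time.perf_counter()
        batches = 0
        for _ in loader:
            batches += 1
        results[n] = (batches, time.perf_counter() - start)
    return results


# Hypothetical usage with Gluon (dataset and batch size are placeholders):
# from mxnet import gluon
# results = sweep_num_workers(
#     lambda n: gluon.data.DataLoader(dataset, batch_size=128,
#                                     shuffle=True, num_workers=n),
#     [0, 2, 8])
```

Each entry in the result maps a worker count to `(number_of_batches, seconds)`.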


I tested it with num_workers=8 yesterday.

pytorch=1551s
mxnet-hybridize=1830s

It’s much faster, but still slower than PyTorch. I’m going to review the source code of both PyTorch’s and MXNet’s DataLoader tonight.

Hi, @thomelane,

PyTorch 1.0 has modified its dataloader.py to use some C/C++ code, which is not shown on the master branch, so my benchmark is not really an equivalent comparison :upside_down_face:.

I think it is possible to use the PyTorch DataLoader to load data as ndarray if I use a custom collate_fn. I hope MXNet can adopt the same strategy to optimize the data-loading process.
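A rough sketch of what such a collate_fn could look like, assuming each sample is an `(image, label)` pair where the image is array-like (a NumPy array or a CPU tensor). The name `numpy_collate` is hypothetical. It produces NumPy arrays; turning those into MXNet NDArrays (e.g. with `mx.nd.array(data)`) would be a further step that is not shown or measured here.

```python
import numpy as np


def numpy_collate(batch):
    """Collate a list of (image, label) samples into two NumPy arrays."""
    images, labels = zip(*batch)
    # Stack along a new leading batch axis: (N, C, H, W) for CHW images.
    data = np.stack([np.asarray(img) for img in images])
    return data, np.asarray(labels)


# Hypothetical usage (dataset and batch size are placeholders):
# import torch
# loader = torch.utils.data.DataLoader(dataset, batch_size=128,
#                                      num_workers=8,
#                                      collate_fn=numpy_collate)
```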

Just a couple more suggestions.

  1. Check out mx.io.ImageRecordIter()

MXNet also has C++ optimised methods of loading data, called DataIterators. Although they were designed for the MXNet Module API and not Gluon, you can still use them for Gluon training. mx.io.ImageRecordIter() also loads data stored in a more optimized format called RecordIO, and performs augmentation steps in C++ too.

When I benchmarked the performance of Gluon training on CIFAR-10 I used this method, and you can see an example here. Schema has been converted to Gluon DataLoader format.

  2. Give the nightly build of MXNet a try

It looks like some of the issues regarding DataLoader have been fixed recently, and will be in v1.4.
Use pip install mxnet-cu92 --pre (if you’re using CUDA 9.2).
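On the first suggestion, here is a minimal sketch of how a DataIter could feed a Gluon-style training loop. The helper name `gluon_batches` is hypothetical and this is not the code from the linked example. It assumes a single-input, single-label iterator, so only index 0 of `batch.data` and `batch.label` is used; the `.rec` path and augmentation settings in the usage comment are placeholders.

```python
def gluon_batches(data_iter):
    """Yield (data, label) pairs from an MXNet DataIter-style object."""
    data_iter.reset()  # DataIters must be rewound at the start of each epoch
    for batch in data_iter:
        # A DataBatch holds lists of arrays; single-input networks use index 0.
        yield batch.data[0], batch.label[0]


# Hypothetical usage; file path and settings are placeholders:
# import mxnet as mx
# train_iter = mx.io.ImageRecordIter(
#     path_imgrec="cifar10_train.rec", data_shape=(3, 32, 32),
#     batch_size=128, shuffle=True, rand_crop=True, rand_mirror=True)
# for data, label in gluon_batches(train_iter):
#     ...  # forward/backward under autograd.record(), then trainer.step(...)
```

When training on GPU you would still need to move each array to the right context (e.g. `data.as_in_context(ctx)`) before the forward pass.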