Hybrid training speed is 20% slower than pytorch

SunDoge · January 10, 2019, 6:19pm

Hi, everyone. I recently started switching from PyTorch to MXNet because of the hybridization feature, so I wrote a benchmark on CIFAR-10: pytorch-mxnet-benchmarks.

However, the result is strange: MXNet with hybridize is slower than PyTorch. I checked the data loaders, and MXNet's is slightly faster. I also tested inference time by generating random input tensors, and MXNet is about 2x faster than PyTorch.

Which part of my code slows down the training speed?

thomelane · January 11, 2019, 12:28am

Hi @SunDoge,

Very strange that you’re seeing a reduction in performance after hybridization! Would have expected a speed up. Good case for using the MXNet profiler here. I’ll try profiling your code and see if there’s anything obvious.

thomelane · January 11, 2019, 2:05am

So it looks like the issue is related to the DataLoader.

After profiling the code I spotted large intermittent gaps in the processing of the batches. And according to the profiler, the backend (C++) wasn’t doing anything during those gaps, which indicates commands aren’t being queued fast enough by the frontend (Python). Usual cause of this is slow data loading or processing.
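A rough way to check for this kind of gap without the profiler is to time how long each iteration waits for the next batch versus how long the training step takes. The following is a minimal sketch, not the code from the benchmark: `split_epoch_time` and `slow_loader` are hypothetical helper names, and `train_step` stands in for whatever forward/backward/update function you use. Because MXNet executes asynchronously, `train_step` would need to block until the backend finishes (for example by calling `mx.nd.waitall()`) for the split to be meaningful.

```python
import time


def split_epoch_time(batches, train_step):
    """Return (seconds spent waiting for data, seconds spent in train_step)."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading / preprocessing
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # must block until the step is done (see note above)
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_loader(n, delay):
    """Hypothetical stand-in for a data loader that is too slow."""
    for i in range(n):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    wait, compute = split_epoch_time(slow_loader(20, 0.01), lambda batch: None)
    print(f"waiting for data: {wait:.3f}s, training step: {compute:.3f}s")
```

If the waiting time dominates, the frontend is starving the backend and raising `num_workers` (or moving preprocessing elsewhere) is the first thing to try.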

I was able to speed up training significantly by increasing num_workers to 8. And we get back to the usual situation where hybridization improves training speed!

Still, there’s a very strange bug occurring with num_workers=2 which could explain what was happening before. I think it’s related to https://github.com/apache/incubator-mxnet/issues/13126 and https://github.com/apache/incubator-mxnet/pull/13318. Adding hybridization made the network faster, but this put extra strain on the dataloader, which led to a multiprocessing clash somewhere along the way, thus making it slower than without hybridization.

My results of running your code on AWS EC2 p3.2xlarge (time of 1st epoch):

| Network        | num_workers | time  |
|----------------|-------------|-------|
| Non-Hybridized | 0           | 19.59 |
| Hybridized     | 0           | 18.26 |
| Non-Hybridized | 2           | 9.76  |
| Hybridized     | 2           | 13.92 |
| Non-Hybridized | 8           | 8.90  |
| Hybridized     | 8           | 7.25  |

So overall you should be able to get around a 2x speedup with num_workers=8 compared to num_workers=2 for the hybridized network.
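For reference, the ratios behind that estimate can be worked out directly from the table. This is just arithmetic on the first-epoch times above; the dictionary layout and the `speedup` helper are my own naming.

```python
# First-epoch times from the table, keyed by (hybridized, num_workers).
first_epoch_time = {
    (False, 0): 19.59,
    (True, 0): 18.26,
    (False, 2): 9.76,
    (True, 2): 13.92,
    (False, 8): 8.90,
    (True, 8): 7.25,
}


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return first_epoch_time[baseline] / first_epoch_time[candidate]


if __name__ == "__main__":
    # Hybridized network: num_workers=2 -> num_workers=8
    print(round(speedup((True, 2), (True, 8)), 2))   # 1.92
    # Effect of hybridization at num_workers=8
    print(round(speedup((False, 8), (True, 8)), 2))  # 1.23
    # Effect of hybridization at num_workers=2 (below 1 means slower)
    print(round(speedup((False, 2), (True, 2)), 2))  # 0.7
```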

SunDoge · January 11, 2019, 4:23am

I tested it with num_workers=8 yesterday.

pytorch=1551s
mxnet-hybridize=1830s

It’s much faster but still slower than PyTorch. I’m going to review the source code of both PyTorch’s and MXNet’s DataLoader tonight.

SunDoge · January 11, 2019, 9:41am

Hi, @thomelane,

PyTorch 1.0 has modified its dataloader.py to use some C/C++ code, which is not shown on the master branch, so my benchmark is not quite equivalent.

I think it is possible to use the PyTorch DataLoader to load data as ndarray if I use a custom collate_fn. I hope MXNet can adopt the same strategy to optimize the data-loading process.
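A minimal sketch of what such a collate_fn could look like, assuming each sample is an `(image, label)` pair whose image can be converted with `np.asarray` (a NumPy array or a CPU tensor). `numpy_collate` is a hypothetical name, and I have not benchmarked this.

```python
import numpy as np


def numpy_collate(batch):
    """Stack a list of (image, label) samples into two NumPy arrays."""
    images, labels = zip(*batch)
    # Stack along a new leading batch axis: N x (image shape).
    data = np.stack([np.asarray(img, dtype=np.float32) for img in images])
    label = np.asarray(labels, dtype=np.float32)
    return data, label


if __name__ == "__main__":
    # Four fake CIFAR-10-sized samples, for illustration only.
    fake_batch = [(np.zeros((3, 32, 32)), i) for i in range(4)]
    data, label = numpy_collate(fake_batch)
    print(data.shape, label.shape)
```

The idea would be to pass it as `collate_fn=numpy_collate` to `torch.utils.data.DataLoader` and then wrap each returned array with `mx.nd.array(...)` before feeding it to the Gluon network. The extra host-side copy may eat into whatever the faster loader gains, so it is worth measuring.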

thomelane · January 11, 2019, 7:57pm

Just a couple more suggestions.

Check out mx.io.ImageRecordIter()

MXNet also has C++ optimised methods of loading data, called DataIterators. Although they were designed for the MXNet Module API and not Gluon, you can still use them for Gluon training. mx.io.ImageRecordIter() also loads data stored in a more optimized format called RecordIO, and performs augmentation steps in C++ too.

When I tried benchmarking the performance of Gluon training on CIFAR-10 I used this method, and you can see an example here. Schema has been converted to Gluon DataLoader format.

Give nightly build of MXNet a try

It looks like some of the issues regarding DataLoader have been fixed recently, and will be in v1.4.
Use pip install mxnet-cu92 --pre (if you’re using CUDA 9.2).
