MXNet 1.3.1: speed/performance difference of at least a factor of 2 between the Gluon and Module/Symbol APIs

My first experiments with MXNet show a speed difference of at least a factor of 2 (in some models even 4-5) between the Module/Symbol API (which is faster) and the Gluon API (which is slower).

I am still very new to MXNet, and it is quite likely that my approach contains fundamental flaws that explain the observed differences. I cannot see how, though, as I have taken the code from the tutorials and API documentation on the MXNet website.

While I first noticed the problem with mxnet-cu90mkl 1.3.1, I am able to reproduce it with a plain mxnet 1.3.1 installation, purely on CPU, with a very simple 3-layer MLP architecture.

I’ve created a github repository with jupyter notebook(s) to show what I have done:

I’ve also provided the conda environment with exact versions to reproduce what I am seeing.

You can also go directly to the jupyter notebook here:

My question is: is this speed difference expected, or am I doing something fundamentally wrong?

I’ve added two more notebooks to the repository:

  • tensorflow-keras-speed.ipynb
  • mxnet-keras-speed.ipynb

Both of them run in the order of 34 seconds, i.e. in the range of neither MXNet Module/Symbol nor MXNet Gluon?? This is puzzling me even more.

While trying to find an answer for you, I found something even more confusing for both of us. Sorry, this won't really answer your query, but still…

After checking out this repository I am a bit confused; check this out.
It's the CNN implemented using the Gluon API:

which takes 37 seconds to train.

And below is the same model, but built using the Symbol API:

which takes 48 seconds to train.

Holy smokes, why and how is Gluon faster than the Symbol API here?? In all my personal tests I found the Symbol API almost 30% faster than Gluon's hybridized model.
Here’s the link for that test:

Just don’t know what’s going on.
I am now feeling the most confused I have ever been in my life.

Thank you very much for pointing me to these resources. The two notebooks are not really comparable, because the number of GPUs and the CUDA version differ. I am not sure whether that is the relevant factor, but I'll try to run the notebooks on my machine on pure CPU to have a comparison baseline.

My main question is actually whether I should expect (i.e. is it normal) a speed difference of a factor of 2 (or even higher, sometimes closer to 5) from the Gluon API (in my example I even hybridized the model), or whether I did something fundamentally wrong that I did not see. A speed difference of 30%, as in the Medium article above, would look reasonable. But full integer factors look to me more like I did something fundamentally wrong.

I have read your code, and I think everything is fine, though I have found one minor (or maybe major) cause for this massive performance difference.

You are using different data loaders in the Gluon code and in the symbolic code. I understand that you are just trying to compare pure Gluon code with pure symbolic code, but I still think that is one factor that might affect performance. Try using the same data loader for both models.

From my personal experience I would say that everything is fine in your results, and this performance difference is expected.
One very important reason is that in symbolic programming the MXNet layers like FullyConnected and all the others are written purely in C++, without any Python code, while Gluon is an API that uses those same layers but is built on top of Python, and as you know Python is far slower than C++; hence the results.

One more thing you can do to reduce this performance difference is to use static_shape=True and static_alloc=True when hybridizing, i.e. hybridize(static_shape=True, static_alloc=True). I hope this helps.

And sorry for my wrong comparison of Gluon and Symbol performance in my previous reply; as you have pointed out, the CUDA version and number of GPUs differ, so… we are even. Cool.

Thanks a lot for your hints and feedback! With your help I was able to resolve the problem, and I now achieve an even better timing with the Gluon API than with the Module/Symbol API: 20.17 seconds! I've updated the repository with the new results, too.

Two changes were necessary:

  • Use hybridize(static_shape=True, static_alloc=True); this improved the speed to roughly 37 seconds (i.e. 10 seconds better)

  • Use a DataIterLoader based on the NDArrayIter, as described here in the appendix, rather than the Gluon DataLoader. This improved the speed to finally 20.17 seconds. I have no idea why the DataLoader is so much worse than the NDArrayIter.

Well I’m surprised that helped, anyways…

I think Symbol is still faster than Gluon for a very important reason.
In the Gluon training loop we are only printing the training cost, while in the symbolic training loop we are printing the training cost, eval cost, and time cost at each iteration, and this really does affect performance. Try computing only the training cost in the symbolic code while training, like in Gluon.

I'd just like to know what you do. Are you in college or something?

The Gluon API uses C++-optimised operators too. You're essentially queueing operations from Python for processing by the C++ backend. And with hybridized Gluon models, a symbolic graph gets created just as it would with the Module API. So there's essentially no difference here.

OK, I get your point. Could you please tell me what static_alloc and static_shape are, and how they help speed up Gluon models when we hybridize?

Just to cross-reference for others: I tested this theory using the code in this thread, and it had a minimal effect on training performance after adding the accuracy calculations. So the Gluon API and Module API were the same speed in this test.

So static_alloc=True enables upfront, fixed allocation of memory for the intermediate results of operations, which removes the overhead of allocating memory as you pass through the network. This is clearly more optimal in terms of speed but can use more memory. And static_shape applies additional optimisations for cases where the input shape doesn't change frequently.

@szha might be able to share some more specifics on this.