I have two containers:
- One running Python 2 and MXNet 1.1
- An updated container running Python 3 and MXNet 1.4.1
I have observed some significant performance regressions in the py3-MXNet 1.4.1 container, which is built with MKLDNN enabled.
I am using code at this repo as a ‘minimal reproducible example’: https://github.com/opringle/multivariate_time_series_forecasting
I used the profiler to capture the second training batch in each container, like this:
```python
import time

import mxnet as mx
from mxnet import profiler

i = 0
for batch in train_iter:
    start_time = time.time()
    if i == 1:
        profiler.set_state('run')
    module.forward(batch, is_train=True)
    module.backward()
    mx.nd.waitall()
    if i == 1:
        profiler.set_state('stop')
        profiler.dump()
    i += 1  # advance the counter so batch 1 is the one profiled
```
This is the profiler output when sorted by total op time for the py2-1.1 container:
Same for the py3-1.4.1 container (uploaded as a reply due to the new-user attachment restriction):
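To compare the two dumps numerically rather than by eye, the `profile.json` files can be re-aggregated by total op time. This is a minimal sketch assuming the dump is Chrome-trace JSON with either complete (`"ph": "X"`) events carrying a `dur` field or begin/end (`"B"`/`"E"`) pairs; the exact event shape may differ between MXNet versions, so the filter may need adjusting.

```python
import json
from collections import defaultdict


def total_op_time(profile_path):
    """Sum durations per op name from a Chrome-trace profiler dump.

    Assumes 'X' (complete) events carry a 'dur' in microseconds, and
    that 'B'/'E' (begin/end) pairs share a name within one pid/tid
    stream; the shape may vary across MXNet versions.
    """
    with open(profile_path) as f:
        events = json.load(f).get("traceEvents", [])
    totals = defaultdict(float)
    open_events = {}  # (pid, tid, name) -> begin timestamp
    for ev in events:
        name = ev.get("name", "")
        key = (ev.get("pid"), ev.get("tid"), name)
        ph = ev.get("ph")
        if ph == "X":
            totals[name] += ev.get("dur", 0)
        elif ph == "B":
            open_events[key] = ev["ts"]
        elif ph == "E" and key in open_events:
            totals[name] += ev["ts"] - open_events.pop(key)
    # Sort descending by total time, matching the tables above
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Running it on both dumps and diffing the top entries makes the per-op regression explicit.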
Some ops, like backward_Convolution, are significantly slower. My machine's CPU is a 6-core Intel i7.
Does anyone know whether this is operator-specific, or a way to determine whether it is? Could this be related to MKL-DNN?
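One way to test whether MKL-DNN is implicated is to disable it at runtime and re-profile the same batch. This is a sketch assuming the py3-1.4.1 build honors the `MXNET_MKLDNN_ENABLED` flag (it may not be respected in every build, in which case a non-MKL build is needed for the comparison); pinning OpenMP to the physical core count is also worth trying, since threading defaults can oversubscribe a 6-core desktop part.

```shell
# Assumption: this MXNet 1.4.1 build honors the runtime flag;
# otherwise rebuild without MKL-DNN for the comparison.
export MXNET_MKLDNN_ENABLED=0
# Pin OpenMP to the 6 physical cores to rule out oversubscription.
export OMP_NUM_THREADS=6
# Then re-run the training/profiling script inside the container.
```

If the backward_Convolution time drops back to py2-1.1 levels with the flag off, that would point squarely at the MKL-DNN kernels.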
Due to how our code is currently structured in my org, it’s quite difficult to upgrade to 1.5+.
When I run the same example with the same containers on a machine with an Intel Xeon CPU (c5 instance on AWS), the opposite occurs: the py3-1.4.1 container is much faster per batch than the py2-1.1 container.
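The i7/Xeon flip makes me suspect instruction-set dispatch: MKL-DNN JIT-selects kernels per ISA level, and a c5 Xeon exposes AVX-512 while a 6-core desktop i7 tops out at AVX2, so the two machines exercise different code paths. A quick way to confirm what each machine exposes (Linux-only sketch; the parsing helper here is my own, not part of any library):

```python
def isa_flags(cpuinfo_text):
    """Return the CPU feature flags parsed from /proc/cpuinfo contents."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        flags = isa_flags(f.read())
    # MKL-DNN dispatches different JIT kernels depending on these levels
    print(sorted(x for x in flags if x.startswith(("sse", "avx", "fma"))))
```

Comparing the `avx*` entries between the i7 box and the c5 instance would show whether the two containers are even running the same MKL-DNN kernels.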