Hi all,
I have two containers:
- One running Python 2 and MXNet 1.1
- An updated container running Python 3 and MXNet 1.4.1
I have observed significant performance regressions in the py3-1.4.1 container, which is built with MKL-DNN enabled.
I am using the code in this repo as a ‘minimal reproducible example’: https://github.com/opringle/multivariate_time_series_forecasting
I used the profiler in each container to capture the second training batch, in a manner like this:
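import time
import mxnet as mx
from mxnet import profiler

# `module` and `train_iter` come from the training script in the repo above
i = 0
for batch in train_iter:
    start_time = time.time()
    if i == 1:  # profile only the second batch
        profiler.set_state('run')
    module.forward(batch, is_train=True)
    module.backward()
    mx.nd.waitall()  # block until all queued async ops have finished
    if i == 1:
        profiler.set_state('stop')
        profiler.dump()
    i += 1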
This is the profiler output when sorted by total op time for the py2-1.1 container:
Same for the Py3-1.4.1 container: (uploading as a reply due to new-user restriction)
Some ops, such as backward_Convolution, are significantly slower in the py3-1.4.1 container. My machine's CPU is a 6-core Intel i7.
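To compare the two profiler dumps operator by operator rather than by eye, something like the following rough sketch could help. It assumes the dumped file is in Chrome trace format (a top-level "traceEvents" list with begin/end "B"/"E" events, or complete "X" events carrying a "dur" field); I have not checked that this holds for both MXNet versions. The function names are my own, not part of MXNet.

```python
import json
from collections import defaultdict


def load(path):
    """Read a profiler dump from disk (assumed to be JSON)."""
    with open(path) as f:
        return json.load(f)


def op_totals(trace):
    """Sum the time (microseconds) spent per event name in a Chrome-trace dict."""
    totals = defaultdict(float)
    open_events = {}  # (pid, tid, name) -> stack of begin timestamps
    for ev in trace.get("traceEvents", []):
        name, ph = ev.get("name"), ev.get("ph")
        key = (ev.get("pid"), ev.get("tid"), name)
        if ph == "B":
            open_events.setdefault(key, []).append(ev["ts"])
        elif ph == "E" and open_events.get(key):
            totals[name] += ev["ts"] - open_events[key].pop()
        elif ph == "X":
            totals[name] += ev.get("dur", 0)
    return dict(totals)


def compare(old, new):
    """Return (name, old_us, new_us, new/old) rows, largest slowdown first."""
    rows = []
    for name in set(old) & set(new):
        if old[name] > 0:
            rows.append((name, old[name], new[name], new[name] / old[name]))
    return sorted(rows, key=lambda r: r[3], reverse=True)
```

Usage would be along the lines of compare(op_totals(load("profile_py2.json")), op_totals(load("profile_py3.json"))), where both file names are placeholders. Only operator names present in both dumps are compared, so ops that were renamed between versions will not show up.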
Does anyone know if this is operator-specific, or know a method to determine whether it is? Is this issue related to MKL-DNN somehow?
Other context:
Due to how our code is currently structured in my org, it’s quite difficult to upgrade to 1.5+.
When I run the same example with the same containers on a machine with an Intel Xeon CPU (c5 instance on AWS), the opposite occurs: the py3-1.4.1 container is much faster per batch than the py2-1.1 container.
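One possible factor in the difference between the two machines is the set of SIMD instructions each CPU exposes: c5 instances support AVX-512, whereas whether my i7 does depends on the model. This is only a guess, not something I have confirmed. A minimal sketch for checking it from inside each container, assuming a Linux container where /proc/cpuinfo is readable (simd_flags is a hypothetical helper name):

```python
def simd_flags(cpuinfo_text):
    """Report which of a few SIMD feature flags appear in /proc/cpuinfo text."""
    wanted = ("sse4_2", "avx", "avx2", "avx512f")
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line looks like "flags : fpu vme ... avx2 ..."
            present = set(line.split(":", 1)[1].split())
            return {flag: flag in present for flag in wanted}
    # No flags line found (e.g. a non-x86 or non-Linux environment)
    return {flag: False for flag in wanted}
```

It can be run on each machine with simd_flags(open("/proc/cpuinfo").read()), and the two results compared.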