I ran a quick informal benchmark on my code. Runtimes are in seconds.
Intel® Core™ i7-4700MQ CPU @ 2.40GHz
| installation | inference runtime (s) |
| --- | --- |
| `pip install mxnet-mkl` | 0.8 |
| `pip install mxnet` | 1.3 |
| manual install (atlas, openmp) | 4.2 |
| manual install (atlas, lapack, openmp) | 4.2 |
| manual install (atlas) | 4.4 |
| manual install (openblas) | 10.8 |
manual installation = build from source (master)
pip mxnet = 1.1.0
I am surprised by the order-of-magnitude difference between `pip install mxnet` (which uses openblas) and the manual installation with openblas.

Am I missing some obvious compilation flags? This was meant to be an informal benchmark, but I can put together reproducible code and control for the mxnet version if there is a need for debugging.
Hi @insilico,
I have unfortunately not been able to reproduce your issue:
Here is my benchmark code:
```python
import mxnet as mx
print(mx.__version__, mx.__file__)
import time
from mxnet.gluon.model_zoo import vision

resnet18 = vision.resnet18_v1(pretrained=True)
data = mx.nd.ones((16, 3, 224, 224))

tick = time.time()
for i in range(10):
    resnet18(data).wait_to_read()
print("{0:.4f}".format(time.time() - tick))
```
- mxnet 1.1.0: 10s
- mxnet-mkl 1.1.0: 2s
- mxnet-mkl --pre: 1.2s
- mxnet --pre: 8s
Locally built:
- latest master, `make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_MKLDNN=1`: 1.1s
- latest master, `make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas`: 8s
This is consistent with the pip-installed versions.
I am wondering whether your issue might come from your locally installed openblas?
Hey @ThomasDelteil
Thanks a lot for looking into this. I took your benchmark code and produced the following results:
Ubuntu 17.10 with kernel 4.13.0-39-generic
Intel® Core™ i7-4700MQ CPU @ 2.40GHz
| time | mxnet |
| --- | --- |
| 10.6s | 1.1.0, PyPI mxnet |
| 15.7s | 1.2.0, built with libopenblas-dev (0.2.20) |
| 31.4s | 1.2.0, built with libatlas-base-dev (3.10.3-5) |
Both `libopenblas-dev` and `libatlas-base-dev` come from the Ubuntu repositories, reinstalled fresh for the above benchmark. Considering that I am building mxnet according to the official build instructions, I still find the discrepancy above surprising.

I could try compiling openblas from source to see whether Ubuntu's `libopenblas-dev` is at fault. Any other hints?
Edit 1: I have tried with openblas built from source, which gives results identical to Ubuntu's `libopenblas-dev`.
Edit 2: Is your local openblas built with openmp? Is PyPI mxnet (`libmxnet.so`) statically linked with an openblas that itself is built with openmp?
This is my configuration:

- 32-core Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
- OpenBLAS 0.2.20
- Ubuntu 16.04.3 LTS

`libmxnet.so` comes statically linked with openblas. Openmp seems dynamically linked:

```
> ldd libmxnet.so
...
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f7b35500000)
```
@yizhiliu @szha any ideas?
From @szha, possible suspects:
- the PyPI version comes with debug options off
- OpenBLAS is compiled with the following flags: `DYNAMIC_ARCH=1 NO_SHARED=1 USE_OPENMP=1`
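For anyone trying to reproduce that build locally, those flags would be passed to OpenBLAS's `make` invocation roughly like this (a sketch under the assumption that the wheel uses a stock OpenBLAS checkout; it is not the actual PyPI build script):

```shell
# Sketch: build OpenBLAS the way the PyPI wheel reportedly does.
# DYNAMIC_ARCH=1  -> runtime CPU detection (one binary covers many CPU families)
# NO_SHARED=1     -> build only the static library (for static linking into libmxnet.so)
# USE_OPENMP=1    -> thread via OpenMP rather than OpenBLAS's own pthreads pool
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make DYNAMIC_ARCH=1 NO_SHARED=1 USE_OPENMP=1
```

With `USE_OPENMP=1`, the resulting library shares the OpenMP runtime (and its thread pool) with the rest of mxnet, which is relevant to the threading behaviour discussed below.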
@ThomasDelteil
Thanks for your continued support. Your tips helped me confirm my hypothesis about openmp, and I have resolved the performance differences.
I realized that PyPI mxnet does not respect the environment variables `OMP_NUM_THREADS` or `MXNET_CPU_WORKER_NTHREADS`. On my machine, the benchmark code always runs at CPU@200% with PyPI mxnet.
Mxnet built with `libopenblas-dev` or with a manual installation of openblas respects `OMP_NUM_THREADS` (as it should, per openblas's documentation), and when `OMP_NUM_THREADS` is not defined it uses `$(nproc)` threads (as documented in openblas).
I did not set `OMP_NUM_THREADS`, so the benchmark ran at CPU@800% (8 logical cores on the Intel i7). CPU@800% ran slower than CPU@200%, hence the observed discrepancy above.
Edit: I should note that `OMP_NUM_THREADS=2` makes the manual installation of mxnet as fast as the PyPI version, and `OMP_NUM_THREADS=4` makes it slightly faster on the benchmark code.
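That thread-count sweep can be reproduced by rerunning the benchmark under different values of `OMP_NUM_THREADS`. A sketch (the `python3 -c` one-liner below is a trivial stand-in for the actual benchmark script, just to show that the variable reaches the child process):

```shell
# Sketch: rerun the same command under several OpenMP thread counts.
for n in 2 4 8; do
  OMP_NUM_THREADS=$n python3 -c "import os; print('threads:', os.environ['OMP_NUM_THREADS'])"
done
```

Replacing the one-liner with the timing script from earlier in the thread gives one runtime per thread count, which makes the 2-vs-8-thread discrepancy easy to chart.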
@insilico, no problem. Thanks for sharing your findings! I'll investigate on my side; I find it strange that the PyPI mxnet would not be respecting these environment variables.