Speedup inference / multithread inference Python


I read some comments about multithreaded inference, and generally it is not good news.

How can I use the full power of the CPUs on the edge machine to submit multiple inputs for inference?

Each get_feature (inference) call takes almost 1 second on the Raspberry Pi, so I have to wait 5 seconds to process the 5th input image, while CPU usage is only 15-20%.

As you know, we are limited to 1 GB of RAM. What is the best approach to run multiple inferences at one time? (Python)

MXNet’s engine already runs multithreaded, unless you turned off OpenMP, MKL, or BLAS.

In order to speed up inference, you could have a look at NNPACK: https://github.com/apache/incubator-mxnet/blob/master/docs/faq/nnpack.md. NNPACK is an acceleration package for neural network computations. It can run on x86-64, ARMv7, or ARM64 architecture CPUs and can speed up execution on multi-core CPUs.

You could try the TensorRT runtime integration in MXNet; however, this is currently still an experimental feature: https://cwiki.apache.org/confluence/display/MXNET/How+to+use+MXNet-TensorRT+integration

You could also try to compile your MXNet model with TVM, which can speed up inference: https://docs.tvm.ai/tutorials/nnvm/from_mxnet.html

It is also important to check whether there are performance bottlenecks such as I/O. How do you load your data for inference? Is it stored in a file? If so, which file format are you using?

The input images come from MQTT directly to my Python code on the Raspberry Pi 3B+.

After that I extract features and compare the distances. I am not sure about OpenMP and BLAS; I compiled with BLAS and OpenMP. I will check again.

Feature extraction looks single-threaded. Comparing the distances of the resulting features is fast.

I will look at TVM and let you know.



Is there any way to see the current MXNet build details, i.e. whether it includes the above libraries?

Did you install MXNet with CMake? If so, you could check CMakeCache.txt. Alternatively, you could run ldd libmxnet.so: if MXNet was compiled with OpenMP, the library will show up, and you should see a line like libomp.so => /usr/local/lib/libomp.so.

You mentioned that the input images come from MQTT directly. I could imagine that you have some delays there, e.g. waiting for the next message, then preprocessing the image and postprocessing the result. One way of optimizing this is to have a separate reader process that gathers the incoming messages and creates a batch of images, which you can then feed into your model in one forward pass. If the distance computation is not part of your model, it could also be done in a separate process.
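A minimal sketch of that reader-plus-batching pattern, using only the standard library. The function names (`reader`, `batches`, `extract_features`) are hypothetical; in the real setup the reader would be your MQTT subscriber, and `extract_features` would be the batched MXNet forward pass:

```python
from multiprocessing import Process, Queue

BATCH_SIZE = 4

def reader(queue, messages):
    # In the real setup this would be the MQTT callback pushing each
    # decoded image into the queue; here we simulate incoming messages.
    for msg in messages:
        queue.put(msg)
    queue.put(None)  # sentinel: no more messages

def batches(queue, batch_size=BATCH_SIZE):
    # Drain the queue and yield lists of up to batch_size items,
    # so the model sees one batch instead of one image at a time.
    batch = []
    while True:
        item = queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly smaller, batch

def extract_features(batch):
    # Placeholder for the actual batched inference call, e.g. stacking
    # the images into one NDArray and calling the model once per batch.
    return [len(img) for img in batch]

if __name__ == "__main__":
    q = Queue()
    p = Process(target=reader, args=(q, ["img1", "img22", "img333"]))
    p.start()
    results = [extract_features(b) for b in batches(q)]
    p.join()
    print(results)
```

Batching amortizes the per-call overhead of inference, which often helps more than adding threads when CPU usage is low.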

@NRauschmayr, the result is:

ldd /home/pi/berryconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
linux-vdso.so.1 (0x7ed7d000)
/usr/lib/arm-linux-gnueabihf/libarmmem.so (0x74b09000)
libgfortran.so.3 => /usr/lib/arm-linux-gnueabihf/libgfortran.so.3 (0x74a33000)
libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x742a3000)
librt.so.1 => /lib/arm-linux-gnueabihf/librt.so.1 (0x7428c000)
libstdc++.so.6 => /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 (0x74144000)
libm.so.6 => /lib/arm-linux-gnueabihf/libm.so.6 (0x740c5000)
libgomp.so.1 => /usr/lib/arm-linux-gnueabihf/libgomp.so.1 (0x7408d000)
libgcc_s.so.1 => /lib/arm-linux-gnueabihf/libgcc_s.so.1 (0x74060000)
libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0x74037000)
libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0x73ef8000)
/lib/ld-linux-armhf.so.3 (0x76fa0000)
libdl.so.2 => /lib/arm-linux-gnueabihf/libdl.so.2 (0x73ee5000)

The Python Global Interpreter Lock, or GIL, is in simple terms a mutex (a lock) that allows only one thread to hold control of the Python interpreter at a time. All the GIL does is ensure that only one thread executes Python code at a time; control still switches between threads. What the GIL prevents, then, is using more than one CPU core (or separate CPUs) to run threads in parallel.
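A small illustration of that point, using a pure-Python CPU-bound loop (any function name here is just for the example). The threads produce correct results, but because of the GIL they take turns rather than running on separate cores:

```python
import threading

def count(n):
    # Pure-Python CPU-bound work; the GIL lets only one thread
    # execute this bytecode at any given moment.
    total = 0
    for _ in range(n):
        total += 1
    return total

results = []
lock = threading.Lock()

def worker(n):
    r = count(n)
    with lock:
        results.append(r)

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # correct results, but the two counts ran serially, not in parallel
```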

Python threading is great for creating a responsive GUI, or for handling many short web requests where I/O, rather than Python code, is the bottleneck. It is not suitable for parallelizing computationally intensive Python code: because of the GIL, Python threads are interleaved but in fact executed serially, not in parallel, and are only useful for overlapping I/O operations. For actual parallelism, use the multiprocessing module to fork multiple processes that execute in parallel, or delegate the heavy work to a dedicated external library. Threading remains an appropriate model when you want to run multiple I/O-bound tasks simultaneously.
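A multiprocessing sketch of the parallel case, with a stand-in `get_feature` (any CPU-heavy per-image function behaves the same way). Each worker is a separate process with its own interpreter and its own GIL, so the work genuinely uses multiple cores. One caveat for your setup: each process needs its own copy of the model in memory, which matters on a 1 GB Raspberry Pi:

```python
from multiprocessing import Pool, cpu_count

def get_feature(image):
    # Stand-in for the real per-image inference; in practice each worker
    # process would load the model once and reuse it across calls.
    return sum(ord(c) for c in image) % 256

if __name__ == "__main__":
    images = ["img-%d" % i for i in range(8)]
    # Separate processes sidestep the GIL entirely.
    with Pool(processes=min(4, cpu_count())) as pool:
        features = pool.map(get_feature, images)
    print(features)
```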