Python multithreaded queue and multiprocess inference on CPU question


I am trying to extract features and run inference on the CPU. I have a MacBook Pro with 8 cores and 16 GB of RAM.

mxnet: 1.3.1
mxnet-mkl: 1.3.1
Python 3.6

In my Python code I try to set the environment variables (note that these generally need to be set before `import mxnet`, since they are read when the library loads):

```
import os
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MXNET_CPU_NNPACK_NTHREADS"] = "8"
os.environ["MXNET_MP_OPENCV_NUM_THREADS"] = "1"
```

I have two questions:

1 - On my Mac only 2-3 cores are utilized instead of 8. How can I tell MXNet to use all available cores and get the full power of my CPU?

2 - I am using a multithreaded queue and want to run inference in 4 threads simultaneously. With 1 thread everything is OK, but when I increase it to 2 or 4 threads it throws the error below from time to time, even though inference on the same image never fails with a single thread.

```
<class 'mxnet.base.MXNetError'>, MXNetError('[13:45:17] src/operator/contrib/.../tensor/.../elemwise_op_common.h:133: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 0-th output: expected [1,3,20,35], got [1,3,198,360]\n\nStack trace returned 10 entries:\n[bt] (0) 0 0x0000000111601b90 + 15248\n[bt] (1) 1 0x000000011160193f + 14655\n[bt] (2) 2 0x0000000111601569 + 13673\n[bt] (3) 3 0x000000011173d1c2 + 1307074\n[bt] (4) 4 0x000000011173ce1f + 1306143\n[bt] (5) 5 0x0000000111737f94 + 1286036\n[bt] (6) 6 0x0000000112b485da MXNDListFree + 502922\n[bt] (7) 7 0x0000000112b470a4 MXNDListFree + 497492\n[bt] (8) 8 0x0000000112aa441e MXCustomFunctionRecord + 20926\n[bt] (9) 9 0x0000000112aa5140 MXImperativeInvokeEx + 176\n\n',), <traceback object at 0x14f788148>)
```


  1. Do you mean your Mac only utilizes 2-3 cores during inference? The CPU cores are also used during data loading, so there may be some contention there. The environment variables you set are the right ones: they control the number of worker threads for the engine, which determines how many independent operators can execute in parallel.

  2. You can't run inference from 4 threads simultaneously, because the MXNet engine is not thread safe. You could explore having multiple processes create the input data, but they must feed a single input queue that one consumer drains into the computational engine. There's some more info in this GitHub issue.
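The pattern above can be sketched without MXNet: several worker processes prepare inputs in parallel, and one consumer pulls from a single queue, so only that one process would ever touch the model. Here `preprocess` is a hypothetical stand-in for real image loading, and the comment marks where the actual inference call would go; this is a minimal sketch, not the forum's exact code.

```python
import multiprocessing as mp

def preprocess(path):
    # Hypothetical stand-in for loading/resizing an image into an array.
    return [len(path)]

def worker(paths, out_q):
    # Each worker process prepares inputs independently of the others.
    for p in paths:
        out_q.put(preprocess(p))
    out_q.put(None)  # sentinel: this worker is done

def run(all_paths, n_workers=4):
    out_q = mp.Queue()
    chunks = [all_paths[i::n_workers] for i in range(n_workers)]
    procs = [mp.Process(target=worker, args=(c, out_q)) for c in chunks]
    for p in procs:
        p.start()
    results = []
    done = 0
    # Single consumer: only this process would call the model, so the
    # non-thread-safe engine sees exactly one caller.
    while done < n_workers:
        item = out_q.get()
        if item is None:
            done += 1
        else:
            results.append(item)  # real code would do: net(nd.array(item))
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    # Guard is required on platforms that spawn rather than fork.
    print(len(run(["a.jpg", "bb.jpg", "ccc.jpg", "dddd.jpg"])))
```

The sentinel-per-worker scheme lets the consumer know when all producers have finished without polling process state.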