Parallelize Operators

I am trying to parallelize execution of my code across multiple CPU cores. When I set MXNET_CPU_WORKER_NTHREADS, the effective number of cores in use actually decreases, even though I set the variable to less than the total number of cores I have. It almost appears that setting this variable turns off the parallelization inside NNPACK, OpenMP, and similar libraries.

I was wondering 1) whether I need to set another variable to let MXNet run its operators across cores; 2) whether calling asnumpy(), or some other blocking call inside the network, automatically blocks the entire network (as opposed to blocking only the part of the network that the call depends on); and 3) whether there are other obvious places to check when trying to understand why MXNet is using neither the memory nor the CPU cores I have available.

If I instead set MXNET_CPU_PRIORITY_NTHREADS, nothing seems to slow down. However, nothing seems to speed up either.
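
For reference, here is a minimal sketch of the kind of setup I am describing. It assumes the variables are set from Python before mxnet is imported, so that they are already in the environment when the library initializes. The thread counts are placeholder values, the helper name apply_thread_settings is made up for illustration, and OMP_NUM_THREADS is the standard OpenMP variable that I am assuming is relevant here; it is not something I have confirmed changes the behavior.

```python
import os

# Placeholder values for illustration only, not recommendations.
# MXNET_CPU_WORKER_NTHREADS and MXNET_CPU_PRIORITY_NTHREADS are the
# variables discussed above; OMP_NUM_THREADS is the standard OpenMP
# setting and is included here as an assumption.
THREAD_SETTINGS = {
    "MXNET_CPU_WORKER_NTHREADS": "4",
    "MXNET_CPU_PRIORITY_NTHREADS": "4",
    "OMP_NUM_THREADS": "4",
}


def apply_thread_settings(settings=THREAD_SETTINGS):
    """Hypothetical helper: export the settings into the process environment.

    Call this before importing mxnet so the values are visible when the
    library starts up.
    """
    for name, value in settings.items():
        os.environ[name] = str(value)
    # Return what is now in the environment so it can be logged or checked.
    return {name: os.environ[name] for name in settings}


applied = apply_thread_settings()
print(applied)

# Import mxnet only after the environment is configured. The import is
# guarded so the sketch still runs on a machine without mxnet installed.
try:
    import mxnet as mx
except ImportError:
    mx = None
```

Printing the applied values at least confirms that the process sees the settings I think it does, which rules out one source of confusion before looking at the engine itself.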