Performance of distributed training using dist_sync kv_store


I’m using mxnet-mkl to run distributed MXNet training on several CPU nodes.
I’m referring to this page for distributed training.
My current status: with 8 nodes, 4 workers, and 4 servers, each worker/server placed on its own node and using half of that node’s vcores, the cores assigned to the workers are fully occupied, which I assume is good.
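For reference, this is roughly how I launch the jobs (a hedged sketch: `tools/launch.py` ships with the MXNet source tree, `hosts` is a file I created listing the 8 node addresses one per line, and `train.py` stands in for my actual training script):

```shell
# -n = number of workers, -s = number of servers; the launcher places
# them on the machines listed in 'hosts' and starts the scheduler itself.
# --kv-store dist_sync makes workers aggregate gradients synchronously
# through the parameter servers at each batch.
python tools/launch.py -n 4 -s 4 -H hosts --launcher ssh \
    python train.py --kv-store dist_sync
```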

Looking for improvements, I have some questions:

1. If I put one worker and one server on the same node (saving half of the nodes), each using half of that node’s vcores, the throughput of each worker seems to drop quite a lot. Am I doing something wrong?
2. Is there a best practice for getting good performance while fully using the available CPU resources?
3. Are there any official performance numbers for distributed training?
4. Is there a best practice for how many servers to use (not necessarily equal to the number of workers)?
5. When launching the distributed training, I suppose workers and servers are treated equally. But in terms of cores, does a server need as many cores as a worker? Any suggestions here?

Thank you so much in advance!

Update: after setting environment variables to bind each process to its own cores, this problem seems to be solved.
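For anyone hitting the same issue, this is the kind of pinning I mean (a hedged sketch; the core counts are illustrative for a 16-vcore node split in half, and `train.py` stands in for the actual script):

```shell
# Limit each MXNet process to 8 MKL/OpenMP compute threads so a worker
# and a server sharing one node do not oversubscribe the same vcores.
export OMP_NUM_THREADS=8
# Intel OpenMP thread-affinity setting used with MKL builds; pins each
# OpenMP thread to a fixed core instead of letting them migrate.
export KMP_AFFINITY=granularity=fine,compact,1,0
# Alternatively, bind the whole process to a disjoint core range, e.g.:
# taskset -c 0-7  python train.py ...   # worker on cores 0-7
# taskset -c 8-15 python train.py ...   # server on cores 8-15
```

Without this, the worker’s and server’s threads float across all vcores and contend with each other, which matches the throughput drop I described above.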