Slow GPU memory allocation when using MXNet built from source

When I use MXNet built from source, training does not start right away: GPU memory is allocated slowly (observed with nvidia-smi). There is no such problem with the pip version of MXNet.

Test example: mxnet/example/ctc
Hardware: Tesla P4 (I will test on a 1080 later)
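
To put a number on the delay instead of watching nvidia-smi, one option is to time the first GPU allocation against a second one. This is only a minimal sketch: the helper names `time_calls`, `first_gpu_allocation` and `report_startup_delay` are made up for illustration, and it assumes GPU 0 is the device being used for training.

```python
import time


def time_calls(fn, repeats=2):
    """Return the wall-clock seconds taken by each of `repeats` calls to fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def first_gpu_allocation():
    # Imported here so that time_calls can be used without mxnet installed.
    import mxnet as mx

    # Allocate a tiny array on GPU 0 (assumed device) and block until it is
    # ready, so the timing includes context creation and any startup work.
    mx.nd.zeros((1,), ctx=mx.gpu(0)).wait_to_read()


def report_startup_delay():
    # The first call pays the one-time startup cost; the second should not.
    first, second = time_calls(first_gpu_allocation)
    print("first allocation:  %.2f s" % first)
    print("second allocation: %.2f s" % second)
```

Calling `report_startup_delay()` once with the source build and once with the pip build, on the same machine, should show whether the gap is in the one-time startup cost or in every allocation.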

I got a solution from:

But it is still slower than the pip version.