When I try to use MXNet's profiler while running distributed training, it doesn't log any events into the json file. I set these environment variables to start the profiler: MXNET_PROFILER_AUTOSTART=1 MXNET_PROFILER_MODE=1
This just creates a 16KB json file containing only metadata events like the one below. No actual operators appear in the output.
"traceEvents": [
    {
        "ph": "M",
        "args": {
            "name": "cpu/0"
        },
        "pid": 0,
        "name": "process_name"
    },
    ...
I've also tried enabling the profiler from code with mx.profiler.profiler_set_state('run'). That doesn't work either.
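For completeness, this is roughly the in-code setup I mean (the filename 'profile.json' is just my choice, and I'm assuming mode='all' is the right setting to capture operator events as well as API events):

```python
import mxnet as mx

# Configure the (old-style) profiler before starting it:
# mode='all' should record operator events, filename is the output trace.
mx.profiler.profiler_set_config(mode='all', filename='profile.json')
mx.profiler.profiler_set_state('run')

# ... run training here ...

# Stop the profiler so the trace gets flushed to the json file.
mx.profiler.profiler_set_state('stop')
```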
This happens even when the job is launched locally using launch.py with the local launcher. This is the command I used, in case someone wants to reproduce it:
cd example/image-classification && ../../tools/launch.py -n 1 --launcher local python train_imagenet.py --benchmark 1 --kv-store dist_sync --gpus 0,1,2,3 --network mlp --batch-size 256 --num-epochs 1
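I also tried setting the environment variables inline on the launch command, on the assumption that launch.py forwards the caller's environment to the worker processes (I haven't verified that it does):

```shell
# Assumption: launch.py propagates these variables to the workers it spawns.
cd example/image-classification && \
MXNET_PROFILER_AUTOSTART=1 MXNET_PROFILER_MODE=1 \
../../tools/launch.py -n 1 --launcher local \
    python train_imagenet.py --benchmark 1 --kv-store dist_sync \
    --gpus 0,1,2,3 --network mlp --batch-size 256 --num-epochs 1
```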
I'd like to note that I compiled MXNet with USE_PROFILER=1, and I can profile jobs that run locally without launch.py.
How do I use the profiler correctly for distributed training?