`MXImperativeInvokeEx` is taking a long time

Hey guys. As shown in the image, MXImperativeInvokeEx is taking a long time. I wonder what it actually does.
I used the profiler to profile my whole program, from data loading to gradient updates.
There are also blank stages between the processes. I suspect that’s the data loading.
Sorry if I haven’t described my problem concretely; I’m new to MXNet. This program was originally written in PyTorch, and I recently rewrote it using MXNet Gluon (also using HybridBlock), with most things remaining the same. But the PyTorch version is 10 times faster than the MXNet version. I’m getting nowhere with finding a solution.
I’d like to find a place where I can chat with people in real time about the problem so that I can give the details.

Can you send a reproducible example? Which version of MXNet, cuda, cudnn and which OS are you running on?
One thing to keep in mind is that MXNet does some optimization at the beginning that can take some time. You can enable or disable it by setting the MXNET_CUDNN_AUTOTUNE_DEFAULT environment variable.
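For reference, a minimal sketch of switching the autotuning off from Python. It assumes the variable is set before `mxnet` is imported, which is the safe choice so the setting is in place when the library starts; exporting it in the shell before launching the script works as well.

```python
import os

# 0 disables cuDNN autotuning, 1 enables it.
# Set it before importing mxnet so it is in place when the library starts.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# import mxnet as mx  # import MXNet only after the variable is set
```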

Yes. If you don’t mind, this is my repository.
The related part is located in mx_hico/roi_mil.
It also uses a wrapper I wrote.
The related part is located in mx_wrapper/mx_wrapper.
Since I don’t know which part causes the problem, I can’t give a small reproducible sample, and I’m sorry for that.
I’ve turned MXNET_CUDNN_AUTOTUNE_DEFAULT off. I’m using MXNet 1.3.1, CUDA 8.0, cuDNN 5.1.3 on Ubuntu 14.04 with a Titan X.
Thank you for your reply.

Thank you for your reply

I put together a simple sample that reflects the problem.
For PyTorch:

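import time

import torch
from torch import nn
import torchvision.models as models

resnet = models.resnet50(pretrained=True)
# Assumed fix: DataParallel expects the model's parameters on the first device.
resnet = nn.DataParallel(resnet.cuda(0), [0, 1], output_device=0)
data = torch.ones(8, 3, 224, 224)
data = data.cuda(0)
tick = time.time()
# The rest of the snippet was cut off in the post; a forward pass is assumed here.
out = resnet(data)
torch.cuda.synchronize()  # wait for the GPU before reading the clock
print(time.time() - tick)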

For MXNet:

import time

import mxnet as mx
from mxnet import autograd, gluon
from gluoncv import model_zoo

ctx = [mx.gpu(0), mx.gpu(1)]
resnet = model_zoo.resnet50_v1(pretrained=True, ctx=ctx)
data = mx.nd.ones([8, 3, 224, 224])
splitted = gluon.utils.split_and_load(data, ctx_list=ctx)
tick = time.time()
with autograd.record():
    # The loop body was cut off in the post; a forward pass is assumed here.
    outputs = [resnet(_data) for _data in splitted]
mx.nd.waitall()  # MXNet runs asynchronously: wait before reading the clock
print(time.time() - tick)


And the result is:

pytorch: ~0.03s
mxnet: ~0.06s

I tried to reproduce the performance numbers. One problem I see in your example is that you are not iterating over multiple examples, which means your GPU is likely under-utilized. I would suggest a warm-up phase of 10 iterations followed by the main benchmark loop with 100 or more iterations. I ran your example with these modifications, and MXNet was then slightly faster.
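A benchmark loop along those lines could look like the sketch below. It is not the exact script used for the numbers above: the helper name `benchmark` and its parameters are made up for illustration, and a trivial stand-in workload is used so the sketch runs without a GPU.

```python
import time


def benchmark(step, sync, warmup=10, iters=100):
    """Return the mean wall-clock seconds per call of `step`.

    `step` runs one iteration; `sync` blocks until queued work has finished.
    """
    for _ in range(warmup):  # warm-up iterations are not timed
        step()
    sync()  # finish the warm-up work before starting the clock
    tick = time.perf_counter()
    for _ in range(iters):
        step()
    sync()  # wait for all timed work before stopping the clock
    return (time.perf_counter() - tick) / iters


# Stand-in workload (an assumption, only here to make the sketch runnable).
# With MXNet, `step` would run the forward/backward pass and `sync` would be
# mx.nd.waitall; with PyTorch, `sync` would be torch.cuda.synchronize.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(1000))), lambda: None)
print("mean seconds per iteration:", mean_seconds)
```

The explicit `sync` matters because both frameworks queue GPU work asynchronously, so reading the clock without waiting measures only how long it took to enqueue the operations.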

Thanks a lot for sharing the link to your repository. I will have a look at the code and see if I can find out why MXImperativeInvokeEx is taking a long time.

I tried to run your code https://git.dev.tencent.com/hyesun1832/mx_hico.git but it is missing the input dataset. Where can I find the dataset?

Yes, you can download it here:
Just place it in the DATA_DIR that your config file specifies.

Hey man. It seems I’ve found the problem. My loss computation needs weighting, and the weights are generated dynamically from the labels. I was generating them inside the loss function, which caused the slowdown.
I modified my pipeline so that the weight generation happens in the dataset, which lets me take advantage of the multiprocess worker loop.
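Roughly, the change has the shape of the sketch below. This is a simplified illustration, not the code from the repository: the class name `WeightedDataset`, the `make_weight` helper and the weighting rule (positive labels weighted 10x) are all hypothetical. The point is only that the weight is computed per sample in `__getitem__`, so data loader workers can do that work in parallel with training.

```python
import numpy as np


def make_weight(label, pos_weight=10.0):
    """Hypothetical weighting rule: positive labels get `pos_weight`, others 1."""
    return np.where(label > 0, pos_weight, 1.0).astype(np.float32)


class WeightedDataset:
    """Wraps (data, label) pairs and returns (data, label, weight) per sample."""

    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        label = self.labels[idx]
        # The weight is built here, per sample, instead of inside the loss function.
        return self.data[idx], label, make_weight(label)


# Toy data, only to show the shape of what each sample returns.
data = np.zeros((4, 3), dtype=np.float32)
labels = np.array([[0, 1], [1, 1], [0, 0], [1, 0]], dtype=np.float32)
dataset = WeightedDataset(data, labels)
x, y, w = dataset[0]
print(w)
```

In Gluon the class would subclass `gluon.data.Dataset` and be passed to `gluon.data.DataLoader` with `num_workers` set, and the loss would then take the precomputed weight as an input.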
I don’t know whether this has anything to do with MXImperativeInvokeEx, but the modification does speed up my program.
Furthermore, I updated CUDA (to 9.2) and cuDNN (to 7), and this also seems to help.
Thanks for your help.