Access of GPU array very slow

Hi, I have followed the tutorial on using the YOLO object detector for my application.
I want to process the frames of a full HD video.
I can process each image in about 0.1 s (large image), but when I try to access the class_IDs or score outputs from the net (class_IDs, scores, bounding_boxs = net(x)), the time required is around 0.5 s! I am just looping over the first 10 entries of class_IDs to check for a particular target.
I assume it is because these are GPU arrays. I have tried with CPU only, and the time is even slower.

However, when I do something similar with a NumPy array, the time taken is about 0.01 s.

I have also tried converting class_IDs to a NumPy array first using .asnumpy(), but the overall time is about the same.

I am using Windows. Is this something peculiar to Windows? Is there any way to rapidly check the entries of class_IDs so that I can process and identify objects in multiple images per second?

Thank you.

I can reproduce the same waiting time if I pass in a fairly big image.

What you are encountering is not really slow access to GPU memory; it is the asynchronous nature of Apache MXNet. Once you call net(x), control returns to you immediately, but the computation is not done yet: it has only started executing asynchronously. When you then try to access class_IDs, that request has to be synchronous, so MXNet must wait until the computation finishes before it can actually let you read class_IDs.
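To make the timing pattern concrete, here is a small stand-alone sketch that mimics the behaviour with a thread pool (detect here is a hypothetical stand-in for net(x), not the real model): submitting the work returns instantly, while reading the result blocks until the computation is done. In MXNet the same idea applies, except the engine does the queuing for you and you would synchronize with class_IDs.wait_to_read() or mx.nd.waitall() before stopping the clock.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def detect(frame):
    """Hypothetical stand-in for the YOLO forward pass net(x)."""
    time.sleep(0.5)  # pretend the GPU computation takes 0.5 s
    return [0, 1, 2]  # pretend class IDs

with ThreadPoolExecutor() as pool:
    t0 = time.time()
    future = pool.submit(detect, "frame")  # returns immediately, like net(x)
    enqueue_time = time.time() - t0        # tiny: the work has only been queued

    t0 = time.time()
    class_ids = future.result()            # blocks, like reading class_IDs
    sync_time = time.time() - t0           # ~0.5 s: the real cost shows up here

print(f"enqueue: {enqueue_time:.3f}s, sync: {sync_time:.3f}s")
```

So the 0.5 s you measured when touching class_IDs is mostly the detection itself, which was still running in the background.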

I noticed that if I have multiple images of exactly the same size, the first image takes significant time to process, but the second does not. This happens because MXNet searches for the best convolution algorithm for every new input image size, which slows things down considerably the first time an image of a new size is passed in.

What you should do is make sure that your images all have the same height and width; that would increase speed considerably, though processing the first image would still take a long time. You can change the height and width of the images by, for example, resizing them in the transformation function: x, img = data.transforms.presets.yolo.load_test(im_fname, short=1024). Here the image is resized so that its smallest side becomes 1024 pixels.
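For intuition, this is the arithmetic that short-side resizing performs (a small sketch of the scaling rule, not the GluonCV code itself): scale the image so the shorter side hits the target while keeping the aspect ratio. Every full HD frame then maps to the same output size, so the convolution algorithm search only happens once.

```python
def short_side_resize(w, h, short=1024):
    """Return the (width, height) after resizing so min(w, h) == short."""
    scale = short / min(w, h)
    return round(w * scale), round(h * scale)

# Every 1920x1080 frame lands on the same size, so autotune runs once.
print(short_side_resize(1920, 1080))  # -> (1820, 1024)
```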

If it is important to you that the first image is processed faster, and you are fine with later images being processed at that same speed, then you can set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0. If you do so, the search for the best convolution algorithm will not happen, and you will get consistent (but somewhat worse) performance across all your runs.
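One way to set it from inside your script (set it before the mxnet import so it takes effect; you can equally set it in the shell or in the Windows environment variables dialog):

```python
import os

# Disable the cuDNN convolution-algorithm search; must be set
# before mxnet is imported for it to take effect.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# import mxnet as mx  # import only after setting the variable
```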


Thanks Sergey for your response. Much appreciated. That makes sense.

Actually I have only processed 1 frame of the sequence, but the intention is to process many frames.

Practically speaking, is there another way to handle this? Is there a way of knowing (programmatically) when an image has been processed (completed)?

I’m thinking that I could access class_IDs for frame n only after frame (n+m) has been submitted for processing. Thus, there would be a delay of m iterations before I try to access class_IDs from m frames ago. The value of m should be large enough to ensure that object detection on frame n has completed by iteration (n+m).
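That scheme can be sketched with a queue of in-flight results: submit each frame, and only read the result that is m frames behind. The sketch below uses a thread pool as a stand-in for MXNet's asynchronous engine, and detect is a hypothetical placeholder for net(x). With MXNet itself you would not need the pool at all: call net(x) for each frame, append the returned NDArrays to the deque, and call .asnumpy() only when popping, since the engine has been computing in the background in the meantime.

```python
import collections
from concurrent.futures import ThreadPoolExecutor

M = 3  # pipeline depth: how many frames to stay behind

def detect(frame):
    """Hypothetical stand-in for net(x) -> class_IDs."""
    return [frame % 2]

results = []
with ThreadPoolExecutor(max_workers=M) as pool:
    pending = collections.deque()
    for frame in range(10):
        pending.append(pool.submit(detect, frame))  # submit frame n
        if len(pending) > M:
            # Read frame (n - M); by now it has likely finished,
            # so this blocks only briefly, if at all.
            results.append(pending.popleft().result())
    while pending:  # drain the last M frames at the end
        results.append(pending.popleft().result())
```

Results come back in frame order, so the loop still sees one detection per frame, just m iterations late.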

I hope this sort of makes sense!