The running time of MXPredForward and MXPredGetOutput

The running time of the C API call MXPredForward is much shorter than that of MXPredGetOutput:

auto start = std::chrono::high_resolution_clock::now();
MXPredForward(pred_hnd);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
LOGI("MXPredForward: %lld microseconds.", static_cast<long long>(duration.count()));
std::vector<float> data(size);
start = std::chrono::high_resolution_clock::now();
MXPredGetOutput(pred_hnd, output_index, &(data[0]), static_cast<mx_uint>(size));
stop = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
LOGI("MXPredGetOutput: %lld microseconds.", static_cast<long long>(duration.count()));

The result is:

I/MXNET: MXPredForward: 106 microseconds.
I/MXNET: MXPredGetOutput: 3748967 microseconds.

Why? Is it something related to lazy evaluation?
The code runs on Pixel3 with Snapdragon 835.

Hi yizhao,

MXPredForward is an asynchronous call: it only starts processing your input data through the network and then returns.

MXPredGetOutput requires that all operations are finished in order to extract the output data, so it blocks until the inference is finished.
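
The same timing pattern can be reproduced without MXNet. The sketch below is only an analogy built on Python's standard library, not MXNet's engine: slow_inference is a hypothetical stand-in for the network, submit() plays the role of MXPredForward (it queues the work and returns), and result() plays the role of MXPredGetOutput (it blocks until the work is done).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_inference():
    # Hypothetical stand-in for the real network computation.
    time.sleep(0.2)
    return [0.0, 1.0]


with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter()
    future = pool.submit(slow_inference)  # queues the work and returns
    t1 = time.perf_counter()
    result = future.result()              # blocks until the work is finished
    t2 = time.perf_counter()

forward_time = t1 - t0      # only the cost of queuing the work
get_output_time = t2 - t1   # includes almost all of the computation

print(f"submit: {forward_time * 1e6:.0f} microseconds")
print(f"result: {get_output_time * 1e6:.0f} microseconds")
```

Almost all of the elapsed time is charged to the blocking call, even though the work was started by the first call. The sum of the two measurements is the meaningful number for the whole inference.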

regards,

Lieven

Hi Lieven,
Thanks for your reply. It’s really helpful, I understand it now.
There is another question about measuring the running time of networks:
In Python, I divide a network into two parts, say part1 and part2, where the output of part1 is the input to part2. I want to measure their running times separately to find out which part needs more computation time.
Code version 1:

import time
# run part 1
start = time.time()
module_part1.forward(...)
output_1 = module_part1.get_outputs()
end = time.time()
time1 = end - start
# run part 2
start = time.time()
module_part2.forward(output_1, ...)
output_2 = module_part2.get_outputs()
print(output_2)
end = time.time()
time2 = end - start

Running the above code shows that time2 > time1.
However, if I change the code to version 2:

import time
# run part 1
start = time.time()
module_part1.forward(...)
output_1 = module_part1.get_outputs()
if output_1[0][0][0][0] == 0: # Add this line, access output_1 in some way
  do_nothing = 1
end = time.time()
time1 = end - start
# run part 2
start = time.time()
module_part2.forward(output_1, ...)
output_2 = module_part2.get_outputs()
print(output_2)
end = time.time()
time2 = end - start

Running version 2 shows that time1 > time2.
I suspect that this is related to lazy evaluation?
Which version of code gives the correct running time of the two parts of the network?
Thanks very much!

Hi,

Both calls to forward return immediately and process their respective inputs asynchronously. So in both part 1 and part 2 you should wait for the output to be available before you stop the timer.
In your version 2 you only wait for the output of part 1.

Assuming output_1 and output_2 are of type mxnet.ndarray.NDArray, you can call output_1.wait_to_read() and output_2.wait_to_read(). If they are Python lists containing NDArrays, just iterate over the list elements and invoke wait_to_read() on each of them.
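
A minimal sketch of that advice follows. The helper name timed_stage and the FakeAsyncArray class are hypothetical and exist only for illustration; FakeAsyncArray stands in for an NDArray so the snippet runs without MXNet. The only thing the helper assumes about the real outputs is that each one has a wait_to_read() method.

```python
import threading
import time


def timed_stage(run_stage):
    """Run one stage, wait for every output, return (outputs, seconds)."""
    start = time.perf_counter()
    outputs = run_stage()
    for out in outputs:
        out.wait_to_read()  # block until this output has been computed
    return outputs, time.perf_counter() - start


class FakeAsyncArray:
    """Hypothetical stand-in for an NDArray that is computed in the background."""

    def __init__(self, seconds):
        # The "computation" is just a sleep running on another thread.
        self._worker = threading.Thread(target=time.sleep, args=(seconds,))
        self._worker.start()

    def wait_to_read(self):
        self._worker.join()


outputs, elapsed = timed_stage(lambda: [FakeAsyncArray(0.1)])
print(f"stage took {elapsed:.3f} s")
```

With the modules from the question, run_stage would be a function that calls forward on one module and returns its get_outputs() list. Timing each part this way charges the computation to the part that started it.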

Lieven

Hi, Lieven
Thanks so much for your help! I understand it now.