MXNet vs NumPy: incredibly slow

Code:

import numpy as np
from mxnet import nd

def gen(size, step, count):
    r = np.empty(shape=(count, size), dtype='int32')    
    for i in range(count):
        r[i] = np.arange(i, i + size * step, step)
    return r

def gen_mx(size, step, count):
    r = nd.empty(shape=(count, size), dtype='int32')    
    for i in range(count):
        r[i] = nd.arange(i, i + size * step, step)
    r.wait_to_read()
    return r

%timeit gen(1000, 2, 100000)
473 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit gen_mx(1000, 2, 100000)
41.8 s ± 2.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

What am I doing wrong? Why is it so slow compared with numpy?

CPU load while running MXNet (screenshot omitted).

Hi Melvin,

I’m not very familiar with the details of NDArray, but vectorization will be faster than a plain for loop.

Here is an example:

def gen_mx2(col_size, step, row_size):
    # A[i, j] = i: the row index, broadcast across all columns
    A = nd.arange(row_size, dtype='int32').reshape(row_size, 1).broadcast_axes(axis=1, size=col_size)
    # B[i, j] = j * step: the scaled column index, broadcast across all rows
    B = nd.broadcast_axes(nd.arange(col_size, dtype='int32').reshape(1, col_size) * step, axis=0, size=row_size)
    # r[i, j] = i + j * step, the same values the loop version produces
    r = nd.elemwise_add(A, B)
    r.wait_to_read()
    return r

To avoid ambiguity, I changed the argument names, but you can run both functions with some small numbers to check that they produce the same result.
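
For example, a quick sanity check with small, arbitrary sizes (asnumpy() copies the NDArray result back to NumPy so the two arrays can be compared):

a = gen(5, 3, 4)                        # numpy version, shape (4, 5)
b = gen_mx2(5, 3, 4)                    # vectorized mxnet version, shape (4, 5)
print(np.array_equal(a, b.asnumpy()))   # should print True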

And on my laptop (CPU only), the results are:

%timeit gen(1000, 2, 100000)
538 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit gen_mx2(1000, 2, 100000)
2.06 s ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We can see that mxnet.nd is still slower than numpy, but not by nearly as much as in your previous test.

Since there is no real tensor computation in this benchmark, only array initialization, my guess is that MXNet spends the extra time on memory copies, but I'm not sure. We could run another test, such as matrix multiplication, to compare; a sketch is below.
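
For instance, a rough comparison could look like the following (just a sketch; the 2000x2000 size is arbitrary, and wait_to_read() forces MXNet's asynchronous dot to finish before %timeit stops the clock):

import numpy as np
from mxnet import nd

x_np = np.random.rand(2000, 2000).astype('float32')
y_np = np.random.rand(2000, 2000).astype('float32')
x_nd = nd.array(x_np)   # copy the same data into NDArrays
y_nd = nd.array(y_np)

%timeit np.dot(x_np, y_np)
%timeit nd.dot(x_nd, y_nd).wait_to_read()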