Slow speed in multi-GPU data loading

I found that copying data from the CPU to multiple GPUs with split_and_load is very slow. In the experiment below, loading ~420MB of data (25600 x 32 x 128 float32 values) takes 12.6 seconds. This is an EC2 p3.8xlarge instance with 4 GPUs, 32 CPUs, and 240GB of memory.

Is this a problem with the API or am I using the function incorrectly?

Thanks!

import time
from mxnet import nd, gpu
from mxnet.gluon.utils import split_and_load

batch_size = 1024
data = nd.random.uniform(shape=(25600, 32, 128))   # ~420MB of float32 on the CPU
devices = [gpu(0), gpu(1), gpu(2), gpu(3)]

t1 = time.time()
for epoch in range(1):
    for batch in range(0, len(data), batch_size):
        # split each batch evenly across the four GPUs and copy it over
        data_batch_mgpu = split_and_load(data[batch:batch + batch_size], devices)
# nd.waitall()  # uncomment to block until all asynchronous copies finish before stopping the timer
t2 = time.time()
print("total time = {:2.1f} seconds".format(t2 - t1))

The first iteration does not save you any time. The cost in your test is mostly from creating the CUDA streams for each device, and this happens serially, so it takes longer the more devices you enable. Those streams are reused later on, which you would see in your test if you ran more epochs instead of just one.
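
For reference, here is a minimal sketch of how to benchmark around that one-time cost, assuming the same data shape and devices as the original post: run one untimed warm-up pass so the per-device CUDA streams already exist, then time a second pass on its own.

import time
from mxnet import nd, gpu
from mxnet.gluon.utils import split_and_load

batch_size = 1024
data = nd.random.uniform(shape=(25600, 32, 128))
devices = [gpu(0), gpu(1), gpu(2), gpu(3)]

# warm-up pass: pays the one-time, serial cost of creating CUDA streams on each device
for batch in range(0, len(data), batch_size):
    _ = split_and_load(data[batch:batch + batch_size], devices)
nd.waitall()

# timed pass: the streams already exist, so this measures only the CPU-to-GPU copies
t1 = time.time()
for batch in range(0, len(data), batch_size):
    _ = split_and_load(data[batch:batch + batch_size], devices)
nd.waitall()
t2 = time.time()
print("time after warm-up = {:2.1f} seconds".format(t2 - t1))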


You are right. I tested repeated loading and it is much faster after the first pass.

Thanks!
