DataLoader costs too much GPU-0 memory

I found that when I try to use DataLoader, it only consumes memory on GPU 0, especially when num_workers is large. Is it possible to distribute the memory cost evenly across all the GPUs used for training?

Hi @sumuwk,
Take a look at gluon.utils.split_and_load as described here to split your data across several contexts:

import mxnet as mx
from mxnet import gluon
from mxnet.test_utils import get_mnist

GPU_COUNT = 2  # increase if you have more
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
net.collect_params().initialize(ctx=ctx)  # net: your Gluon model, defined elsewhere

mnist = get_mnist()
batch = mnist['train_data'][0:GPU_COUNT*2, :]
# split_and_load slices the batch along axis 0 and copies one slice to each context
data = gluon.utils.split_and_load(batch, ctx)
print(net(data[0]))
print(net(data[1]))
    [[-0.01876061 -0.02165037 -0.01293943  0.03837404 -0.00821797 -0.00911531
       0.00416799 -0.00729158 -0.00232711 -0.00155549]
     [ 0.00441474 -0.01953595 -0.00128483  0.02768224  0.01389615 -0.01320441
      -0.01166505 -0.00637776  0.0135425  -0.00611765]]
    <NDArray 2x10 @gpu(0)>

    [[ -6.78736670e-03  -8.86893831e-03  -1.04004676e-02   1.72976423e-02
        2.26115398e-02  -6.36630831e-03  -1.54974898e-02  -1.22633884e-02
        1.19591374e-02  -6.60043515e-05]
     [ -1.17358668e-02  -2.16879714e-02   1.71219767e-03   2.49827504e-02
        1.16810966e-02  -9.52543691e-03  -1.03610428e-02   5.08510228e-03
        7.06662657e-03  -9.25292261e-03]]
    <NDArray 2x10 @gpu(1)>
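
If you are feeding the network from a gluon.data.DataLoader, the same idea applies per batch: the workers prefetch batches as CPU NDArrays, and each batch is only copied to a GPU when you call split_and_load on it. Here is a minimal sketch of such a loop; the toy Dense network, batch size, optimizer settings, and num_workers value are illustrative assumptions, not from this thread:

import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon.data.vision import MNIST, transforms

GPU_COUNT = 2
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]

# Toy one-layer network, purely for illustration -- substitute your own model
net = gluon.nn.Dense(10)
net.initialize(ctx=ctx)

# DataLoader workers prefetch batches into CPU memory; nothing lands on a GPU
# until we explicitly copy slices over with split_and_load below
dataset = MNIST(train=True).transform_first(transforms.ToTensor())
loader = gluon.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

for data, label in loader:
    # Slice each batch along axis 0 and place one slice on each GPU,
    # so the memory cost is spread evenly instead of piling up on gpu(0)
    data_parts = gluon.utils.split_and_load(data, ctx)
    label_parts = gluon.utils.split_and_load(label, ctx)
    with autograd.record():
        losses = [loss_fn(net(X), y) for X, y in zip(data_parts, label_parts)]
    for l in losses:
        l.backward()
    # Step with the total batch size; gradients from all contexts are aggregated
    trainer.step(data.shape[0])

Note that trainer.step takes the full batch size across all GPUs, since the gradient updates from every context are combined into a single parameter update.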