I found that when I try to use DataLoader, it only consumes memory on gpu(0), especially when num_workers is large. Is it possible to distribute the memory cost evenly across all the GPUs used for training?
Hi @sumuwk
Take a look at split_and_load as described here to split your data across several contexts:
import mxnet as mx
from mxnet import gluon
from mxnet.test_utils import get_mnist

GPU_COUNT = 2  # increase if you have more
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]

# net is assumed to be a gluon.Block you have already defined;
# passing a list of contexts replicates the parameters on every GPU
net.collect_params().initialize(ctx=ctx)

mnist = get_mnist()
batch = mnist['train_data'][0:GPU_COUNT * 2, :]

# split the batch evenly along axis 0 and copy each slice to its own GPU
data = gluon.utils.split_and_load(batch, ctx)
print(net(data[0]))  # runs on gpu(0)
print(net(data[1]))  # runs on gpu(1)
[[-0.01876061 -0.02165037 -0.01293943  0.03837404 -0.00821797 -0.00911531  0.00416799 -0.00729158 -0.00232711 -0.00155549]
 [ 0.00441474 -0.01953595 -0.00128483  0.02768224  0.01389615 -0.01320441 -0.01166505 -0.00637776  0.0135425  -0.00611765]]
<NDArray 2x10 @gpu(0)>
[[ -6.78736670e-03  -8.86893831e-03  -1.04004676e-02   1.72976423e-02   2.26115398e-02  -6.36630831e-03  -1.54974898e-02  -1.22633884e-02   1.19591374e-02  -6.60043515e-05]
 [ -1.17358668e-02  -2.16879714e-02   1.71219767e-03   2.49827504e-02   1.16810966e-02  -9.52543691e-03  -1.03610428e-02   5.08510228e-03   7.06662657e-03  -9.25292261e-03]]
<NDArray 2x10 @gpu(1)>
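If it helps to see the splitting behavior in isolation, here is a minimal CPU-only sketch (using NumPy rather than MXNet, so it runs without GPUs) of what split_and_load does with an evenly divisible batch: slice it along axis 0 into one equal chunk per context. The function name split_batch is hypothetical, chosen for illustration.

```python
import numpy as np

def split_batch(batch, num_slices):
    """Split `batch` along axis 0 into `num_slices` equal parts,
    mimicking how split_and_load assigns one slice per context."""
    assert batch.shape[0] % num_slices == 0, "batch size must divide evenly"
    return np.split(batch, num_slices, axis=0)

batch = np.arange(8 * 3).reshape(8, 3)   # 8 samples, 3 features
parts = split_batch(batch, 2)            # pretend we have 2 GPUs
print([p.shape for p in parts])          # [(4, 3), (4, 3)]
```

In the real call, each of those slices is additionally copied to its corresponding entry in the ctx list, which is what spreads the memory cost across the GPUs.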