I think I found the source of the error; leaving it here for reference. I think it relates to issue #14136. To resolve it, I added two arguments, pin_memory and pin_device_id, to each gluon.data.DataLoader call (marked with asterisks):
# Load the training data
train_data = gluon.data.DataLoader(dataset_train,
                                   batch_size,
                                   sampler=SplitSampler(len(dataset_train), store.num_workers, store.rank),
                                   # *****************************
                                   pin_memory=True,
                                   pin_device_id=store.rank,
                                   # *******************************
                                   last_batch='discard',
                                   num_workers=num_cpus)
# Load the test data
test_data = gluon.data.DataLoader(dataset_val,
                                  batch_size_per_gpu,
                                  shuffle=False,
                                  last_batch='discard',
                                  # ******** new test ************
                                  pin_memory=True,
                                  pin_device_id=store.rank,
                                  # *******************************
                                  num_workers=num_cpus)
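For reference, SplitSampler is not defined in the snippet. Here is a minimal sketch of what it could look like, assuming it behaves like the helper in the MXNet distributed training example: each worker gets one contiguous, equally sized slice of the indices and shuffles within it. It is written in plain Python so it runs on its own; in real use it would presumably subclass gluon.data.sampler.Sampler.

```python
import random


class SplitSampler:
    """Sketch: sample indices from one contiguous part of the dataset.

    Assumption: mirrors the helper from the MXNet distributed training
    example; the author's actual class may differ.
    """

    def __init__(self, length, num_parts=1, part_index=0):
        # Each part has the same length; leftover samples are dropped.
        self.part_len = length // num_parts
        self.start = self.part_len * part_index
        self.end = self.start + self.part_len

    def __iter__(self):
        # Shuffle only within this worker's slice.
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return self.part_len
```

With 10 samples and 3 workers, each worker would see 3 indices and the last sample would be dropped.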
With these two arguments I basically pinned the memory and gave each worker a different device id, namely its rank (I think!). I don’t know how this will work in the validation phase; we’ll see. But I can now train without the CUDA malloc error (without Horovod at the moment, getting there …).
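Since the same two arguments go into both loaders, they could be built once and reused. A small sketch, where pin_kwargs and its gpus_per_node parameter are hypothetical names of mine: I have not verified whether the raw rank is a valid device id when workers are spread over several machines, so the wrap-around is only an assumption for that case.

```python
def pin_kwargs(rank, gpus_per_node=None):
    """Build the pinning keyword arguments for gluon.data.DataLoader.

    rank: the worker rank (store.rank in the post).
    gpus_per_node: hypothetical parameter; when given, the rank is wrapped
    so the device id stays within the GPUs of a single machine.
    """
    device_id = rank if gpus_per_node is None else rank % gpus_per_node
    return {"pin_memory": True, "pin_device_id": device_id}
```

It would then be passed as, e.g., gluon.data.DataLoader(dataset_val, batch_size_per_gpu, shuffle=False, last_batch='discard', num_workers=num_cpus, **pin_kwargs(store.rank)).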