Cuda malloc when going distributed

feevos · April 9, 2019, 8:02am

I think I found the source of the error; leaving it here for reference. I think it relates to issue #14136. To resolve it, I added two arguments, pin_memory and pin_device_id, to the gluon.data.DataLoader calls:

# Load the training data
train_data = gluon.data.DataLoader(dataset_train,
                                   batch_size,
                                   sampler=SplitSampler(len(dataset_train), store.num_workers, store.rank),
                                   # Added: pin host memory, using this worker's rank as the device id
                                   pin_memory=True,
                                   pin_device_id=store.rank,
                                   last_batch='discard',
                                   num_workers=num_cpus)

# Load the test data
test_data = gluon.data.DataLoader(dataset_val,
                                  batch_size_per_gpu,
                                  shuffle=False,
                                  last_batch='discard',
                                  # Added: same pinning as for the training loader
                                  pin_memory=True,
                                  pin_device_id=store.rank,
                                  num_workers=num_cpus)

I basically pinned the memory and gave each worker a different device id for the pinned memory, namely its rank (I think!). I don’t know yet how this will behave in the validation phase, we’ll see. But I can now train without the cuda malloc error (without horovod at the moment, getting there …).
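
For anyone reading this later: SplitSampler in the snippet above is not a built-in gluon class, so it has to be defined somewhere in the training script. Below is a minimal sketch of what such a sampler might look like, assuming it simply gives each worker one contiguous, equally sized slice of the dataset indices and shuffles within that slice. It is written in plain Python so it runs on its own; in a real script it would presumably subclass gluon.data.sampler.Sampler, and the argument names num_parts and part_index are my own choice for illustration.

```python
import random


class SplitSampler:
    """Sketch of a sampler that yields only one worker's share of the indices.

    length:      total number of samples in the dataset
    num_parts:   number of workers (store.num_workers in the post above)
    part_index:  index of this worker (store.rank in the post above)
    """

    def __init__(self, length, num_parts=1, part_index=0):
        # Each worker gets the same number of samples; any remainder
        # at the end of the dataset is dropped.
        self.part_len = length // num_parts
        self.start = self.part_len * part_index
        self.end = self.start + self.part_len

    def __iter__(self):
        # Shuffle only within this worker's slice.
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return self.part_len


# Example: 10 samples split across 3 workers (sorted to undo the shuffle).
parts = [sorted(SplitSampler(10, num_parts=3, part_index=rank)) for rank in range(3)]
```

With this sketch no two workers see the same sample within an epoch, and with 10 samples and 3 workers the last sample is never used, which is the price of keeping the parts the same size.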
