Distributed training questions

Hello again,

I finally managed to make mxnet+horovod work together (woohoooo - for now). Some questions:

  1. What is the operational difference, when it comes to trainer.step(SomeBatchSize), between horovod.mxnet.DistributedTrainer and mxnet.gluon.Trainer? I read in the documentation (but I cannot really understand how this affects the usage) that:
# DistributedTrainer, a subclass of MXNet gluon.Trainer.
# There are two differences between DistributedTrainer and Trainer:
# 1. DistributedTrainer calculates gradients using Horovod allreduce
#    API while Trainer does it using kvstore push/pull APIs;
# 2. DistributedTrainer performs allreduce(summation) and average
#    while Trainer only performs allreduce(summation).

The first thing I note, based on a comparison with the horovod mnist example, is that in horovod we use step(batch_per_worker_rank), i.e. the batch size passed to step() is the one seen by a single process. In the “standard” case (with the kvstore), assuming e.g. a single node with 4 gpus, we were using the total batch size (across all gpus), not just the batch_size per process. This is evident e.g. in the distributed training example with a parameter server:

# Train a batch using multiple GPUs
def train_batch(batch, ctx, net, trainer):

    # Split and load data into multiple GPUs
    data = batch[0]
    data = gluon.utils.split_and_load(data, ctx)
    
    # Split and load label into multiple GPUs
    label = batch[1]
    label = gluon.utils.split_and_load(label, ctx)

    # Run the forward and backward pass
    forward_backward(net, data, label)

    # Update the parameters
    this_batch_size = batch[0].shape[0]
    trainer.step(this_batch_size)

So, comparing the two, I see that with hvd we call trainer.step(batch_size_per_gpu), while with the kvstore we call trainer.step(total_batch_size_over_all_gpus_for_a_single_node). Right?
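
For reference, here is a minimal sketch of how I currently wire up the Horovod side, loosely following the horovod mnist example (the model, learning rate and batch_per_worker below are placeholders, not the values from that example):

import mxnet as mx
from mxnet import autograd, gluon
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())                 # one GPU per process

net = gluon.nn.Dense(10, in_units=784)         # placeholder model
net.initialize(ctx=ctx)
params = net.collect_params()

# Horovod: broadcast initial parameters and build the distributed trainer
hvd.broadcast_parameters(params, root_rank=0)
trainer = hvd.DistributedTrainer(params, 'sgd', {'learning_rate': 0.01})

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
batch_per_worker = 64                          # batch size seen by THIS process

def train_one_batch(data, label):
    data, label = data.as_in_context(ctx), label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    # per-worker batch size here, not the global batch over all workers
    trainer.step(batch_per_worker)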

  2. Can I still use the method of delayed gradient updates (grad_req='add' etc.) in combination with horovod? A sketch of the single-trainer pattern I mean follows below.
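
For context, this is the single-trainer delayed-update pattern I have in mind (a minimal, self-contained sketch with synthetic data; whether the same pattern is valid with hvd.DistributedTrainer is exactly my question):

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(10, in_units=784)         # placeholder model
net.initialize(ctx=mx.cpu())

# Accumulate gradients across mini-batches instead of overwriting them
for p in net.collect_params().values():
    p.grad_req = 'add'

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

batch_size = 64
accumulate = 4                                 # update once every 4 batches

# tiny synthetic data just to make the sketch runnable
data_iter = [(mx.nd.random.uniform(shape=(batch_size, 784)),
              mx.nd.random.randint(0, 10, shape=(batch_size,)).astype('float32'))
             for _ in range(8)]

for i, (data, label) in enumerate(data_iter):
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    if (i + 1) % accumulate == 0:
        trainer.step(batch_size * accumulate)  # delayed update
        for p in net.collect_params().values():
            p.zero_grad()                      # reset accumulated gradients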

  3. If I want to load optimizer states with trainer.load_states(filename_states), do I need to broadcast them to all other workers the same way I do with the model parameters?

params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)
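
In case it helps frame the question, the fallback I am considering is to simply have every rank read the same states file (a sketch under the assumption that the checkpoint sits on shared storage; filename_states is just a placeholder path), instead of broadcasting anything:

import os

filename_states = 'checkpoint.states'          # placeholder path on shared storage

# Every rank loads identical optimizer states from disk, so no explicit
# broadcast of the states is attempted here. `trainer` is the
# hvd.DistributedTrainer from the snippet above.
if os.path.exists(filename_states):
    trainer.load_states(filename_states)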
  4. In the definition of the DataLoader, for pinned device memory, I assume I must use hvd.rank(), not hvd.local_rank(), right?
     Edit: in issue 14136 user yuxihu mentions that we should use hvd.local_rank(). Could someone please explain this a bit more? Does kvstore.rank provide the “same” ID as hvd.local_rank()?
data_generator = gluon.data.DataLoader(
    some_dataset,
    batch_size=batch_per_gpu,
    sampler=SplitSampler(len(some_dataset), hvd.size(), hvd.rank()),
    pin_memory=True,
    pin_device_id=hvd.rank(),
    last_batch='discard',
    num_workers=0)
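
And this is the local_rank variant that I read issue 14136 as suggesting (a sketch of my interpretation, not a confirmed answer; SplitSampler is the helper from the parameter-server tutorial):

# Pin host memory against the GPU this process drives on its own node
data_generator = gluon.data.DataLoader(
    some_dataset,
    batch_size=batch_per_gpu,
    sampler=SplitSampler(len(some_dataset), hvd.size(), hvd.rank()),
    pin_memory=True,
    pin_device_id=hvd.local_rank(),            # local_rank = GPU index on this node
    last_batch='discard',
    num_workers=0)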

Thank you very much for your time.
