Hello again,
I finally managed to make mxnet+horovod work together (woohoooo - for now). Some questions:
- What is the operational difference of trainer.step(SomeBatchSize) between horovod.mxnet.DistributedTrainer and mxnet.gluon.Trainer? I read in the documentation (but I cannot really understand how this affects usage) that:
# DistributedTrainer, a subclass of MXNet gluon.Trainer.
# There are two differences between DistributedTrainer and Trainer:
# 1. DistributedTrainer calculates gradients using Horovod allreduce
# API while Trainer does it using kvstore push/pull APIs;
# 2. DistributedTrainer performs allreduce(summation) and average
# while Trainer only performs allreduce(summation).
The first thing I notice, comparing against the Horovod MNIST example, is that with Horovod we call step(batch_per_worker_rank), i.e. the batch size passed to step is the one seen by a single process. In the "standard" case (with the kvstore), assuming e.g. a single node with 4 GPUs, we pass the total batch size (across all GPUs), not the batch size per process. This is evident e.g. in the distributed training example with a parameter server:
# Train a batch using multiple GPUs
def train_batch(batch, ctx, net, trainer):
    # Split and load data into multiple GPUs
    data = batch[0]
    data = gluon.utils.split_and_load(data, ctx)
    # Split and load label into multiple GPUs
    label = batch[1]
    label = gluon.utils.split_and_load(label, ctx)
    # Run the forward and backward pass
    forward_backward(net, data, label)
    # Update the parameters
    this_batch_size = batch[0].shape[0]
    trainer.step(this_batch_size)
So in comparison with the kvstore, I see that the Horovod trainer uses step(batch_size_per_gpu), while the kvstore trainer uses step(total_batch_size_over_all_gpus_for_a_single_node). Right?
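To make the comparison concrete, here is a minimal sketch of the Horovod side, modeled on the Horovod MNIST example; build_model, train_data, and the optimizer settings are placeholders of mine, not from any official example:

import mxnet as mx
from mxnet import autograd, gluon
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())  # one GPU per process

net = build_model()  # placeholder for your gluon.Block
net.initialize(ctx=ctx)
params = net.collect_params()
hvd.broadcast_parameters(params, root_rank=0)

opt = mx.optimizer.create('sgd', learning_rate=0.01)
trainer = hvd.DistributedTrainer(params, opt)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

for data, label in train_data:  # this process's shard of the data
    data = data.as_in_context(ctx)
    label = label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    # Batch size seen by THIS process, not the global batch size
    trainer.step(data.shape[0])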
- Can I still use delayed gradient updates (grad_req='add', etc.) in combination with Horovod? (The single-process pattern I mean is sketched below.)
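To be concrete, this is the plain-Gluon accumulation pattern I have in mind (reusing the placeholder names from the sketch above); my question is whether it still behaves correctly once trainer is a DistributedTrainer:

accumulate = 4  # update parameters every 4 batches
# grad_req='add' makes backward() accumulate into .grad instead of overwriting
for p in net.collect_params().values():
    p.grad_req = 'add'

for i, (data, label) in enumerate(train_data):
    data = data.as_in_context(ctx)
    label = label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    if (i + 1) % accumulate == 0:
        trainer.step(accumulate * data.shape[0])  # scale by accumulated batch
        for p in net.collect_params().values():
            p.zero_grad()  # reset accumulated gradients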
- If I want to load optimizer states with trainer.load_states(filename_states), do I need to broadcast them to all other workers the same way I do with the model parameters?
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)
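For concreteness, here is what I would try, assuming filename_states sits on a shared filesystem visible to every worker, so each rank loads the same file itself rather than receiving a broadcast:

# Assumption: every worker can read filename_states (e.g. shared FS).
# Each rank then loads identical optimizer states directly from disk,
# instead of rank 0 loading and broadcasting them.
trainer = hvd.DistributedTrainer(params, mx.optimizer.create('sgd'))
trainer.load_states(filename_states)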
- In the definition of the DataLoader, for pinned device memory, I assume I must use hvd.rank(), not hvd.local_rank(), right?
Edit: in issue 14136, user yuxihu mentions that we should use hvd.local_rank(). Could someone please explain this a bit more? Does kv.store.rank() provide the "same" ID as hvd.local_rank()?
data_generator = gluon.data.DataLoader(some_dataset,
                                       batch_size=batch_per_gpu,
                                       sampler=SplitSampler(len(some_dataset),
                                                            hvd.size(),
                                                            hvd.rank()),
                                       pin_memory=True,
                                       pin_device_id=hvd.rank(),
                                       last_batch='discard',
                                       num_workers=0)
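For reference, this is my current understanding of the two IDs, which might explain the hvd.local_rank() suggestion; please correct me if it's wrong:

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
# hvd.rank(): global worker index, 0 .. hvd.size()-1 across ALL nodes;
# appropriate for sharding the dataset (e.g. the SplitSampler part index).
shard_index = hvd.rank()
# hvd.local_rank(): worker index WITHIN one node, 0 .. GPUs_per_node-1;
# the only one guaranteed to be a valid physical device id, e.g. for
# mx.gpu(...) or pin_device_id, since hvd.rank() can exceed the number
# of GPUs on a single node in multi-node jobs.
ctx = mx.gpu(hvd.local_rank())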
Thank you very much for your time.