How to make gradient accumulation work in MXNet?

Hi, I’d like to accumulate gradients over N minibatches and update the weights only every N-th minibatch, so I can handle effective batches larger than GPU memory. I’m following this forum post, but details are lacking and I can’t make it work.

Let’s take an example with N = 3.
To know when to aggregate the gradients and update the weights, I maintain a batch_counter that is incremented at every batch.

First, I configure the net this way:

for p in net.collect_params().values():
    if p.grad_req != 'null':
        p.grad_req = 'add'  # accumulate gradients across backward passes instead of overwriting

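As a sanity check (a minimal sketch of my own, using a throwaway Dense layer rather than my SSD), 'add' really does sum the gradients across backward calls until they are explicitly zeroed:

from mxnet import nd, autograd, gluon

toy = gluon.nn.Dense(1)
toy.initialize()
for p in toy.collect_params().values():
    p.grad_req = 'add'

x = nd.ones((2, 3))
for _ in range(2):
    with autograd.record():
        y = toy(x).sum()
    y.backward()  # with grad_req='add', this adds into the existing gradient

# the gradient now holds the sum over both passes; it has to be cleared
# with zero_grad() before the next accumulation window
for p in toy.collect_params().values():
    p.zero_grad()
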
Then at every mini-batch I run this:

if batch_counter == 3:
    trainer.step(3 * batch_size)  # normalize by the effective (accumulated) batch size
    for p in net.collect_params().values():
        if p.grad_req != 'null':
            p.zero_grad()  # reset the accumulated gradients
    batch_counter = 0

Doing this doesn’t train the SSD: loss and mAP are erratic. When I do the update every single batch (the classical minibatch setting, without accumulation), it trains correctly.

Can someone explain to me how to make gradient accumulation work in MXNet? There needs to be a better tutorial for this, given how useful and important the feature is.
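
To make the question concrete, here is the complete loop I think the snippets above are supposed to expand into (a sketch only; batch_size, SCE, net, trainer and ctx are placeholders from my setup):

N = 3  # accumulate over N minibatches
batch_counter = 0
for data, label in train_data:
    data = data.as_in_context(ctx)
    label = label.as_in_context(ctx)

    with autograd.record():
        loss = SCE(net(data), label)
    loss.backward()  # with grad_req='add', gradients accumulate across batches

    batch_counter += 1
    if batch_counter == N:
        trainer.step(N * batch_size)  # normalize by the effective batch size
        for p in net.collect_params().values():
            if p.grad_req != 'null':
                p.zero_grad()  # clear gradients for the next accumulation window
        batch_counter = 0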

I’m tempted to do the same thing as what is done for multi-gpu, eg along the lines of:

# loop through epochs
for e in range(3):  # 3 epochs
    # loop through logical batches (DataLoader done at macro-batch scale)
    for (data, label) in train_data:

        # split into microbatches (GPU-level batches)
        data = nd.split(data, num_outputs=4, axis=0)
        label = nd.split(label, num_outputs=4, axis=0)
        # compute losses and accumulate gradients
        for D, L in zip(data, label):
            # copy data to device
            D = D.as_in_context(ctx)
            L = L.as_in_context(ctx)

            with autograd.record():
                output = net(D)
                loss = SCE(output, L)

            # backprop; gradients accumulate thanks to grad_req='add'
            loss.backward()
            accuracy.update(L, output)

        # one update per logical batch, normalizing by the full batch size
        trainer.step(batch_size)  # full batch (e.g. 4x the microbatch size here)
        # reset accumulated gradients before the next logical batch
        for p in net.collect_params().values():
            if p.grad_req != 'null':
                p.zero_grad()

Thoughts on this approach? Does it look correct?


Can you please post your complete code? Looking at an old version of my code, this is the forward/backward step that works for me (parallel multi-GPU computing). Note that my models are not pre-trained (therefore I can’t think of a case where I would see a grad_req='null' value), and I never had to check whether the initial grad_req is 'null' as you do. Here _nbatch is the total batch size, and _data, _label come from the gluon.utils.split_and_load function (they are lists of nd.arrays).

delay_rate = 8  # delay rate for averaging the gradients

def forward_backward_step(_iteration, _nbatch, _net, _data, _label):
    with autograd.record():
        # first argument is PREDICTIONS, second is LABELS
        losses = [SomeLossFunction(_net(inputs), labels)
                  for inputs, labels in zip(_data, _label)]

    # this is outside the autograd.record scope
    for l in losses:  # evaluate gradients in each ctx
        l.backward()

    # this updates the parameters across ALL devices, by first aggregating the gradients. <3 Gluon!
    if _iteration % delay_rate == 0:
        trainer.step(_nbatch * delay_rate)
        # reset the accumulated gradients after the delayed update
        for param in _net.collect_params().values():
            param.zero_grad()

    return losses

This is used in something like:

mynet = SomeNetDefinition()  # placeholder for your network definition
# gradient accumulation requires grad_req='add' (set unconditionally, as noted above)
for p in mynet.collect_params().values():
    p.grad_req = 'add'

Nbatch = batch_per_gpu * len(ctx)  # total batch size across all available GPUs
for idx, (data, label) in enumerate(SomeDataLoader):
    data = gluon.utils.split_and_load(data, ctx)
    label = gluon.utils.split_and_load(label, ctx)
    losses = forward_backward_step(idx, Nbatch, mynet, data, label)
    # do other stuff / monitoring etc.

If you can post a working part of your code (even multi-GPU), I can test it and help.