How to make gradient accumulation work in MXNet?

Hi, I’d like to accumulate gradients over N minibatches and update the weights only every N-th minibatch, so I can handle effective batches larger than GPU memory. I’m following this forum post, but details are lacking and I can’t make it work.

Let’s take an example with N = 3.
To know when to aggregate the gradients and update the weights, I maintain a batch_counter that is incremented at every batch.

First, I configure the net this way:

for p in net.collect_params().values():
    if p.grad_req != 'null':
        p.grad_req = 'add'  # accumulate gradients across backward passes instead of overwriting

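As a sanity check (a minimal sketch of my own, using a throwaway Dense layer rather than my SSD), 'add' really does sum the gradients across backward calls until they are explicitly zeroed:

from mxnet import nd, autograd, gluon

toy = gluon.nn.Dense(1)
toy.initialize()
for p in toy.collect_params().values():
    p.grad_req = 'add'

x = nd.ones((2, 3))
for _ in range(2):
    with autograd.record():
        y = toy(x).sum()
    y.backward()  # with grad_req='add', this adds into the existing gradient

# the gradient now holds the sum over both passes; it has to be cleared
# with zero_grad() before the next accumulation window
for p in toy.collect_params().values():
    p.zero_grad()
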
Then at every mini-batch I run this:

if batch_counter == 3:
    trainer.step(3 * batch_size)  # normalize by the effective (accumulated) batch size
    for p in net.collect_params().values():
        if p.grad_req != 'null':
            p.zero_grad()  # reset the accumulated gradients
    batch_counter = 0

Doing this doesn’t train the SSD: loss and mAP are erratic. When I do the update every single batch (the classical minibatch setting, without accumulation), it trains correctly.

Can someone explain to me how to make gradient accumulation work in MXNet? There needs to be a better tutorial for this, given how useful and important the feature is.
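
To make the question concrete, here is the complete loop I think the snippets above are supposed to expand into (a sketch only; batch_size, SCE, net, trainer and ctx are placeholders from my setup):

N = 3  # accumulate over N minibatches
batch_counter = 0
for data, label in train_data:
    data = data.as_in_context(ctx)
    label = label.as_in_context(ctx)

    with autograd.record():
        loss = SCE(net(data), label)
    loss.backward()  # with grad_req='add', gradients accumulate across batches

    batch_counter += 1
    if batch_counter == N:
        trainer.step(N * batch_size)  # normalize by the effective batch size
        for p in net.collect_params().values():
            if p.grad_req != 'null':
                p.zero_grad()  # clear gradients for the next accumulation window
        batch_counter = 0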

I’m tempted to do the same thing as what is done for multi-gpu, eg along the lines of:

# loop through epochs
for e in range(3):  # 3 epochs
    # loop through logical batches (DataLoader done at macro-batch scale)
    for (data, label) in train_data:

        # split into microbatches (GPU-level batches)
        data = nd.split(data, num_outputs=4, axis=0)
        label = nd.split(label, num_outputs=4, axis=0)
        # compute losses and accumulate gradients
        for D, L in zip(data, label):
            # copy data to device
            D = D.as_in_context(ctx)
            L = L.as_in_context(ctx)

            with autograd.record():
                output = net(D)
                loss = SCE(output, L)

            # backprop; gradients accumulate thanks to grad_req='add'
            loss.backward()
            accuracy.update(L, output)

        # one update per logical batch, normalizing by the full batch size
        trainer.step(batch_size)  # full batch (e.g. 4x the microbatch size here)
        # reset accumulated gradients before the next logical batch
        for p in net.collect_params().values():
            if p.grad_req != 'null':
                p.zero_grad()

Thoughts on this approach? Does it look correct?


Can you please post your complete code? Looking at an old version of my code, this is the forward/backward step that works for me (parallel multi-GPU computing). Note that my models are not pre-trained (therefore I can’t think of a case where I would see a grad_req='null' value), and I never had to check whether the initial grad_req is 'null' as you do. Here _nbatch is the total batch size, and _data, _label come from the gluon.utils.split_and_load function (they are lists of nd.arrays).

delay_rate = 8  # delay rate for averaging the gradients

def forward_backward_step(_iteration, _nbatch, _net, _data, _label):
    with autograd.record():
        # first argument is PREDICTIONS, second is LABELS
        losses = [SomeLossFunction(_net(inputs), labels)
                  for inputs, labels in zip(_data, _label)]

    # this is outside the autograd.record scope
    for l in losses:  # evaluate gradients in each ctx
        l.backward()

    # this updates the parameters across ALL devices, by first aggregating the gradients. <3 Gluon!
    if _iteration % delay_rate == 0:
        trainer.step(_nbatch * delay_rate)
        # reset the accumulated gradients after the delayed update
        for param in _net.collect_params().values():
            param.zero_grad()

    return losses

This is used in something like:

mynet = SomeNetDefinition()  # placeholder for your network definition
# gradient accumulation requires grad_req='add' (set unconditionally, as noted above)
for p in mynet.collect_params().values():
    p.grad_req = 'add'

Nbatch = batch_per_gpu * len(ctx)  # total batch size across all available GPUs
for idx, (data, label) in enumerate(SomeDataLoader):
    data = gluon.utils.split_and_load(data, ctx)
    label = gluon.utils.split_and_load(label, ctx)
    losses = forward_backward_step(idx, Nbatch, mynet, data, label)
    # do other stuff / monitoring etc.

If you can post a working part of your code (even multi-GPU), I can test it and help.