How to accumulate gradients over multiple mini-batches in Keras-MXNet

I am working with very large volumetric data, so I can fit only 8 samples in one batch. However, I want to update the weights only after accumulating the gradients over 32 or 64 samples. I previously used this implementation successfully with Keras + TensorFlow, but when I try it with Keras + MXNet, I get the following error:

  File "...\lib\site-packages\mxnet\module\", line 523, in init_optimizer
    assert isinstance(optimizer, opt.Optimizer)
AssertionError

How can this be solved?

Keras lets you define custom optimizers, but this does not work with MXNet as the backend, because the optimizer is overridden with an MXNet-specific optimizer. That is probably why you get this error: the assertion in the traceback only accepts an instance of MXNet's own Optimizer class.
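To make the failing check concrete, here is a minimal sketch that mimics the assertion from the traceback. The classes `BackendOptimizer`, `KerasStyleOptimizer` and `DerivedOptimizer` are hypothetical stand-ins, not real Keras or MXNet classes; only the `isinstance` check itself is taken from the error message.

```python
class BackendOptimizer:
    """Hypothetical stand-in for the backend's optimizer base class."""


class KerasStyleOptimizer:
    """Hypothetical custom optimizer that does NOT derive from the backend base."""


class DerivedOptimizer(BackendOptimizer):
    """Hypothetical custom optimizer that derives from the backend base."""


def init_optimizer(optimizer):
    # Same kind of check as in the traceback: only backend optimizers pass.
    assert isinstance(optimizer, BackendOptimizer)
    return optimizer


def is_accepted(optimizer):
    """Return True if init_optimizer accepts the object, False otherwise."""
    try:
        init_optimizer(optimizer)
        return True
    except AssertionError:
        return False


print(is_accepted(KerasStyleOptimizer()))  # rejected by the isinstance check
print(is_accepted(DerivedOptimizer()))     # accepted, because it subclasses the base
```

If this reading of the error is right, a custom optimizer would have to derive from the backend's own optimizer class rather than from the Keras one.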

Thank you for your answer. Is there no way to add custom optimizers that can work with the MXNet backend?

When looking at mxnet/optimizer/, I found the code for the Adam optimizer:

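import math

from mxnet.ndarray import NDArray, zeros, adam_update
from mxnet.optimizer import Optimizer


class Adam(Optimizer):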
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                 lazy_update=True, **kwargs):
        super(Adam, self).__init__(learning_rate=learning_rate, **kwargs)
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.lazy_update = lazy_update

    def create_state(self, index, weight):
        stype = weight.stype if self.lazy_update else 'default'
        return (zeros(weight.shape, weight.context, dtype=weight.dtype,
                      stype=stype),  # mean
                zeros(weight.shape, weight.context, dtype=weight.dtype,
                      stype=stype))  # variance

    def update(self, index, weight, grad, state):
        assert(isinstance(weight, NDArray))
        assert(isinstance(grad, NDArray))
        # Increment the per-parameter step counter that is read as t below.
        self._update_count(index)
        lr = self._get_lr(index)
        wd = self._get_wd(index)

        t = self._index_update_count[index]
        coef1 = 1. - self.beta1**t
        coef2 = 1. - self.beta2**t
        lr *= math.sqrt(coef2)/coef1

        kwargs = {'beta1': self.beta1, 'beta2': self.beta2, 'epsilon': self.epsilon,
                  'rescale_grad': self.rescale_grad}
        if self.clip_gradient:
            kwargs['clip_gradient'] = self.clip_gradient

        mean, var = state
        adam_update(weight, grad, mean, var, out=weight,
                    lazy_update=self.lazy_update, lr=lr, wd=wd, **kwargs)

How could this be changed to accumulate the gradients over several mini-batches before updating the weights?
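One possible direction, sketched here in plain Python rather than against MXNet itself: keep a gradient buffer per parameter in the optimizer state and call the real update only every N-th time. The sketch mirrors the `create_state(index, weight)` and `update(index, weight, grad, state)` interface of the Adam class above, but `PlainSGD` and `AccumulatingOptimizer` are hypothetical names, the weights are plain lists instead of NDArrays, and nothing here has been run against MXNet. It also assumes every mini-batch has the same size and that each incoming gradient is already normalized per mini-batch, so averaging the buffer matches one large batch.

```python
class PlainSGD:
    """Hypothetical stand-in for a backend optimizer (not MXNet code)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate

    def create_state(self, index, weight):
        return None  # plain SGD keeps no per-parameter state

    def update(self, index, weight, grad, state):
        # In-place update, similar in spirit to out=weight in the Adam code.
        for i, g in enumerate(grad):
            weight[i] -= self.learning_rate * g


class AccumulatingOptimizer:
    """Hypothetical wrapper: run the inner update every `accumulate_steps` calls."""

    def __init__(self, inner, accumulate_steps):
        self.inner = inner
        self.accumulate_steps = accumulate_steps
        self._calls = {}  # parameter index -> gradients seen since last update

    def create_state(self, index, weight):
        buffer = [0.0] * len(weight)  # running sum of gradients
        return (buffer, self.inner.create_state(index, weight))

    def update(self, index, weight, grad, state):
        buffer, inner_state = state
        for i, g in enumerate(grad):
            buffer[i] += g
        self._calls[index] = self._calls.get(index, 0) + 1
        if self._calls[index] < self.accumulate_steps:
            return  # keep accumulating; the weights stay untouched
        # Average the buffer, apply one real update, then reset.
        mean_grad = [g / self.accumulate_steps for g in buffer]
        self.inner.update(index, weight, mean_grad, inner_state)
        for i in range(len(buffer)):
            buffer[i] = 0.0
        self._calls[index] = 0


# Usage: 4 accumulation steps of 8 samples each would give an effective batch of 32.
opt = AccumulatingOptimizer(PlainSGD(learning_rate=0.5), accumulate_steps=4)
weight = [1.0, 2.0]
state = opt.create_state(0, weight)
history = []
for grad in ([1.0, 0.0], [1.0, 2.0], [1.0, 0.0], [1.0, 2.0]):
    opt.update(0, weight, grad, state)
    history.append(list(weight))
print(history)  # the weights change only on the fourth call
```

Because the inner update runs only on real steps, a step counter such as the `t` used for Adam's bias correction would advance once per accumulated update rather than once per mini-batch. To get past the assertion in the traceback, an MXNet version of this wrapper would presumably have to subclass MXNet's Optimizer and keep the buffer as an NDArray.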