I am working with very large volumetric data, such that I can only fit 8 samples in one batch. However, I only want to accumulate the gradients after 32 or 64 samples. I used this implementation (https://stackoverflow.com/questions/55268762/how-to-accumulate-gradients-for-large-batch-sizes-in-keras) successfully before with Keras + Tensorflow, but when I try it with Keras + MXNet, I get the following error:
File "...\lib\site-packages\mxnet\module\module.py", line 523, in init_optimizer
    assert isinstance(optimizer, opt.Optimizer)
AssertionError
How can this be solved?
You can define custom optimizers in Keras, but this does not work with MXNet as the backend: the Keras optimizer is replaced with an MXNet-specific optimizer, and anything that is not an instance of MXNet's own Optimizer class fails the isinstance check in init_optimizer. That is most likely why you get this error.
Thank you for your answer. Is there no way to add custom optimizers that can work with the MXNet backend?
Looking at mxnet/optimizer/optimizer.py, I found the code for the Adam optimizer:
class Adam(Optimizer):
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                 lazy_update=True, **kwargs):
        super(Adam, self).__init__(learning_rate=learning_rate, **kwargs)
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.lazy_update = lazy_update

    def create_state(self, index, weight):
        stype = weight.stype if self.lazy_update else 'default'
        return (zeros(weight.shape, weight.context, dtype=weight.dtype,
                      stype=stype),  # mean
                zeros(weight.shape, weight.context, dtype=weight.dtype,
                      stype=stype))  # variance

    def update(self, index, weight, grad, state):
        assert(isinstance(weight, NDArray))
        assert(isinstance(grad, NDArray))
        self._update_count(index)
        lr = self._get_lr(index)
        wd = self._get_wd(index)
        t = self._index_update_count[index]
        coef1 = 1. - self.beta1**t
        coef2 = 1. - self.beta2**t
        lr *= math.sqrt(coef2)/coef1
        kwargs = {'beta1': self.beta1, 'beta2': self.beta2, 'epsilon': self.epsilon,
                  'rescale_grad': self.rescale_grad}
        if self.clip_gradient:
            kwargs['clip_gradient'] = self.clip_gradient
        mean, var = state
        adam_update(weight, grad, mean, var, out=weight,
                    lazy_update=self.lazy_update, lr=lr, wd=wd, **kwargs)
How could this be changed so that it accumulates the gradients over several mini-batches before applying the weight update?
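The general pattern would be to give the optimizer state an extra gradient accumulator and a step counter: each call to update() adds the incoming gradient to the accumulator, and only every N-th call averages the buffer and performs the actual Adam step. The following is a minimal NumPy sketch of that pattern, not actual MXNet code — the class name AccumAdam and the accum_steps parameter are hypothetical, and the real subclass would override MXNet's create_state() and update(index, weight, grad, state) signatures and keep calling adam_update for the final step.

```python
import numpy as np

class AccumAdam:
    """Adam variant that accumulates gradients over `accum_steps`
    mini-batches and applies one averaged update when the buffer is full.
    (Hypothetical NumPy sketch of the pattern, not the MXNet API.)"""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, accum_steps=4):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.accum_steps = accum_steps
        self.t = 0  # number of *applied* updates, for bias correction

    def create_state(self, weight):
        # mean, variance, gradient accumulator, accumulated-batch counter
        return [np.zeros_like(weight), np.zeros_like(weight),
                np.zeros_like(weight), 0]

    def update(self, weight, grad, state):
        mean, var, acc, n = state
        acc += grad          # accumulate instead of updating immediately
        n += 1
        state[3] = n
        if n < self.accum_steps:
            return weight    # buffer not full yet: skip the weight update

        g = acc / self.accum_steps  # average gradient over the buffer
        self.t += 1
        mean[:] = self.beta1 * mean + (1.0 - self.beta1) * g
        var[:] = self.beta2 * var + (1.0 - self.beta2) * g * g
        # bias-corrected learning rate, as in the MXNet code above
        lr = self.lr * np.sqrt(1.0 - self.beta2**self.t) / (1.0 - self.beta1**self.t)
        weight -= lr * mean / (np.sqrt(var) + self.epsilon)
        acc[:] = 0           # reset the buffer for the next accumulation cycle
        state[3] = 0
        return weight
```

With accum_steps=4, three calls to update() leave the weights untouched and the fourth performs one Adam step on the averaged gradient, so the effective batch size is 4x the mini-batch size. Note that rescale_grad, weight decay, and gradient clipping from the real MXNet update() are omitted here for brevity and would need the same treatment.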