MXNet - Use Batch Norm for Input Scaling

dmadeka · March 26, 2019, 6:49am

If I use batchnorm with global stats and fix the gamma - that should be a reasonable approximation of feature normalization, except for the beta. Is there any way to fix the beta of BatchNorm as well?

thomelane · April 3, 2019, 6:24pm

Hi @dmadeka,

Correction: use beta.lr_mult=0 instead of center=False. See next answer for details.

Sure, that’s possible. You need to set center=False and scale=False on the BatchNorm layer if you want to zero-center the data and scale it to unit variance. Overall the effect will be similar to standard data input normalisation, but subtly different. Instead of using the true normalisation statistics of the training data (calculated across the whole dataset), the BatchNorm global stats (often called the ‘running stats’) will be updated iteratively. So there will certainly be differences at the start of training (since the running stats start from their initial values rather than the dataset statistics), and depending on your momentum parameter, the statistics may be skewed towards the most recent batches of data (so it might be a good idea to increase the momentum a little).

mx.gluon.nn.BatchNorm(center=False, scale=False, momentum=0.99)
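To illustrate the point about running stats and momentum, here is a minimal pure-Python sketch (no MXNet needed). It assumes the usual moving-average update, running = momentum * running + (1 - momentum) * batch_mean, with the running mean starting at 0. The data, batch size and the running_mean helper are made up for illustration.

```python
import random

def running_mean(batches, momentum, init=0.0):
    """Exponential moving average of per-batch means.

    Assumed update rule (hypothetical helper, not an MXNet API):
        running = momentum * running + (1 - momentum) * batch_mean
    """
    running = init
    for batch in batches:
        batch_mean = sum(batch) / len(batch)
        running = momentum * running + (1.0 - momentum) * batch_mean
    return running

random.seed(0)
# Made-up 1-D feature with a true mean of roughly 5.0
data = [random.gauss(5.0, 2.0) for _ in range(64 * 500)]
batches = [data[i:i + 64] for i in range(0, len(data), 64)]
true_mean = sum(data) / len(data)

# After only 10 batches the running mean is still close to its initial value
few = running_mean(batches[:10], momentum=0.99)
# After all 500 batches it has moved close to the dataset mean
many = running_mean(batches, momentum=0.99)

print(true_mean, few, many)
```

With a high momentum the estimate is smooth but slow to leave its starting value, so the first batches see poorly normalised inputs. A lower momentum converges faster but follows the most recent batches more closely.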

dmadeka · April 3, 2019, 8:39pm

thomelane:

Sure, that’s possible. You need to set center=False and scale=False on the BatchNorm layer if you want to zero-center the data and scale it to unit variance. Overall the effect will be similar to standard data input normalisation, but subtly different. Instead of using the true normalisation statistics of the training data (calculated across the whole dataset), the BatchNorm global stats (often called the ‘running stats’) will be updated iteratively. So there will certainly be differences at the start of training (since the running stats start from their initial values rather than the dataset statistics), and depending on your momentum parameter, the statistics may be skewed towards the most recent batches of data (so it might be a good idea to increase the momentum a little).

That doesn’t really work unfortunately

thomelane · April 3, 2019, 10:48pm

My mistake! So checking again, it turns out that setting center=False and scale=False on mx.gluon.nn.BatchNorm also disables the initial zero-centering and unit variance scaling (before scaling by gamma and shifting by beta). You get all or nothing: either both the initial zero-centering and the beta shift, or neither.

My first thought was to set .grad_req to null to avoid gradient calculation of beta and gamma, but this once again disables the initial scaling and shifting too. Given this, my recommendation would be to set the learning rate multipliers for beta and gamma to 0 using lr_mult. The running stats are still calculated, but beta and gamma don’t change from their initial values of 0 and 1 respectively (i.e. no beta shifting or gamma scaling).

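from mxnet import gluon

class SimpleNet(gluon.nn.HybridBlock):
    def __init__(self, **kwargs):
        super(SimpleNet, self).__init__(**kwargs)

        with self.name_scope():
            self.bn = gluon.nn.BatchNorm()
            # Freeze beta and gamma at their initial values (0 and 1) by
            # zeroing their learning rate multipliers; running stats still update.
            self.bn.beta.lr_mult = 0
            self.bn.gamma.lr_mult = 0
            self.dense = gluon.nn.Dense(1)

    def hybrid_forward(self, F, x):
        x1 = self.bn(x)      # normalised input
        x2 = self.dense(x1)  # prediction from the normalised input
        return x1, x2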
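As a sanity check on why freezing beta and gamma gives plain feature normalisation, here is a small pure-Python sketch of the batch norm formula, y = gamma * (x - mean) / sqrt(var + eps) + beta. The batch_norm and standardise helpers are hypothetical and written only for this illustration, and eps=1e-5 is an assumed value.

```python
import math

def batch_norm(xs, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch norm formula applied to a single feature (illustrative helper)."""
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

def standardise(xs, mean, var, eps=1e-5):
    """Plain zero-mean, unit-variance scaling."""
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

xs = [2.0, 4.0, 6.0, 8.0]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

# gamma and beta stuck at 1 and 0 (what lr_mult = 0 is meant to achieve)
frozen = batch_norm(xs, mean, var)
# gamma and beta after some hypothetical training updates
learned = batch_norm(xs, mean, var, gamma=1.5, beta=0.3)

print(frozen)
print(learned)
```

With gamma at 1 and beta at 0 the output matches plain standardisation. Once they are allowed to train, the output is rescaled and shifted away from zero mean and unit variance.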
