Batchnorm gradient

I have a network consisting only of a batchnorm layer. The gradient I get for batchnorm0_gamma after running a backward pass is different from the one I compute manually. My work is detailed in this notebook:
https://colab.research.google.com/github/x110/DLToolboxImg/blob/master/BatchNormMxnet.ipynb

Please advise.

import mxnet as mx
import numpy as np
X  = mx.nd.array([[ 0.18527887],[-1.23678724]])
Y = mx.nd.array([[ 2.57767984],[-1.55019435]])
#define network
source = mx.sym.Variable("data")
target = mx.sym.Variable("softmax_label")
network = mx.sym.BatchNorm(source)
network=mx.sym.LinearRegressionOutput(network,target)
input_shapes = {'data': (2, 1), 'softmax_label': (2, 1)}
exe = network.simple_bind(ctx=mx.cpu(), **input_shapes)
arg_arrays = dict(zip(network.list_arguments(), exe.arg_arrays))
x = arg_arrays['data']
t = arg_arrays['softmax_label']
#forward pass
x[:] = X
t[:] = Y
y = exe.forward(is_train=True)
#backwardpass
exe.backward()
exe.grad_dict['batchnorm0_beta'],exe.grad_dict['batchnorm0_gamma']

The output I get is:
( [-1.0274856] <NDArray 1 @cpu(0)>, [0.] <NDArray 1 @cpu(0)>)

When I calculate the gradients manually, the output I get is:

xi = x.asnumpy()
a = np.mean(xi)
b = np.var(xi)
xn = (xi-a)/np.sqrt(b+1e-5)
beta, alpha = exe.arg_dict['batchnorm0_beta'].asnumpy(),exe.arg_dict['batchnorm0_gamma'].asnumpy()
ynorm = alpha * xn+beta
#backwardpass manually
2*np.mean((ynorm-t.asnumpy())),2*np.mean((ynorm-t.asnumpy())*xn)

(-1.0274856090545654, -2.127872943878174)

The first gradient is the same, but the second is not.

Hi,

The gradients differ because the BatchNorm operator defaults to fix_gamma=True, which fixes gamma at 1 and reports a zero gradient for it. You need to pass fix_gamma=False to make gamma learnable. See https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.BatchNorm for more details.
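
For intuition, here is a minimal NumPy sketch (not MXNet's actual implementation) of the chain rule your manual check implements. It assumes the gradient flowing back from LinearRegressionOutput into the BatchNorm output is (y - t) per element, which your 2*np.mean(...) expressions reproduce for a batch of two (2*mean equals sum when there are two samples), and it keeps gamma explicit; batchnorm_param_grads is a hypothetical helper, not an MXNet API:

import numpy as np

def batchnorm_param_grads(x, t, gamma, beta, eps=1e-3):
    # hypothetical helper for illustration only
    xn = (x - x.mean()) / np.sqrt(x.var() + eps)  # normalized input
    y = gamma * xn + beta                         # BatchNorm output
    dy = y - t                                    # assumed upstream gradient from the squared loss
    dbeta = dy.sum()                              # dL/dbeta  = sum_i dL/dy_i
    dgamma = (dy * xn).sum()                      # dL/dgamma = sum_i dL/dy_i * xn_i
    return dbeta, dgamma

With fix_gamma=True, MXNet sets gamma to 1 in the forward pass and writes a zero gradient into grad_dict['batchnorm0_gamma'] regardless of the data, which is the [0.] you observed; the beta gradient is still computed as above, which is why your first number already matched.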

Changing your code slightly to include that (and using eps=1e-3, BatchNorm's default epsilon, in the manual check) gives matching answers:

import mxnet as mx
import numpy as np
X  = mx.nd.array([[ 0.18527887],[-1.23678724]])
Y = mx.nd.array([[ 2.57767984],[-1.55019435]])
#define network
source = mx.sym.Variable("data")
target = mx.sym.Variable("softmax_label")
network = mx.sym.BatchNorm(source, fix_gamma=False)
network=mx.sym.LinearRegressionOutput(network,target)
input_shapes = {'data': (2, 1), 'softmax_label': (2, 1)}
exe = network.simple_bind(ctx=mx.cpu(), **input_shapes)
arg_arrays = dict(zip(network.list_arguments(), exe.arg_arrays))

x = arg_arrays['data']
t = arg_arrays['softmax_label']
#forward pass
x[:] = X
t[:] = Y

y = exe.forward(is_train=True)
#backwardpass
exe.backward()
print(exe.grad_dict['batchnorm0_beta'],exe.grad_dict['batchnorm0_gamma'])


xi = X.asnumpy()
a = np.mean(xi)

b = np.var(xi)
xn = (xi-a)/np.sqrt(b+1e-3)
beta, alpha = exe.arg_dict['batchnorm0_beta'].asnumpy(),exe.arg_dict['batchnorm0_gamma'].asnumpy()
ynorm = alpha * xn+beta
#backwardpass manually
print(2*np.mean((ynorm-t.asnumpy())),2*np.mean((ynorm-t.asnumpy())*xn))

This prints:

    ([-1.0274855] <NDArray 1 @cpu(0)>, [-4.123798] <NDArray 1 @cpu(0)>)
    (-1.0274854898452759, -4.12379789352417)