Batchnorm gradient

I have a network consisting only of a batchnorm layer. The gradient I get for batchnorm0_gamma after running a backward pass is different from the one I compute manually. My work is detailed in this notebook:
https://colab.research.google.com/github/x110/DLToolboxImg/blob/master/BatchNormMxnet.ipynb

Please advise.

import mxnet as mx
import numpy as np
X  = mx.nd.array([[ 0.18527887],[-1.23678724]])
Y = mx.nd.array([[ 2.57767984],[-1.55019435]])
#define network
source = mx.sym.Variable("data")
target = mx.sym.Variable("softmax_label")
network = mx.sym.BatchNorm(source)
network=mx.sym.LinearRegressionOutput(network,target)
input_shapes = {'data': (2, 1), 'softmax_label': (2, 1)}
exe = network.simple_bind(ctx=mx.cpu(), **input_shapes)
arg_arrays = dict(zip(network.list_arguments(), exe.arg_arrays))
x = arg_arrays['data']
t = arg_arrays['softmax_label']
#forward pass
x[:] = X
t[:] = Y
y = exe.forward(is_train=True)
#backwardpass
exe.backward()
exe.grad_dict['batchnorm0_beta'],exe.grad_dict['batchnorm0_gamma']

The output I get is:
( [-1.0274856] <NDArray 1 @cpu(0)>, [0.] <NDArray 1 @cpu(0)>)

When I calculate the gradients manually, the output I get is:

xi = x.asnumpy()
a = np.mean(xi)
b = np.var(xi)
xn = (xi-a)/np.sqrt(b+1e-5)
beta, alpha = exe.arg_dict['batchnorm0_beta'].asnumpy(),exe.arg_dict['batchnorm0_gamma'].asnumpy()
ynorm = alpha * xn+beta
#backwardpass manually
2*np.mean((ynorm-t.asnumpy())),2*np.mean((ynorm-t.asnumpy())*xn)

(-1.0274856090545654, -2.127872943878174)

The first gradient is the same, but the second is not.

Hi,

The gradients differ because the BatchNorm operator defaults to fix_gamma=True, which fixes gamma at 1 and reports a zero gradient for it. You need to pass fix_gamma=False to make gamma learnable. See https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.BatchNorm for more details.
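
For intuition, here is a minimal NumPy sketch (not MXNet's actual implementation) of the chain rule your manual check implements. It assumes the gradient flowing back from LinearRegressionOutput into the BatchNorm output is (y - t) per element, which your 2*np.mean(...) expressions reproduce for a batch of two (2*mean equals sum when there are two samples), and it keeps gamma explicit; batchnorm_param_grads is a hypothetical helper, not an MXNet API:

import numpy as np

def batchnorm_param_grads(x, t, gamma, beta, eps=1e-3):
    # hypothetical helper for illustration only
    xn = (x - x.mean()) / np.sqrt(x.var() + eps)  # normalized input
    y = gamma * xn + beta                         # BatchNorm output
    dy = y - t                                    # assumed upstream gradient from the squared loss
    dbeta = dy.sum()                              # dL/dbeta  = sum_i dL/dy_i
    dgamma = (dy * xn).sum()                      # dL/dgamma = sum_i dL/dy_i * xn_i
    return dbeta, dgamma

With fix_gamma=True, MXNet sets gamma to 1 in the forward pass and writes a zero gradient into grad_dict['batchnorm0_gamma'] regardless of the data, which is the [0.] you observed; the beta gradient is still computed as above, which is why your first number already matched.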

Changing your code slightly to include that (and using eps=1e-3, BatchNorm's default epsilon, in the manual check) gives matching answers:

import mxnet as mx
import numpy as np
X  = mx.nd.array([[ 0.18527887],[-1.23678724]])
Y = mx.nd.array([[ 2.57767984],[-1.55019435]])
#define network
source = mx.sym.Variable("data")
target = mx.sym.Variable("softmax_label")
network = mx.sym.BatchNorm(source, fix_gamma=False)
network=mx.sym.LinearRegressionOutput(network,target)
input_shapes = {'data': (2, 1), 'softmax_label': (2, 1)}
exe = network.simple_bind(ctx=mx.cpu(), **input_shapes)
arg_arrays = dict(zip(network.list_arguments(), exe.arg_arrays))

x = arg_arrays['data']
t = arg_arrays['softmax_label']
#forward pass
x[:] = X
t[:] = Y

y = exe.forward(is_train=True)
#backwardpass
exe.backward()
print(exe.grad_dict['batchnorm0_beta'],exe.grad_dict['batchnorm0_gamma'])


xi = X.asnumpy()
a = np.mean(xi)

b = np.var(xi)
xn = (xi-a)/np.sqrt(b+1e-3)
beta, alpha = exe.arg_dict['batchnorm0_beta'].asnumpy(),exe.arg_dict['batchnorm0_gamma'].asnumpy()
ynorm = alpha * xn+beta
#backwardpass manually
print(2*np.mean((ynorm-t.asnumpy())),2*np.mean((ynorm-t.asnumpy())*xn))

This prints:

    ([-1.0274855] <NDArray 1 @cpu(0)>, [-4.123798] <NDArray 1 @cpu(0)>)
    (-1.0274854898452759, -4.12379789352417)