Derivative of Softmax

Hi,

I do not understand the output of this short code:

import mxnet as mx
from mxnet import nd

a = nd.array([[[[0.1]], [[2]]]])
a.shape                      # (1, 2, 1, 1)
a.attach_grad()
with mx.autograd.record():
    a_rslt = nd.softmax(a, axis=1)

a_rslt.backward()
a_rslt

[[[[0.13010848]]

[[0.8698916 ]]]]
<NDArray 1x2x1x1 @cpu(0)>

a.grad

[[[[0.]]

[[0.]]]]
<NDArray 1x2x1x1 @cpu(0)>

Why is the gradient of softmax zero with respect to the input a? Intuitively, I read that as "no matter how I change a, it has no impact on the output of softmax." What is wrong with this interpretation of the output? I assume that MXNet is computing the gradients correctly.

Thanks in advance for any reply

I’m pretty sure this is right. Note that you’re taking the gradient of all of the softmax outputs at once. When an autograd system backpropagates from a multi-element output, it effectively adds all the elements together and computes the gradient of that sum.
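
To make that concrete, here is a minimal sketch (assuming import mxnet as mx; the variable names are just for illustration) showing that backward() on the whole softmax output gives the same gradient as backward() on the sum of the outputs, and that sum is always 1, so the gradient is zero:

import mxnet as mx

a = mx.nd.array([[0.1, 2]])
a.attach_grad()

# backward() on the whole softmax output...
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
sm.backward()
print(a.grad)      # approximately [[0., 0.]]

# ...matches backward() on the sum of the outputs, which is constant (= 1)
with mx.autograd.record():
    total = mx.nd.softmax(a, axis=1).sum()
total.backward()
print(a.grad)      # approximately [[0., 0.]]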

What’s going on is that if you compute the gradient from each individual output element and then sum those gradients, they sum to zero. The intuitive explanation is that softmax constrains its outputs to sum to one, so the gradients contributed by the individual outputs cancel each other out. If you write out the math of the softmax gradient, you can make a more convincing argument for that. If you simply want to test it, though, look at the gradient from each output separately.
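
Writing it out: with s_i = exp(a_i) / sum_k exp(a_k), the softmax Jacobian is

d s_i / d a_j = s_i * (delta_ij - s_j)

and summing over all outputs i for a fixed input j gives

sum_i d s_i / d a_j = s_j - s_j * sum_i s_i = s_j - s_j = 0,

which is exactly the all-zeros gradient you saw.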

Here’s a simplified example:

import mxnet as mx

a = mx.nd.array([[0.1, 2]])
a.attach_grad()

# gradient of the first softmax output alone
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_0 = sm[0, 0]
sm_0.backward()
grad_0 = a.grad.copy()

# gradient of the second softmax output alone
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_1 = sm[0, 1]
sm_1.backward()
grad_1 = a.grad.copy()

print(grad_0)
print(grad_1)
print(grad_0 + grad_1)

What you get is:

[[ 0.11318026 -0.11318026]]
<NDArray 1x2 @cpu(0)>

[[-0.11318026  0.11318021]]
<NDArray 1x2 @cpu(0)>

[[-7.4505806e-09 -5.2154064e-08]]
<NDArray 1x2 @cpu(0)>

Notice that they cancel each other out (within numerical precision).
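
As a side note, if I remember correctly backward() also accepts an explicit head gradient, so you can pick out a single output without slicing and re-running the forward pass. An untested sketch of that:

# pass a one-hot head gradient to select a single output
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
sm.backward(out_grad=mx.nd.array([[1, 0]]))   # gradient of sm[0, 0] alone
print(a.grad)                                 # should match grad_0 above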
