Why is the gradient of softmax 0 with respect to the input a? Intuitively, I interpret that as "no matter how I change a, it has no impact on the output of softmax." What is wrong with this interpretation of the code output? I assume that MXNet is computing the gradients correctly.
I'm pretty sure this is right. Note that you're taking the gradient of all the softmax outputs at once. In an autograd system, taking the gradient of multiple outputs is equivalent to summing all the outputs together and taking the gradient of that sum.
What's going on is that if you compute the gradient from each individual output and add them up, they sum to zero. The intuitive explanation is that because softmax forces all outputs to sum to one, any change that pushes one output up must push the others down, so the gradients from the individual outputs cancel each other out. If you write out the math of the softmax gradient, you can make a more convincing argument for that. However, if you simply want to test it, try looking at the gradient from each output separately.
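Sketching that math briefly: with the standard softmax definition, each column of the Jacobian sums to zero, which is exactly the cancellation described above.

```latex
s_i = \frac{e^{a_i}}{\sum_k e^{a_k}},
\qquad
\frac{\partial s_i}{\partial a_j} = s_i\,(\delta_{ij} - s_j)

\sum_i \frac{\partial s_i}{\partial a_j}
  = \sum_i s_i\,(\delta_{ij} - s_j)
  = s_j - s_j \sum_i s_i
  = s_j - s_j
  = 0
```

Since summing the gradients of all outputs is what a single backward pass over the full softmax output computes, the result is zero everywhere.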
Here’s a simplified example:
import mxnet as mx

a = mx.nd.array([[0.1, 2]])
a.attach_grad()

# Gradient of the first softmax output only
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_0 = sm[0, 0]
sm_0.backward()
grad_0 = a.grad.copy()

# Gradient of the second softmax output only
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_1 = sm[0, 1]
sm_1.backward()
grad_1 = a.grad.copy()

print(grad_0)
print(grad_1)
print(grad_0 + grad_1)  # the per-output gradients cancel to zero
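If you don't have MXNet handy, the same cancellation can be cross-checked analytically with NumPy. This is just a sketch; the `softmax` and `jacobian` helpers below are mine, not part of any library, and the Jacobian formula is the standard one, J[i, j] = s[i] * ((i == j) - s[j]):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift by max for numerical stability
    return e / e.sum()

def jacobian(a):
    # Full softmax Jacobian: J[i, j] = s[i] * (delta_ij - s[j])
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)

a = np.array([0.1, 2.0])
J = jacobian(a)

grad_0 = J[0]  # gradient of softmax output 0 w.r.t. a
grad_1 = J[1]  # gradient of softmax output 1 w.r.t. a
print(grad_0)
print(grad_1)
print(grad_0 + grad_1)  # each column of J sums to zero
```

Each row of `J` matches the gradient you get from calling `backward()` on one softmax output at a time, and the rows sum to the zero vector, which is what a single backward pass over all outputs reports.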