 # Implementation of sigmoid extending mx.autograd.Function

The example in the Python API documentation shows an implementation of sigmoid that supports autograd.

```python
class sigmoid(Function):
    def forward(self, x):
        y = 1 / (1 + mx.nd.exp(-x))
        self.save_for_backward(y)
        return y

    def backward(self, dy):
        # backward takes as many inputs as forward's return value,
        # and returns as many NDArrays as forward's arguments.
        y, = self.saved_tensors
        return y * (1-y)
```

I think the `backward` method should return `dy * y * (1-y)` instead of `y * (1-y)`, shouldn't it?

Yes, I think it should include the `dy` factor. Thank you for pointing this out. Would you like to open a PR to fix it? https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/autograd.py#L375

Why would one add a `dy` term? The derivative of sigmoid w.r.t. `x` is correctly calculated to be `y * (1-y)`, where y = y(x).

@feevos It is because of the chain rule. For example, the following code produces a wrong result without the `dy` factor.

```python
import mxnet as mx
from mxnet import autograd

x = mx.nd.array([1, 2, 3])
x.attach_grad()
f = sigmoid()          # the Function defined above
with autograd.record():
    y = f(x)
    z = y * y
z.backward()
print(x.grad)  # [0.28746966 0.18495613 0.08606823] is the right result.
```
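With the `dy` factor included, the same script prints the expected gradient. A minimal sketch of the corrected class (identical to the doc example except for the return value of `backward`):

```python
import mxnet as mx
from mxnet.autograd import Function

class sigmoid(Function):
    def forward(self, x):
        y = 1 / (1 + mx.nd.exp(-x))
        self.save_for_backward(y)
        return y

    def backward(self, dy):
        y, = self.saved_tensors
        # Chain rule: multiply the local derivative by the incoming gradient.
        return dy * y * (1 - y)
```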

I opened the issue #9872 for this example in the doc.


Thank you @dotelos, I verified the calculation you propose. Wow! So there is a significant difference between the pen-and-paper derivative and how it is computed in software. In addition, I found that when one calls `y.backward()` the value of `dy` defaults to `[1., 1., 1.]`.
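If I read the API correctly, that default is just the head gradient: calling `backward()` with no argument appears to be equivalent to passing a vector of ones through the `out_grad` parameter. A small sketch of that (assuming the `sigmoid` Function defined above):

```python
import mxnet as mx
from mxnet import autograd

x = mx.nd.array([1, 2, 3])
x.attach_grad()
with autograd.record():
    y = sigmoid()(x)
# Passing an explicit head gradient of ones; this should give the
# same x.grad as calling y.backward() with no argument.
y.backward(out_grad=mx.nd.ones_like(y))
print(x.grad)
```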

Do you have any good tutorial/reference to propose (in any deep learning framework), that describes explicitly the differences between theoretical functional forms of derivatives and how these are implemented inside a software library - using a computational graph? I am a bit confused as to what exactly the variable `dy` represents in the definition of the `backward` function.

@feevos I’m not aware of any reference. However, basically it is just the chain rule.

The point is that what `f.backward(dy)` actually calculates is not the derivative of `f` itself. It is the derivative of some unknown function that `f` is composed into. In the case above, `f.backward(dy)` must calculate the derivative of `f(x)^2`, not of `f` alone. The implementation of `f.backward` does not know what the final function will be; however, in every case it reduces to the derivative of `g(f(x))`, where `g` is an unknown function defined at runtime (`x^2` in the case above, or possibly some complex composition of functions). By the chain rule, that derivative is `g'(f(x)) f'(x)`. The autograd module calculates `dy = g'(f(x))` and passes it to `f.backward`, and the implementation of `backward` returns `g'(f(x)) f'(x) = dy f'(x)`. The implementation can compute this because it knows its own derivative `f'(x)` and is given `dy`.

How does the autograd module calculate `dy`? It is just a recursion. The implementation of `backward` of every operator, including `*` in the case above, takes a `dy`. So, for example, let `z(x) = f(g(h(x)))`. Then the autograd module does the following when `z.backward()` is called:

```python
dy = f.backward(1)
dy = g.backward(dy)
dy = h.backward(dy)
```

I’m not sure that I explained it well. Anyway, the point is that, for any function `f`, what `f.backward` calculates is the derivative of `g(f(x))`, not of `f(x)` itself. Then, whatever `g` is, the result is `g'(f(x)) f'(x)`, and `g'(f(x))` is calculated by a recursive application of the same rule. The essence is the same for multivariable functions, but we need vectors and matrices instead of just numbers.
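To make the recursion concrete for the `z = y * y` example: when `z.backward()` is called, the `backward` of `*` runs first with a head gradient of ones and produces `2 * y`; that `2 * y` is the `dy` that `sigmoid.backward` receives. A sketch of the same computation done by hand (assuming the corrected class above):

```python
import mxnet as mx

x = mx.nd.array([1, 2, 3])
y = 1 / (1 + mx.nd.exp(-x))   # forward pass of sigmoid
# backward of the squaring step, with a head gradient of ones:
dy = 2 * y                    # this is the `dy` handed to sigmoid.backward
dx = dy * y * (1 - y)         # what the corrected backward returns
print(dx)  # matches x.grad from the autograd example above
```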