Automatic Differentiation

mli · November 27, 2018, 6:47pm

https://d2l.ai/chapter_preliminaries/autograd.html

vermicelli · January 11, 2019, 5:39pm

I can’t understand the meaning of the head gradient.
Why pass it nd.array([10, 1., .1, .01])?
What do I get in x.grad if I don’t pass the head gradient to the backward function? Isn’t it dz/dx?

thomelane · January 11, 2019, 10:24pm

Hi @vermicelli,

I think backward should be applied to y here not z, that would make more sense to me.

And then the example should show a case where you could calculate dz/dy manually (possibly even not using mxnet), and still be able to use autograd for dy/dx to calculate dz/dx which is stored in x.grad as you pointed out.

Something like this example:

import mxnet as mx

x = mx.nd.array([0.,1.,2.,3.])
x.attach_grad()

with mx.autograd.record():
    y = x * 2

# dy/dz calculated outside of autograd
dydz = mx.nd.array([10, 1., .1, .01])
y.backward(dydz)
# thus calculating dz/dx, even though dz/dy was computed outside of autograd
x.grad

[20.    2.    0.2   0.02]
<NDArray 4 @cpu(0)>

@mli @smolix please confirm? Quite a complex example for an intro. Are there many use cases of this you’ve seen in the wild?

vermicelli · January 12, 2019, 8:43am

Thank you for your reply. This makes sense to me. But I think the ‘dy/dz’ in the comment # dy/dz calculated outside of autograd should be ‘dz/dy’. My understanding of your example is that you let MXNet run autograd on dy/dx, which should be 2, and tell autograd that you already have the dz/dy part, computed manually, which is [10, 1., .1, .01]. Then autograd stores dz/dy * dy/dx in x.grad as the final result. Am I right?

So the “head gradient” here just means the gradient of the part of the calculation chain that doesn’t get recorded by autograd.
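
A hand-computed sketch of this reading, in plain Python rather than MXNet (so nothing here is the library's API). It assumes y = 2 * x as in thomelane's example and treats the head gradient as dz/dy supplied from outside; the names head_gradient and x_grad are only illustrative.

```python
# Hand-computed check of the head-gradient example (plain Python, no MXNet).
x = [0.0, 1.0, 2.0, 3.0]

# Local derivative that autograd would record: y = 2 * x, so dy/dx = 2 elementwise.
dy_dx = [2.0 for _ in x]

# Head gradient: dz/dy, computed outside of autograd.
head_gradient = [10.0, 1.0, 0.1, 0.01]

# Chain rule: dz/dx = dz/dy * dy/dx, which is what ends up in x.grad.
x_grad = [g * d for g, d in zip(head_gradient, dy_dx)]
print(x_grad)
```

The printed values match the [20. 2. 0.2 0.02] reported in the example above.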

ddavydenko · February 5, 2019, 5:35am

@vermicelli, I think you are correct. The last example here implies that head_gradient is calculated outside of autograd. I think the example implies that this head_gradient is actually the gradient of some other function w(z) that is missing, i.e. head_gradient is actually dw/dz. I would put that into a comment block in the code just to clarify this piece a bit more. Other than that I think your understanding is correct.

SebastienCoste · May 15, 2019, 8:53pm

Hi,

Thank you for this book. This is my first step in DL, and even though I manage to understand, some exercises are really interesting but out of my reach. The 2nd bidder is one of them, even though I’d love to know the answer. Are you providing the solution somewhere, please?

mli · June 1, 2019, 12:12am

@thomelane @vermicelli I rewrote this chapter, hope it’s clearer now. Preview at http://en.d2l.ai.s3-website-us-west-2.amazonaws.com/chapter_crashcourse/autograd.html

Dhananjay · June 10, 2019, 8:28pm

Hi @mli,
In the section Head Gradients, why do we calculate x.grad and y.grad after z.backward()? I guess it should be v.grad. This v.grad will then be passed as input when we find du/dx. Correct me if I am wrong.

hellkonig · June 28, 2019, 2:56pm

Hi @mli, I am quite confused about the example in the section Head Gradients. The function is z = x * y + x, so dz/dx = y + 1. The text says x.grad should be dz/dx, and with the vector y = [2, 2, 2, 2], x.grad should be [3, 3, 3, 3] rather than [2, 2, 2, 2]. I am not sure if I misunderstand this part.

cloudliu01 · June 30, 2019, 7:45pm

Hi @mli,

In the section Detach Computation, in the sentence "The following backward computes \partial u^2 x/\partial x with u=x instead of \partial x^3/\partial x.", should it be u \partial x/\partial x with u=x^2 rather than \partial u^2 x/\partial x?

Thanks

mru4913 · July 2, 2019, 10:02pm

I have the same question. I think it is just a typo.

anirudhacharya · July 22, 2019, 11:07pm

With the following example I would expect x.grad to be [10, 24, 42, 64], but using head gradients as per the documentation gives me [5, 12, 21, 32]:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach()
v.attach_grad()
z = v * x
ag.set_recording(False)
z.backward()
u.backward(v.grad)
print(x.grad, y.grad)

But when I do it without using head gradients, as follows, I get the correct gradients:

from mxnet import ndarray as nd  # needed for nd.array below
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
z = u * x
ag.set_recording(False)
z.backward()
print(x.grad, y.grad)

Could someone please clarify? I would expect the first and second code snippets to behave the same, but they do not. Am I missing something here?
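
One possible way to see the difference between the two snippets is to write out the two gradient paths by hand. The sketch below is plain Python, not MXNet, and the names direct_path and through_u are illustrative. It assumes that z.backward() in the first snippet only sees the direct factor x (because v is detached) and that u.backward(v.grad) only carries the path through u.

```python
# Hand-computed gradient paths for z = (x * y) * x (plain Python, no MXNet).
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 6.0, 7.0, 8.0]

# v stands for u.detach(): the values of x * y, treated as constants.
v = [xi * yi for xi, yi in zip(x, y)]

# z = v * x with v constant: dz/dx = v and dz/dv = x.
direct_path = v[:]
v_grad = x[:]

# u = x * y: passing dz/dv as head gradient gives dz/dv * du/dx = x * y.
through_u = [g * yi for g, yi in zip(v_grad, y)]

# The full derivative of z = x * y * x is the sum of both paths: 2 * x * y.
total = [a + b for a, b in zip(direct_path, through_u)]
print(direct_path, through_u, total)
```

Each path alone gives [5, 12, 21, 32]; the expected [10, 24, 42, 64] is their sum. If, as I understand MXNet's default grad_req='write', each backward call overwrites x.grad instead of accumulating, the first snippet would print only one path, which would match the observed output.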

gold_piggy · July 24, 2019, 1:45am

I guess you are right! It should be \partial u x/\partial x.
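
A quick numerical check of this reading, as a plain-Python sketch (no MXNet). It assumes the detached example is z = u * x with u = x^2 held constant, as suggested above; central_diff is a hypothetical helper, not a library function.

```python
# Finite-difference comparison of the detached and non-detached derivatives.
def central_diff(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 3.0
u = x0 ** 2  # "detached": a fixed number, no longer a function of x

# With u constant, d(u * x)/dx = u = x0 ** 2.
grad_detached = central_diff(lambda x: u * x, x0)

# Without detaching, d(x ** 3)/dx = 3 * x0 ** 2.
grad_full = central_diff(lambda x: (x ** 2) * x, x0)

print(grad_detached, grad_full)
```

At x0 = 3 the detached derivative is close to 9 and the full derivative is close to 27.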

doos · August 21, 2019, 11:39am

There’s a mistake in this section. In the first code segment (below), we’re computing the partial derivative \frac{\partial z}{\partial u} = 1, not the total derivative \frac{dz}{du} = 1 + \frac{dx}{du} = 1 + \frac{1}{y}:

y = np.ones(4) * 2
y.attach_grad()
with autograd.record():
    u = x * y
    v = u.detach()  # u still keeps the computation graph
    v.attach_grad()
    z = v + x
z.backward()
print(x.grad, '\n', y.grad)

So in the second segment (below), we’re computing the product \frac{du}{dx} . \frac{\partial z}{\partial u} = y * 1 = y = [2, 2, 2, 2]:

u.backward(v.grad)
print(x.grad, '\n', y.grad)

Whereas by the chain rule, \frac{dz}{dx} = \frac{\partial z}{\partial u} . \frac{du}{dx} + \frac{\partial z}{\partial x} . \frac{dx}{dx} = 1 * y + 1 = y + 1

which is the same as applying the chain rule over total derivatives:

\frac{dz}{dx} = \frac{dz}{du} . \frac{du}{dx} = (1 + \frac{1}{y}) . y = y + 1

which is the same as directly differentiating:

\frac{dz}{dx} = \frac{d(xy + x)}{dx} = y + 1

and whose value is [3, 3, 3, 3], not [2, 2, 2, 2].
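
The derivations above can be checked numerically. This plain-Python sketch (no MXNet) uses a hypothetical central_diff helper and assumed values for x, which the quoted segment does not show; since z is linear in x, the result does not depend on them.

```python
# Numerical check: path through u only versus the full derivative of z = x * y + x.
def central_diff(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

y = [2.0, 2.0, 2.0, 2.0]
x = [0.0, 1.0, 2.0, 3.0]  # assumed values, not taken from the original segment

# Only the path through u = x * y: d(x * y)/dx = y.
through_u = [central_diff(lambda t, yi=yi: t * yi, xi) for xi, yi in zip(x, y)]

# The whole function z = x * y + x: dz/dx = y + 1.
full = [central_diff(lambda t, yi=yi: t * yi + t, xi) for xi, yi in zip(x, y)]

print([round(g, 6) for g in through_u])
print([round(g, 6) for g in full])
```

The first list is approximately y = [2, 2, 2, 2] and the second approximately y + 1 = [3, 3, 3, 3].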

Syed_Saad · August 29, 2019, 5:40pm

Did anyone do question 4 from the assignment? How do you write final price as a function of the highest price?

Fatma_Mahmoud · October 25, 2019, 5:33pm

Assume f(x) = sin(x). Plot f(x) and df(x)/dx on a graph.
How do I do this?
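
One way to approach this without hard-coding the derivative of f is automatic differentiation. The sketch below is a tiny forward-mode version in plain Python, not MXNet's autograd; Dual and dual_sin are hypothetical names introduced for illustration. It only produces the lists of values to plot.

```python
import math

class Dual:
    """A value together with its derivative (forward-mode autodiff)."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

def dual_sin(d):
    # Chain rule for the sin primitive: (sin u)' = cos(u) * u'.
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

def f(x):
    # The function to differentiate; it never states its own derivative.
    return dual_sin(x)

# 101 evenly spaced points on [-pi, pi].
xs = [-math.pi + i * (2 * math.pi / 100) for i in range(101)]

# Seeding deriv = 1.0 means "differentiate with respect to x".
results = [f(Dual(x, 1.0)) for x in xs]
fx = [r.value for r in results]
dfx = [r.deriv for r in results]
```

The lists xs, fx and dfx can then be passed to any plotting library. With MXNet itself, the analogous steps would presumably be attaching a gradient to x, recording y = np.sin(x), calling y.backward() and plotting x.grad against x.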

gpolo · November 26, 2019, 11:45am

Hi @mli, I have a (hopefully simple) question regarding the functionality of the backwards gradient with intermediate variables.

from mxnet import autograd, np, npx
npx.set_np()
x = np.arange(3)
x.attach_grad()
with autograd.record():
    z = 2 * x
    z.attach_grad() # If this is commented out, x.grad won't be zero
    y = np.dot(z, z)
y.backward()
print(f"x=>{x.grad} z=>{z.grad}") # This prints x=>[0. 0. 0.] z=>[ 0.  8. 16.]

So why is x.grad zero when z.attach_grad() is included, while z.grad contains the values that x.grad is supposed to have? If attach_grad() can’t be used in a record context, how can I obtain the gradients with respect to x and z in this simple case?

Thank you very much

gold_piggy · November 26, 2019, 7:34pm

Hi @gpolo, before answering your questions: the output of the final line in your code should be:
x=>[0. 0. 0.] z=>[0. 4. 8.].

Why is x.grad zero when z.attach_grad() is included, while z.grad contains the values that x.grad is supposed to have?
When we call z.attach_grad(), it implicitly calls z = z.detach(). So the algorithm will treat z = [0, 2, 4] as values rather than z = 2x as a function of x.

If attach_grad() can’t be used in a record context, how can I obtain the gradients with respect to x and z in this simple case?
You can add a new variable w = z.copy() as shown below:

from mxnet import autograd, np, npx
npx.set_np()
x = np.arange(3)
x.attach_grad()
with autograd.record():
    z = 2 * x
    w = z.copy()
    z.attach_grad() # If this is commented out, x.grad won't be zero
    y = np.dot(z, z)
y.backward()
w.backward()
print(f"x=>{x.grad} z=>{z.grad}")

The answer will be: x=>[2. 2. 2.] z=>[0. 4. 8.].

gpolo · November 27, 2019, 9:22am

Hi @gold_piggy!

Thank you very much for your response! The implicit detach() call was the key to understanding the zeros in x.grad.
But what I would like to obtain is \frac{\partial y}{\partial \vec x} and \frac{\partial y}{\partial \vec z}, not \frac{\partial \vec z}{\partial \vec x}, which is what the w.backward call gives.
So the question is how to obtain both: the gradient with respect to the intermediate (z) and with respect to the inputs (x).
In this case \frac{\partial y}{\partial \vec z} = 2 \vec z = 4 \vec x and \frac{\partial y}{\partial \vec x} = (\frac{\partial y}{\partial \vec z})^T \cdot\ \frac{\partial \vec z}{\partial \vec x} = 8 \vec x. Is this possible?

Thank you very much again
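
The values asked for can at least be checked by hand. The following is a plain-Python sketch of the chain rule for this case, not MXNet code; dy_dz, dz_dx and dy_dx are illustrative names. In MXNet, the detach-and-head-gradient pattern quoted earlier in the thread (v = u.detach(), then u.backward(v.grad)) looks like it would produce the same two results, though that is an assumption not tested here.

```python
# Hand-computed gradients for z = 2 * x, y = dot(z, z) (plain Python, no MXNet).
x = [0.0, 1.0, 2.0]
z = [2.0 * xi for xi in x]
y = sum(zi * zi for zi in z)

# Gradient with respect to the intermediate: dy/dz = 2 * z = 4 * x.
dy_dz = [2.0 * zi for zi in z]

# z_i depends only on x_i, so dz/dx is the constant 2 elementwise.
dz_dx = [2.0 for _ in x]

# Chain rule: dy/dx = dy/dz * dz/dx = 8 * x.
dy_dx = [g * d for g, d in zip(dy_dz, dz_dx)]
print(dy_dz, dy_dx)
```

For x = [0, 1, 2] this gives dy/dz = [0, 4, 8] and dy/dx = [0, 8, 16].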

nba · December 11, 2019, 7:43am

Hi,
First of all, thanks for this very helpful source.

For the simple example, when I change
with autograd.record():
    y = 2 * np.dot(x, x)
to
with autograd.record():
    y = 2 * np.dot(x, x) + 2 * x
I expect x.grad to be [2., 6., 10., 14.], but the result is [2., 18., 34., 50.].
When I change it to
with autograd.record():
    y = 2 * x * x + 2 * x
I get x.grad as I expected.
I would be very pleased if you could explain why.
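
A possible explanation, sketched in plain Python without MXNet: np.dot(x, x) is a scalar, so 2 * np.dot(x, x) + 2 * x adds that scalar to every element of 2 * x, and, assuming backward on a vector output differentiates the sum of its elements, the scalar term is counted once per element. The check below assumes x = [0, 1, 2, 3]; central_diff_grad and the two summed_* functions are hypothetical helpers.

```python
# Numerical gradients of the summed outputs (plain Python, no MXNet).
def central_diff_grad(f, x, h=1e-5):
    """Approximate the gradient of a scalar function f at the point x."""
    grad = []
    for i in range(len(x)):
        x_plus = x[:]
        x_minus = x[:]
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def summed_broadcast(x):
    # y = 2 * dot(x, x) + 2 * x: the scalar term is added to every element.
    scalar_term = 2 * dot(x, x)
    return sum(scalar_term + 2 * xi for xi in x)

def summed_elementwise(x):
    # y = 2 * x * x + 2 * x: every term is elementwise.
    return sum(2 * xi * xi + 2 * xi for xi in x)

x = [0.0, 1.0, 2.0, 3.0]
grad_broadcast = [round(g, 4) for g in central_diff_grad(summed_broadcast, x)]
grad_elementwise = [round(g, 4) for g in central_diff_grad(summed_elementwise, x)]
print(grad_broadcast)
print(grad_elementwise)
```

With four elements the broadcast version has gradient 4 * 4x + 2 = 16x + 2, that is [2, 18, 34, 50], while the elementwise version has 4x + 2, that is [2, 6, 10, 14].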
