Automatic Differentiation

https://d2l.ai/chapter_preliminaries/autograd.html

I can’t understand the meaning of the head gradient.
Why is it given as nd.array([10, 1., .1, .01])?
What do I get in x.grad if I don’t pass a head gradient to the backward function? Isn’t it dz/dx?

Hi @vermicelli,

I think backward should be applied to y here, not z; that would make more sense to me.

And then the example should show a case where you could calculate dz/dy manually (possibly even not using mxnet), and still be able to use autograd for dy/dx to calculate dz/dx which is stored in x.grad as you pointed out.

Something like this example:

import mxnet as mx

x = mx.nd.array([0.,1.,2.,3.])
x.attach_grad()

with mx.autograd.record():
    y = x * 2

# dy/dz calculated outside of autograd
dydz = mx.nd.array([10, 1., .1, .01])
y.backward(dydz)
# x.grad now holds dz/dx, even though part of the chain was computed outside of autograd
x.grad
[20.    2.    0.2   0.02]
<NDArray 4 @cpu(0)>

@mli @smolix please confirm? Quite a complex example for an intro. Are there many use cases of this you’ve seen in the wild?

Thank you for your reply. This makes sense to me. But I think the ‘dy/dz’ in the comment # dy/dz calculated outside of autograd should be ‘dz/dy’. My understanding of your example is that you let MXNet run autograd on dy/dx, which should be 2, and tell autograd that you have already computed the dz/dy part manually, which is [10, 1., .1, .01]. Autograd then stores dz/dy * dy/dx in x.grad as the final result. Am I right?

So the “head gradient” here just means the gradient of the part of the calculation chain that is not recorded by autograd.

@vermicelli, I think you are correct. The last example here implies that head_gradient is calculated outside of autograd. I think the example implies that this head_gradient is actually the gradient of some other function w(z) that is missing, so head_gradient is actually dw/dz. I would put that into a comment block in the code to clarify this piece a bit more. Other than that, I think your understanding is correct.
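
To make the chain rule in this exchange concrete, here is a minimal sketch in plain Python (no MXNet), assuming the same values as the example above. The names head_grad, local_grad and x_grad are illustrative only and are not MXNet API.

```python
# Plain-Python sketch of what y.backward(head_grad) computes for y = 2 * x.
# head_grad plays the role of dz/dy, supplied from outside the recorded graph.
x = [0.0, 1.0, 2.0, 3.0]
head_grad = [10.0, 1.0, 0.1, 0.01]  # dz/dy, computed by hand

# dy/dx for y = 2 * x is 2 for every element.
local_grad = [2.0 for _ in x]

# Chain rule, element by element: dz/dx = dz/dy * dy/dx.
x_grad = [h * g for h, g in zip(head_grad, local_grad)]
print(x_grad)  # [20.0, 2.0, 0.2, 0.02]
```

With a head gradient of all ones, the same product reduces to dy/dx itself, [2, 2, 2, 2]. As far as I know, that is what backward() uses when no head gradient is passed.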

Hi,

Thank you for this book. This is my first step in DL, and even though I manage to understand, some exercises are really interesting but out of my reach. The 2nd bidder is one of them, even though I’d love to know the answer. Are you providing the solution somewhere, please?

@thomelane @vermicelli I rewrote this chapter, hope it’s clearer now. Preview at http://en.d2l.ai.s3-website-us-west-2.amazonaws.com/chapter_crashcourse/autograd.html

Hi @mli,
In the section Head Gradients, why do we calculate x.grad and y.grad after z.backward()? I guess it should be v.grad. This v.grad will then be passed as input when we compute du/dx. Correct me if I am wrong.

Hi @mli, I am quite confused by the example in the section Head Gradients. The function is z = x * y + x, so dz / dx = y + 1. The text says x.grad should be dz / dx, and the vector y = [2, 2, 2, 2], so x.grad should be [3, 3, 3, 3] rather than [2, 2, 2, 2]. I am not sure whether I have misunderstood this part.

Hi @mli,

In the section Detach Computation, in the sentence "The following backward computes \partial u^2 x/\partial x with u=x instead of \partial x^3/\partial x.", should it be u\partial x/\partial x with u=x^2 rather than \partial u^2 x/\partial x?

Thanks

I have the same question. I think it is just a typo.
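
For anyone following along, here is a small plain-Python sketch of the difference being discussed. It assumes the chapter's code computes u = x * x, detaches it, and then differentiates z = u * x. The function names are made up for illustration.

```python
# Compare d(u * x)/dx with u held constant against d(x**3)/dx.
def grad_detached(x):
    u = x ** 2   # the value of u is computed from x ...
    return u     # ... but u is treated as a constant, so d(u * x)/dx = u

def grad_full(x):
    return 3 * x ** 2  # d(x**3)/dx when nothing is detached

xs = [0.0, 1.0, 2.0, 3.0]
detached = [grad_detached(x) for x in xs]
full = [grad_full(x) for x in xs]
print(detached)  # [0.0, 1.0, 4.0, 9.0]
print(full)      # [0.0, 3.0, 12.0, 27.0]
```

The detached result equals u = x^2 at each point, which is consistent with reading the intended expression as the derivative of u x with u = x^2.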

With the following example I would expect x.grad to be [10, 24, 42, 64], but using head gradients as per the documentation gives me [5, 12, 21, 32]:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
v = u.detach()
v.attach_grad()
z = v * x
ag.set_recording(False)
z.backward()
u.backward(v.grad)
print(x.grad, y.grad)

But when I do it without head gradients, as follows, I get the correct gradients:

from mxnet import ndarray as nd
from mxnet import autograd as ag
x = nd.array([1,2,3,4])
x.attach_grad()
y = nd.array([5,6,7,8])
y.attach_grad()

ag.set_recording(True)
u = x * y
z = u * x
ag.set_recording(False)
z.backward()
print(x.grad, y.grad)

Could someone please clarify? I would expect the first and second code snippets to behave the same way, but they do not. Am I missing something here?
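
A possible explanation, assuming attach_grad() is used with its default grad_req='write' so that each backward call overwrites x.grad instead of adding to it: x reaches z along two paths, and each backward call in the first snippet covers only one of them. The plain-Python bookkeeping below (variable names are illustrative) shows the two contributions and their sum.

```python
# Plain-Python bookkeeping for z = (x * y) * x, split at the detached value v.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 6.0, 7.0, 8.0]
v = [a * b for a, b in zip(x, y)]  # v = u.detach(), numerically x * y

# z.backward() on z = v * x, with v treated as a constant:
direct_path = list(v)  # dz/dx along the direct path is v
v_grad = list(x)       # dz/dv is x

# u.backward(v_grad) on u = x * y:
through_u = [g * b for g, b in zip(v_grad, y)]  # dz/dv * du/dx = x * y

# Each backward call alone yields one path; the full gradient is their sum.
total = [d + t for d, t in zip(direct_path, through_u)]
print(direct_path)  # [5.0, 12.0, 21.0, 32.0]
print(through_u)    # [5.0, 12.0, 21.0, 32.0]
print(total)        # [10.0, 24.0, 42.0, 64.0]
```

If that assumption holds, attaching the gradient with x.attach_grad(grad_req='add') should make the two backward calls accumulate into x.grad. This has not been verified here.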

I guess you are right! It should be \partial u x/\partial x.

There’s a mistake in this section. In the first code segment (below), we’re computing the partial derivative \frac{\partial z}{\partial u} = 1, not the total derivative \frac{dz}{du} = 1 + \frac{dx}{du} = 1 + \frac{1}{y}:

y = np.ones(4) * 2
y.attach_grad()
with autograd.record():
    u = x * y
    v = u.detach()  # u still keeps the computation graph
    v.attach_grad()
    z = v + x
z.backward()
print(x.grad, '\n', y.grad)

So in the second segment (below), we’re computing the product \frac{du}{dx} . \frac{\partial z}{\partial u} = y * 1 = y = [2, 2, 2, 2]:

u.backward(v.grad)
print(x.grad, '\n', y.grad)

Whereas by the chain rule, \frac{dz}{dx} = \frac{\partial z}{\partial u} . \frac{du}{dx} + \frac{\partial z}{\partial x} . \frac{dx}{dx} = 1 * y + 1 = y + 1

which is the same as applying the chain rule over total derivatives:

\frac{dz}{dx} = \frac{dz}{du} . \frac{du}{dx} = (1 + \frac{1}{y}) . y = y + 1

which is the same as directly differentiating:

\frac{dz}{dx} = \frac{d(xy + x)}{dx} = y + 1

and whose value is [3, 3, 3, 3], not [2, 2, 2, 2].

Did anyone do question 4 from the assignment? How do you write final price as a function of the highest price?

Assume f(x) = sin(x). Plot f(x) and df(x)/dx on a graph.
How do I do this?
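
One way to get the two curves without using the fact that the derivative of sin is cos is sketched below. Note that this swaps in a central finite difference in plain Python as a stand-in for autograd, so it approximates the derivative rather than computing it by automatic differentiation. The helper name central_difference and the grid of 100 points are arbitrary choices.

```python
import math

def central_difference(f, x, h=1e-5):
    """Approximate df/dx at x with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Evenly spaced sample points on [-pi, pi].
n = 100
xs = [-math.pi + 2.0 * math.pi * i / (n - 1) for i in range(n)]

f_vals = [math.sin(x) for x in xs]                        # f(x)
df_vals = [central_difference(math.sin, x) for x in xs]   # approx. df(x)/dx

# Largest deviation from cos(x), used here only as a sanity check.
max_err = max(abs(d - math.cos(x)) for d, x in zip(df_vals, xs))
print(max_err)
```

The lists xs, f_vals and df_vals can then be handed to any plotting library, for example two calls to matplotlib's plt.plot, one per curve.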

Hi @mli, I have a (hopefully simple) question regarding the functionality of the backwards gradient with intermediate variables.

from mxnet import autograd, np, npx
npx.set_np()
x = np.arange(3)
x.attach_grad()
with autograd.record():
    z = 2 * x
    z.attach_grad() # If this is commented out, x.grad won't be zero
    y = np.dot(z, z)
y.backward()
print(f"x=>{x.grad} z=>{z.grad}") # This prints x=>[0. 0. 0.] z=>[ 0.  8. 16.]

So, why is x.grad zero when z.attach_grad() is included, while z.grad contains the values that x.grad is supposed to have? If attach_grad() can’t be used in a record context, how can I obtain the gradients with respect to both x and z in this simple case?

Thank you very much

Hi @gpolo, before answering your questions: the output of the final line in your code should be
x=>[0. 0. 0.] z=>[0. 4. 8.].

Why is x.grad zero when z.attach_grad() is included, while z.grad contains the values that x.grad is supposed to have?
When we call z.attach_grad(), it implicitly calls z = z.detach(). So the algorithm treats z = [0, 2, 4] as plain values rather than as the function z = 2x of x.

If attach_grad() can’t be used in a record context, how can I obtain the gradients with respect to x and z in this simple case?
You can add a new variable w = z.copy() as shown below:

from mxnet import autograd, np, npx
npx.set_np()
x = np.arange(3)
x.attach_grad()
with autograd.record():
    z = 2 * x
    w = z.copy()
    z.attach_grad() # If this is commented out, x.grad won't be zero
    y = np.dot(z, z)
y.backward()
w.backward()
print(f"x=>{x.grad} z=>{z.grad}")

The answer will be: x=>[2. 2. 2.] z=>[0. 4. 8.].

Hi @gold_piggy!

Thank you very much for your response! The implicit detach() call was the key to understanding the zeros in x.grad.
But what I wanted to obtain was \frac{\partial y}{\partial \vec x} and \frac{\partial y}{\partial \vec z}, not \frac{\partial \vec z}{\partial \vec x}, which is what the w.backward() call gives.
So the question is how to obtain both: the gradient with respect to the intermediate variable (z) and with respect to the inputs (x).
In this case \frac{\partial y}{\partial \vec z} = 2 \vec z = 4 \vec x and \frac{\partial y}{\partial \vec x} = (\frac{\partial y}{\partial \vec z})^T \cdot \frac{\partial \vec z}{\partial \vec x} = 8 \vec x. Is this possible?

Thank you very much again
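
The two quantities asked for here can be checked by hand. The plain-Python sketch below (no MXNet; all names are illustrative) computes dy/dz first, pushes it back to x with the chain rule, and compares the result with a finite-difference estimate.

```python
# y = z . z with z = 2 * x: compute dy/dz first, then push it back to x.
x = [0.0, 1.0, 2.0]
z = [2.0 * a for a in x]

dy_dz = [2.0 * b for b in z]        # dy/dz = 2z = 4x
dz_dx = 2.0                         # each z_i depends only on x_i, with slope 2
dy_dx = [g * dz_dx for g in dy_dz]  # dy/dx = dy/dz * dz/dx = 8x

def y_of_x(xv):
    """y as a function of x alone, used only for the numerical check."""
    return sum((2.0 * a) ** 2 for a in xv)

# Central finite differences as an independent check of dy/dx.
h = 1e-6
numeric = []
for i in range(len(x)):
    up, down = list(x), list(x)
    up[i] += h
    down[i] -= h
    numeric.append((y_of_x(up) - y_of_x(down)) / (2.0 * h))

print(dy_dz)  # [0.0, 4.0, 8.0]
print(dy_dx)  # [0.0, 8.0, 16.0]
print(numeric)
```

In MXNet, this two-step pattern would presumably correspond to keeping the undetached 2 * x, detaching a copy as z, calling y.backward() to fill z.grad, and then calling backward on the undetached array with z.grad as the head gradient. That mapping is an assumption and is not tested here.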

Hi,
First of all, thanks for this very helpful source.

For the simple example, when I change
with autograd.record():
    y = 2 * np.dot(x, x)
to
with autograd.record():
    y = 2 * np.dot(x, x) + 2 * x
I expect x.grad to be [2., 6., 10, 16.], but the result is [ 2., 18., 34., 50.].
When I change it to
with autograd.record():
    y = 2 * x * x + 2 * x
I get x.grad as I expected.
I would be very pleased if you could explain why.
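
A likely cause is broadcasting. np.dot(x, x) is a scalar, so 2 * np.dot(x, x) + 2 * x is a vector in which the scalar term appears once per element. The sketch below assumes x = [0, 1, 2, 3] and assumes that calling backward on a vector y is equivalent to differentiating the sum of its entries. It is plain Python with illustrative names, and it checks both closed forms against finite differences.

```python
x = [0.0, 1.0, 2.0, 3.0]
n = len(x)

def summed_y_broadcast(xv):
    """sum_i (2 * dot(x, x) + 2 * x_i): the scalar term is repeated n times."""
    d = sum(a * a for a in xv)
    return sum(2.0 * d + 2.0 * a for a in xv)

def summed_y_elementwise(xv):
    """sum_i (2 * x_i * x_i + 2 * x_i): each entry depends on x_i only."""
    return sum(2.0 * a * a + 2.0 * a for a in xv)

def numeric_grad(f, xv, h=1e-6):
    """Central finite-difference gradient of a scalar function f."""
    grads = []
    for i in range(len(xv)):
        up, down = list(xv), list(xv)
        up[i] += h
        down[i] -= h
        grads.append((f(up) - f(down)) / (2.0 * h))
    return grads

# Closed forms: the broadcast version picks up a factor n on the dot term.
grad_broadcast = [4.0 * n * xj + 2.0 for xj in x]  # [2.0, 18.0, 34.0, 50.0]
grad_elementwise = [4.0 * xj + 2.0 for xj in x]    # [2.0, 6.0, 10.0, 14.0]

print(grad_broadcast, numeric_grad(summed_y_broadcast, x))
print(grad_elementwise, numeric_grad(summed_y_elementwise, x))
```

Under these assumptions the broadcast version yields 16x + 2, which matches the reported [2, 18, 34, 50], while the elementwise version yields 4x + 2.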