Adding network gradient to the computational graph

I have a loss function that depends on the gradient of a neural network w.r.t. the network inputs (not the network parameters). However, I’m having trouble backpropagating the loss function’s parameter-gradients because mxnet doesn’t seem to think the input-gradient is part of the computation graph. Can someone help me debug? Here’s a MWE:

import mxnet as mx
from mxnet import nd, gluon
from mxnet.gluon import nn

# Data
x = nd.array([0])
dydx = nd.array([1])

# Network
net = nn.Sequential()
with net.name_scope():
    net.add(nn.Dense(1))
net.collect_params().initialize(mx.init.Constant(1))

# Loss
l2_loss = gluon.loss.L2Loss()

x.attach_grad()
with mx.autograd.record():
    y = net(x)
    dydx_ = mx.autograd.grad(y, [x], retain_graph=True)[0]
    loss = l2_loss(dydx, dydx_)
loss.backward()

I get this error:

---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-11-de03bb6615f7> in <module>()
     17     dydx_ = mx.autograd.grad(y, [x], retain_graph=True)[0]
     18     loss = l2_loss(dydx, dydx_)
---> 19 loss.backward()

MXNetError: [09:21:50] src/imperative/imperative.cc:373:
Check failed: !AGInfo::IsNone(*i) 
Cannot differentiate node because it is not in a computational graph. 
You need to set is_recording to true or use autograd.record()
to save computational graphs for backward. 
If you want to differentiate the same graph twice,
you need to pass retain_graph=True to backward.

My question is: Why is dydx_ not considered part of the computational graph? It’s the derivative of net(x) w.r.t. x and hence depends on the network weights and biases. Shouldn’t it be extending the graph, or am I misunderstanding?
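
For concreteness, with the single Dense(1) unit in the MWE above (so y = w*x + b, if I have the initialization right), the chain rule I have in mind is:

dy/dx    = w
loss     = 1/2 * (dy/dx - dydx)^2
dloss/dw = (dy/dx - dydx) * d(dy/dx)/dw = (dy/dx - dydx) * d2y/(dw dx)

i.e., backpropagating this loss into w has to go through d2y/(dw dx), which is why I expected dydx_ to be part of the graph.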

Your code looks right to me. The problem, however, is that the operators you’re using (e.g., Dense) do not have their derivatives registered as compute graphs in MXNet.* Right now, very few operators in MXNet have that registered, and consequently the ability to compute higher-order gradients in MXNet is extremely limited.

There is some effort underway to register the derivatives of more common operations so that you can do this, but it has been a bit stagnant. I too would like to see this feature better supported in MXNet, since it’s the one thing other libraries do better, IMO.

Hopefully more will follow. Refer to this issue for more info:
https://issues.apache.org/jira/browse/MXNET-978

Here is a WIP PR for adding some more support (but not enough for common networks).

*Note that registering the derivative in the compute graph goes beyond defining the derivative computation that regular, single-derivative autograd requires. Registering the derivative of an operation in the compute graph is how autograd knows how to take derivatives of the derivative.
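
For reference, here’s roughly what the pattern looks like once an operator’s backward is itself registered as a differentiable graph. This is only a sketch: it assumes a recent MXNet build where an op like sin has a higher-order gradient registered (Dense/FullyConnected doesn’t, as far as I know), and note the create_graph=True, which (if I’m reading the autograd API right) is what tells autograd to record the backward pass itself:

import mxnet as mx
from mxnet import nd

x = nd.array([1.0, 2.0])
x.attach_grad()

with mx.autograd.record():
    y = nd.sin(x)                                   # sin's backward is itself differentiable here
    # create_graph=True records the backward computation so dydx stays in the graph
    dydx = mx.autograd.grad(y, [x], create_graph=True, retain_graph=True)[0]
    loss = ((dydx - 1.0) ** 2).sum()

loss.backward()        # differentiates through dydx, i.e. needs d2y/dx2
print(x.grad)          # 2 * (cos(x) - 1) * (-sin(x))

With Dense in the graph you’d hit the same error as in your MWE, because its backward has no gradient of its own registered.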

I see, thanks for the quick response. In poking around I actually saw MXNET-978 but was hopeful here because I’m computing two first-order partial derivatives: the first is w.r.t. the network input, the second is w.r.t. the network parameters. But to your point, it’s still a second-order derivative that requires the first to be registered in the compute graph.

I also agree that other libraries support this feature better. I was actually able to implement this type of problem in TensorFlow, but found its gradient computations painfully slow compared to MXNet’s, which is why I’m considering making the switch. (For reference, on my local CPU I found mx.autograd.grad to be about 35-100x faster than tf.gradients.) Hopefully someone will have time to work on this issue soon.
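
For what it’s worth, the TF 1.x pattern I used was along these lines (a rough sketch from memory, not my exact code; the layer and placeholder names are just illustrative):

import tensorflow as tf  # TF 1.x graph mode

x = tf.placeholder(tf.float32, shape=[None, 1])
dydx_target = tf.placeholder(tf.float32, shape=[None, 1])

y = tf.layers.dense(x, 1)                      # analogous to the Dense(1) above
dydx = tf.gradients(y, x)[0]                   # derivative of the output w.r.t. the input
loss = tf.reduce_mean(0.5 * tf.square(dydx - dydx_target))

# tf.gradients can differentiate through the first gradient, so this works end to end
param_grads = tf.gradients(loss, tf.trainable_variables())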

@exv - here’s the challenge: second derivatives tend to be much higher dimensional than first derivatives (for a scalar loss and n parameters, the full Hessian is n x n, versus n for the gradient), which is why they are hard to support in general. Stay tuned. @piiswrong might have some more ideas about how to do this efficiently.