WGAN-GP: can't compute gradient penalty with Gluon?

Dear all,

I’m working on an implementation of WGAN with gradient penalty (WGAN-GP), but I get an error during the backward step:

MXNetError: Operator _backward_Convolution is non-differentiable because it didn't register FGradient attribute.

I guess it goes wrong because of the gradient-penalty term \left( \lVert \nabla_{x_m} net_c(x_m) \rVert_2 - 1 \right)^2 in the loss function: backpropagating through it requires a second-order derivative of the critic. If I remove this term from the loss function, the training loop works again. Did I do something wrong in the Gluon implementation? What is the proper way to compute a loss function that involves a second-order derivative?
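
For reference, the critic loss from the WGAN-GP paper (Gulrajani et al., 2017) that I am ultimately after is written below; the minimal example that follows only keeps enough of it to trigger the error (a fixed \epsilon = 0.5, and the first term uses net_c(x_m) instead of net_c(x_f)), with \lambda corresponding to clambda in the code:

L_{critic} = \mathbb{E}\left[ net_c(x_f) \right] - \mathbb{E}\left[ net_c(x_r) \right] + \lambda \, \mathbb{E}\left[ \left( \lVert \nabla_{x_m} net_c(x_m) \rVert_2 - 1 \right)^2 \right], \qquad x_m = \epsilon x_r + (1 - \epsilon) x_f, \quad \epsilon \sim U[0, 1]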

I made a minimal example to reproduce the error:

import mxnet as mx
from mxnet import nd, gluon, autograd
from mxnet.gluon import nn

# Define and initialize a dummy critic network.
net = nn.HybridSequential()
net.add(
    nn.Conv2D(in_channels=1, channels=64, kernel_size=4, strides=2, activation="relu"),
    nn.Conv2D(in_channels=64, channels=128, kernel_size=4, strides=2, activation="relu"),
    nn.Conv2D(in_channels=128, channels=1, kernel_size=4, strides=2)
)
net.initialize()

trainer = gluon.Trainer(net.collect_params(), "adam", {"learning_rate": 0.00002})

batch_size = 8
clambda = 10  # gradient-penalty coefficient (lambda in the loss)

# Do one training step
with autograd.record():
    xr = nd.random.randn(batch_size, 1, 28, 28)  # dummy "real" batch
    xf = nd.random.randn(batch_size, 1, 28, 28)  # dummy "fake" batch
    epsilon = nd.ones(shape=(batch_size, 1, 1, 1)) * 0.5  # fixed interpolation factor (U[0, 1] in full WGAN-GP)
    xm = epsilon * xr + (1 - epsilon) * xf  # interpolated samples
    xm.attach_grad()  # we need the gradient of the critic w.r.t. xm for the penalty
    yr = net(xr)
    ym = net(xm)
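    # First-order gradient of the critic output w.r.t. xm; create_graph=True keeps it in the
    # autograd graph so that the penalty term can itself be backpropagated through.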
    grad_ym = mx.autograd.grad(heads=ym, variables=[xm], retain_graph=True, create_graph=True)[0]
    grad_ym = grad_ym.reshape(batch_size, -1)
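    # Critic loss with the gradient-penalty term that seems to cause the error.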
    loss = nd.mean(ym) - nd.mean(yr) + clambda * nd.mean((nd.norm(grad_ym, axis=1) - 1) ** 2)
    print("loss: ", loss)
loss.backward()  # the MXNetError above is raised here
trainer.step(batch_size)