Gradients for Embedding layers in Gluon

irina_nicolae · September 20, 2018, 6:14pm

I am working with Gluon models from users (i.e., I’m not defining the architecture myself) that potentially contain word embeddings. I need to compute the gradient of the loss function w.r.t. the inputs. Now, I know that the inputs to an embedding are discrete indices, and thus non-differentiable. Is it possible to compute the gradients w.r.t. the embedding vectors themselves (i.e., the outputs of the Embedding layer)? Any help is much appreciated.

safrooze · September 20, 2018, 9:37pm

Here is an example:

from mxnet import autograd, gluon, nd

net = gluon.nn.Embedding(1000, 50)
net.initialize()

x = nd.cast(nd.clip(nd.random_uniform(0, 1000, shape=(100,)), 0, 999), 'int32')
with autograd.record():
    emb = net(x)  # emb.shape=(100, 50)
    out = nd.mean(emb)  # replace nd.mean with some loss calculation
out.backward()  # backpropagate from the loss rather than from emb
emb_grad = net.weight.grad()  # shape (1000, 50): gradient w.r.t. the embedding weights
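
For intuition about what net.weight.grad() holds, here is a small NumPy sketch of the arithmetic behind an embedding lookup. It is not MXNet code: embedding_forward and embedding_backward are hypothetical helper names, the sizes are made up, and it assumes the layer is a plain row lookup into the weight matrix with a mean taken as the loss. The gradient w.r.t. the embedding outputs is simply the upstream gradient (grad_emb below), and the weight gradient is those rows scattered back by index.

```python
import numpy as np

def embedding_forward(weight, idx):
    # Row lookup: output[i] = weight[idx[i]]
    return weight[idx]

def embedding_backward(grad_out, idx, vocab_size):
    # Scatter-add each output-row gradient into the weight row it was read from.
    grad_w = np.zeros((vocab_size, grad_out.shape[1]), dtype=grad_out.dtype)
    np.add.at(grad_w, idx, grad_out)  # unbuffered add, so repeated indices accumulate
    return grad_w

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                      # illustrative sizes only
weight = rng.normal(size=(vocab_size, dim))
idx = np.array([1, 3, 3, 7, 1, 1])           # batch of 6 token indices

emb = embedding_forward(weight, idx)         # shape (6, 4)
# With loss = mean(emb), d(loss)/d(emb) is the constant 1 / emb.size.
grad_emb = np.full_like(emb, 1.0 / emb.size)             # gradient w.r.t. the embedding outputs
grad_w = embedding_backward(grad_emb, idx, vocab_size)   # shape (10, 4), batch axis summed away
```

Rows of grad_w that were never looked up stay zero, and a row that was looked up several times receives the sum of the corresponding output gradients.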

irina_nicolae · September 21, 2018, 11:58am

@safrooze, thank you for your example! It works well! I do have a follow-up question. With your solution, the shape of the obtained gradient matches that of the embedding, i.e., (vocab_size, embedding_dims), but the batch size of the input disappears (I suppose gradients are summed over inputs in this case). Is there a way to get these same gradients w.r.t. each input sample, resulting in something of shape (batch_size, vocab_size, embedding_dims)? Thanks again!

safrooze · September 21, 2018, 5:12pm

The memory required for collecting the gradients of each parameter is allocated during network initialization, and its size is equal to the size of the parameter. If you want to get separate gradients for different outputs, you’d have to do multiple backward calls and copy the gradients after each one. Here is the updated example:

from mxnet import autograd, gluon, nd

net = gluon.nn.Embedding(1000, 50)
net.initialize()

x = nd.cast(nd.clip(nd.random_uniform(0, 1000, shape=(100,)), 0, 999), 'int32')
with autograd.record():
    emb = net(x)  # emb.shape=(100, 50)
    out = nd.split(emb, 100, axis=0)  # one (1, 50) slice per input sample
grads = list()
for o in out:
    o.backward(retain_graph=True)  # overwrites net.weight.grad() on every call
    grads.append(net.weight.grad().copy())  # copy, otherwise every entry aliases the same buffer
grads = nd.stack(*grads, axis=0)  # shape (100, 1000, 50)

Please note that with the above code, each backward() call replaces the gradients of the previous backward() call. If you intend to accumulate gradients in net.weight’s gradient NDArray to be used with an optimizer, you’d need to set net.weight.grad_req='add'. Keep in mind that in that mode the gradients are summed each time you call backward, so you’d have to subtract the previous gradient value from the current one to get the value of the current backward() pass.
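
The subtraction described above can be sketched in NumPy. This is only an emulation under stated assumptions: accum stands in for an accumulating gradient buffer, the "backward pass" is modelled as adding one sample's output gradient into its weight row, and idx, grad_out, and the sizes are hypothetical values chosen for illustration.

```python
import numpy as np

vocab_size, dim = 10, 4                 # illustrative sizes only
idx = np.array([1, 3, 3, 7])            # one token index per sample
# Hypothetical per-sample output gradients, one row per input index.
grad_out = np.arange(len(idx) * dim, dtype=np.float64).reshape(len(idx), dim)

accum = np.zeros((vocab_size, dim))     # stand-in for the accumulating gradient buffer
prev = accum.copy()                     # snapshot of the buffer before the first pass
per_sample = []
for i, row in zip(idx, grad_out):
    accum[i] += row                     # emulated backward pass: adds into the buffer
    per_sample.append(accum - prev)     # current minus previous = this pass alone
    prev = accum.copy()                 # take a copy, not a reference to the live buffer

per_sample = np.stack(per_sample)       # shape (4, 10, 4): (batch, vocab, dim)
```

The buffer ends up holding the summed gradient an optimizer would use, while per_sample keeps each pass separately. Snapshotting with a copy matters for the same reason as in the loop above: a bare reference would change along with the buffer.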
