Hi,

I am not sure whether what I am about to suggest will work 100%, but it's easy to give it a try. If I were you, I'd change the model to use gluon directly, from here.

Regarding your example, I think you are missing a line in your model definition:

```
for param in self.params:
    param.attach_grad(grad_req='add')
```

I would give it a try with the following modifications (based on the tutorial you followed).

Modification in your model:

```
class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)
        num_inputs = vocab_size
        num_outputs = vocab_size

        ########################
        # Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs, num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs, num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs, num_hidden), ctx=ctx) * .01

        ########################
        # Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden, num_hidden), ctx=ctx) * .01
        self.Whr = nd.random_normal(shape=(num_hidden, num_hidden), ctx=ctx) * .01
        self.Whh = nd.random_normal(shape=(num_hidden, num_hidden), ctx=ctx) * .01

        ########################
        # Bias vectors for the hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01

        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden, num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                       self.bz, self.br, self.bh, self.Why, self.by]

        # @@@@@@@@@@@ MODIFICATION HERE @@@@@@@@@@@
        for param in self.params:
            param.attach_grad(grad_req='add')  # This tells mxnet to accumulate (sum) the gradients
        # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g
            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)
```

Now, in the example the `SGD` update takes place on every iteration, but we need to manually add a "delay_rate" that performs the update only every N iterations, so that you have enough aggregated gradients. So I am modifying `SGD` in the example like this:

```
delay_rate = 4  # aggregate over 4 batch iterations before updating

# modified SGD that takes the average of the accumulated gradients
def SGD(params, lr, _delay_rate):
    for param in params:
        param[:] = param - (lr / _delay_rate) * param.grad
```

In the initial example, a delay_rate of 1 is effectively the default behaviour; if you aggregate gradients over, say, 4 iterations, you need to divide their sum by 4 to get the average.
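A quick sanity check of that arithmetic in plain Python (no mxnet needed, and the gradient values are made up for illustration): scaling the summed gradient by `lr / delay_rate` is the same as taking an ordinary SGD step with the average gradient.

```python
lr = 0.1
delay_rate = 4
grads = [0.5, -0.2, 0.3, 0.8]  # made-up gradients from 4 accumulated iterations

summed = sum(grads)                        # what grad_req='add' leaves in param.grad
step_from_sum = (lr / delay_rate) * summed # the modified SGD's step
step_from_avg = lr * (summed / delay_rate) # SGD step with the average gradient

assert abs(step_from_sum - step_from_avg) < 1e-12
```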

Then in the training loop I would replace the line:

```
SGD(params, learning_rate)
```

with the following (assuming you've created an instance of the GRU class named `net` somewhere):

```
if (i + 1) % delay_rate == 0:  # update every delay_rate iterations
    SGD(net.params, learning_rate, delay_rate)
    # Now manually zero the accumulated gradients
    for param in net.params:
        param.grad[:] = 0
```

Note the two fixes relative to what you might be tempted to write: `i / delay_rate == 0` would only be true for the first few iterations, so the condition needs to be a modulo test, and with `grad_req='add'` the gradient buffers are never overwritten, so they must be zeroed by hand after each update.
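To see the schedule this produces, here is a hypothetical dry run in plain Python (no mxnet; the constant "gradient" of 0.5 is a stand-in for what `backward()` would accumulate into `param.grad`):

```python
delay_rate = 4
lr = 0.1
param = 1.0      # a single scalar "parameter"
grad_buf = 0.0   # stand-in for param.grad under grad_req='add'
update_iters = []

for i in range(8):
    grad_buf += 0.5                    # each iteration adds its gradient to the buffer
    if (i + 1) % delay_rate == 0:      # update only every delay_rate iterations
        param -= (lr / delay_rate) * grad_buf
        grad_buf = 0.0                 # manually zero the accumulated gradient
        update_iters.append(i)

# over 8 iterations the parameter is updated twice, at i = 3 and i = 7
```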

Hope this helps. By the way, I am a newbie in RNNs, I just started learning, so I don't know whether what I say needs modifications for your model.

Cheers