# Training loss never changes but accuracy oscillates

I am using MXNet to train a VQA model; the input is a `(6244,)` vector and the output is a single label.

During training, the loss never changes while the accuracy oscillates in a small range. The first 5 epochs are:

```
Epoch 1. Loss: 2.7262569132562255, Train_acc 0.06867348986554285
Epoch 2. Loss: 2.7262569132562255, Train_acc 0.06955649207304837
Epoch 3. Loss: 2.7262569132562255, Train_acc 0.06853301224162152
Epoch 4. Loss: 2.7262569132562255, Train_acc 0.06799116997792494
Epoch 5. Loss: 2.7262569132562255, Train_acc 0.06887417218543046
```

This is a multi-class classification problem, with each answer label standing for a class, so I use softmax as the final layer and cross-entropy to evaluate the loss. The relevant code is as follows.

So why does the loss never change? I take it directly from `cross_entropy`.

```python
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

epochs = 10
moving_loss = 0.
best_eva = 0
for e in range(epochs):
    for i, batch in enumerate(data_train):
        data1 = batch.data[0].as_in_context(ctx)
        data2 = batch.data[1].as_in_context(ctx)
        data = [data1, data2]
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            output = net(data)
            cross_entropy = loss(output, label)
        cross_entropy.backward()
        trainer.step(data[0].shape[0])

        moving_loss = np.mean(cross_entropy.asnumpy()[0])

    train_accuracy = evaluate_accuracy(data_train, net)
    print("Epoch %s. Loss: %s, Train_acc %s" % (e, moving_loss, train_accuracy))
```

The eval function is as follows:

```python
def evaluate_accuracy(data_iterator, net, ctx=mx.cpu()):
    numerator = 0.
    denominator = 0.
    metric = mx.metric.Accuracy()
    data_iterator.reset()
    for i, batch in enumerate(data_iterator):
        data1 = batch.data[0].as_in_context(ctx)
        data2 = batch.data[1].as_in_context(ctx)
        data = [data1, data2]
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            output = net(data)

        metric.update([label], [output])
    return metric.get()[1]
```

Hi.

You’re doing your accuracy evaluation within the `autograd.record()` scope in your `evaluate_accuracy` function. That’s throwing off your network gradients for the next optimization step. Take out the `with autograd.record()` line in the function and you should see your loss start to converge. Also there’s no need to call `data_iterator.reset()` in your eval function.

Feel free to post an update if that doesn't solve your issue or you run into other issues.

I tried the following variations and got these outputs.

If I keep `data_iterator.reset()` and remove the `with autograd.record()` in evaluation, the loss does not change and the accuracy becomes zero:

```
Epoch 1. Loss: 6.835763931274414, Train_acc 0.0
Epoch 2. Loss: 6.835763931274414, Train_acc 0.0
Epoch 3. Loss: 6.835763931274414, Train_acc 0.0
Epoch 4. Loss: 6.835763931274414, Train_acc 0.0
Epoch 5. Loss: 6.835763931274414, Train_acc 0.0
```

If I remove both `data_iterator.reset()` and the `with autograd.record()` in evaluation, the loss does not change and the accuracy becomes `nan`:

```
Epoch 1. Loss: 6.835763931274414, Train_acc nan
Epoch 2. Loss: 6.835763931274414, Train_acc nan
Epoch 3. Loss: 6.835763931274414, Train_acc nan
Epoch 4. Loss: 6.835763931274414, Train_acc nan
Epoch 5. Loss: 6.835763931274414, Train_acc nan
```
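A plausible explanation for the `nan` here: if I recall correctly, `mx.metric.Accuracy` reports a running ratio of correct predictions to total instances, and with the iterator already exhausted (no `reset()`), `metric.update()` is never called, so that ratio is effectively 0/0. A minimal sketch of this arithmetic (the variable names are hypothetical stand-ins for the metric's internal counters):

```python
import numpy as np

# Hypothetical sketch of the metric's internals: the accuracy is
# sum_metric / num_inst. If the eval iterator is never reset, it is
# already exhausted, update() never runs, num_inst stays 0, and
# 0/0 evaluates to nan in floating point.
sum_metric, num_inst = 0.0, 0.0
with np.errstate(invalid='ignore'):
    acc = np.float64(sum_metric) / num_inst
print(np.isnan(acc))  # True
```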

The `with autograd.record()` in the training step cannot be removed; otherwise an error is raised.

One thing I have to add: in the official VQA demo, https://gluon.mxnet.io/chapter08_computer-vision/visual-question-answer.html, `with autograd.record()` is used in the evaluation step and the loss still converges. I tested its code and got the following output:

```
Epoch 1. Loss: 2.0590806428121624, Train_acc 0.4791814630681818
Epoch 2. Loss: 1.7539432328664892, Train_acc 0.5143821022727273
Epoch 3. Loss: 1.4294043381950257, Train_acc 0.5496271306818182
Epoch 4. Loss: 1.1836000213868916, Train_acc 0.5796431107954545
Epoch 5. Loss: 1.1122687829740323, Train_acc 0.6065488873106061
```

I think I have to go back and study what `autograd.record` actually does.

I went back to the official tutorial, and it says `autograd.record` is something that holds the gradient. Doesn't that conflict with your statement that it throws off the gradients?

Hi, I meant that you should remove the `with autograd.record()` in the evaluation function ONLY, not the training loop. You don't need to record gradients to calculate accuracy, but you do need it in the training part to perform backprop and take an optimization step.

Although, I see that the tutorial also uses autograd in its evaluation function, so that's probably not your issue. It looks like you're missing the `data_train.reset()` line in your training loop, though.

Yes, I think I did exactly what you said; see the third post in this thread ("Training loss never changes but accuracy oscillates"). I did not originally specify that I only removed the `autograd` in evaluation, so I have added that to the post. Sorry for the unclearness.

I see. And did you try adding `data_train.reset()` in your training loop, like in the example? The example has:

```python
for e in range(epochs):
    data_train.reset()
    for i, batch in enumerate(data_train):
```

OHHHH… Thank you so much, I think this is what caused the problem.

By the way, could you explain why this problem occurs if we do not reset the dataset?

The model has begun to converge; all the problems were caused by the missing `reset()`.

One more question: how does training continue if we do not reset at the beginning of each epoch? By the end of the previous epoch, the iterator has already moved past the last batch of the data, so how can training continue without throwing an error?

My guess is that training does not continue after the first epoch, and that is exactly why `moving_loss` never changes. This can easily be tested by printing something inside the training loop.
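This guess can be sketched with a plain Python iterator standing in for the MXNet data iterator, which I believe behaves the same way once exhausted: a second for-loop over it simply yields nothing, so the training body never runs again and no error is raised.

```python
# Plain-Python sketch of the exhausted-iterator behavior. The list and
# its contents are hypothetical stand-ins for the batches of data_train.
data_train = iter([1, 2, 3])

first_epoch = [batch for batch in data_train]   # consumes everything
second_epoch = [batch for batch in data_train]  # already exhausted

print(first_epoch)   # [1, 2, 3]
print(second_epoch)  # [] -- the loop body never executes, no error
```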

That is reasonable, but then why does the accuracy oscillate? And one more observation: after the first epoch, the subsequent epochs still take about the same amount of time as the first.