# Training loss never changes but accuracy oscillates

I am using MXNet to train a VQA model; the input is a `(6244,)` vector and the output is a single label.

During training, the loss never changes while the accuracy oscillates in a small range. The first 5 epochs are:

```
Epoch 1. Loss: 2.7262569132562255, Train_acc 0.06867348986554285
Epoch 2. Loss: 2.7262569132562255, Train_acc 0.06955649207304837
Epoch 3. Loss: 2.7262569132562255, Train_acc 0.06853301224162152
Epoch 4. Loss: 2.7262569132562255, Train_acc 0.06799116997792494
Epoch 5. Loss: 2.7262569132562255, Train_acc 0.06887417218543046
```

This is a multi-class classification problem, with each answer label standing for a class, so I use softmax as the final layer and cross-entropy to evaluate the loss. The relevant code is as follows.

So why does the loss never change? I take it directly from `cross_entropy`.

```python
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

epochs = 10
moving_loss = 0.
best_eva = 0
for e in range(epochs):
    for i, batch in enumerate(data_train):
        data1 = batch.data[0].as_in_context(ctx)
        data2 = batch.data[1].as_in_context(ctx)
        data = [data1, data2]
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            output = net(data)
            cross_entropy = loss(output, label)
        cross_entropy.backward()
        trainer.step(data[0].shape[0])

        moving_loss = np.mean(cross_entropy.asnumpy()[0])

    train_accuracy = evaluate_accuracy(data_train, net)
    print("Epoch %s. Loss: %s, Train_acc %s" % (e, moving_loss, train_accuracy))
```

The eval function is as follows:

```python
def evaluate_accuracy(data_iterator, net, ctx=mx.cpu()):
    numerator = 0.
    denominator = 0.
    metric = mx.metric.Accuracy()
    data_iterator.reset()
    for i, batch in enumerate(data_iterator):
        data1 = batch.data[0].as_in_context(ctx)
        data2 = batch.data[1].as_in_context(ctx)
        data = [data1, data2]
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            output = net(data)

        metric.update([label], [output])
    return metric.get()[1]
```

Hi.

You’re doing your accuracy evaluation within the `autograd.record()` scope in your `evaluate_accuracy` function. That’s throwing off your network gradients for the next optimization step. Take out the `with autograd.record()` line in the function and you should see your loss start to converge. Also there’s no need to call `data_iterator.reset()` in your eval function.

Feel free to post an update if that doesn't solve your issue or you run into other issues.

I tried the following variations and got these outputs.

If I keep `data_iterator.reset()` and remove the `with autograd.record()` in evaluation, the loss does not change and the accuracy becomes zero:

```
Epoch 1. Loss: 6.835763931274414, Train_acc 0.0
Epoch 2. Loss: 6.835763931274414, Train_acc 0.0
Epoch 3. Loss: 6.835763931274414, Train_acc 0.0
Epoch 4. Loss: 6.835763931274414, Train_acc 0.0
Epoch 5. Loss: 6.835763931274414, Train_acc 0.0
```

If I remove both `data_iterator.reset()` and the `with autograd.record()` in evaluation, the loss does not change and the accuracy becomes `nan`:

```
Epoch 1. Loss: 6.835763931274414, Train_acc nan
Epoch 2. Loss: 6.835763931274414, Train_acc nan
Epoch 3. Loss: 6.835763931274414, Train_acc nan
Epoch 4. Loss: 6.835763931274414, Train_acc nan
Epoch 5. Loss: 6.835763931274414, Train_acc nan
```
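A plausible explanation for the `nan` here: if I recall correctly, `mx.metric.Accuracy` reports a running ratio of correct predictions to total instances, and with the iterator already exhausted (no `reset()`), `metric.update()` is never called, so that ratio is effectively 0/0. A minimal sketch of this arithmetic (the variable names are hypothetical stand-ins for the metric's internal counters):

```python
import numpy as np

# Hypothetical sketch of the metric's internals: the accuracy is
# sum_metric / num_inst. If the eval iterator is never reset, it is
# already exhausted, update() never runs, num_inst stays 0, and
# 0/0 evaluates to nan in floating point.
sum_metric, num_inst = 0.0, 0.0
with np.errstate(invalid='ignore'):
    acc = np.float64(sum_metric) / num_inst
print(np.isnan(acc))  # True
```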

The `with autograd.record()` in the training step cannot be removed; otherwise an error is raised.

One thing I have to add: in the official VQA demo, https://gluon.mxnet.io/chapter08_computer-vision/visual-question-answer.html, `with autograd.record()` is used in the evaluation step and the loss still converges. I tested its code and got the following output:

```
Epoch 1. Loss: 2.0590806428121624, Train_acc 0.4791814630681818
Epoch 2. Loss: 1.7539432328664892, Train_acc 0.5143821022727273
Epoch 3. Loss: 1.4294043381950257, Train_acc 0.5496271306818182
Epoch 4. Loss: 1.1836000213868916, Train_acc 0.5796431107954545
Epoch 5. Loss: 1.1122687829740323, Train_acc 0.6065488873106061
```

I think I have to go back and study what `autograd.record` actually does.

I went back to the official tutorial, and it says `autograd.record` is something that holds the gradient. Doesn't that conflict with your statement that it throws off the gradients?

Hi, I meant that you should remove the `with autograd.record()` in the evaluation function ONLY, not the training loop. You don't need to record gradients to calculate accuracy, but you do need it in the training part to perform backprop and take an optimization step.

Although, I see that the tutorial also uses autograd in its evaluation function, so that's probably not your issue. It looks like you're missing the `data_train.reset()` line in your training loop, though.

Yes, I think I did exactly what you said; see the third post in this thread ("Training loss never changes but accuracy oscillates"). I did not originally specify that I only removed the `autograd` in evaluation, so I have added that to the post. Sorry for the unclearness.

I see. And did you try adding `data_train.reset()` in your training loop, like in the example? The example has:

```python
for e in range(epochs):
    data_train.reset()
    for i, batch in enumerate(data_train):
```

OHHHH… Thank you so much, I think this is what caused the problem.

By the way, could you explain why this problem occurs if we do not reset the dataset?

The model has begun to converge; all the problems were caused by the missing `reset()`.

One more question: how does training continue if we do not reset at the beginning of each epoch? By the end of the previous epoch, the iterator has already moved past the last batch of the data, so how can training continue without throwing an error?

My guess is that training does not continue after the first epoch, and that is exactly why `moving_loss` never changes. This can easily be tested by printing something inside the training loop.
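This guess can be sketched with a plain Python iterator standing in for the MXNet data iterator, which I believe behaves the same way once exhausted: a second for-loop over it simply yields nothing, so the training body never runs again and no error is raised.

```python
# Plain-Python sketch of the exhausted-iterator behavior. The list and
# its contents are hypothetical stand-ins for the batches of data_train.
data_train = iter([1, 2, 3])

first_epoch = [batch for batch in data_train]   # consumes everything
second_epoch = [batch for batch in data_train]  # already exhausted

print(first_epoch)   # [1, 2, 3]
print(second_epoch)  # [] -- the loop body never executes, no error
```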

That is reasonable, but then why does the accuracy oscillate? And one more observation: after the first epoch, the subsequent epochs still take about the same amount of time as the first.