How do I normalize the softmax output, and how does the accuracy metric work?

I am working on a VQA project and have basically 2 questions now.

First, let me introduce the dataset. Every training question has 3 answers, so I feed the samples to the model as (question, ans1), (question, ans2), (question, ans3). Since a softmax prediction gives only one answer per question, the accuracy can be at most 0.33.

I use loss = gluon.loss.SoftmaxCrossEntropyLoss() as the training loss and mx.metric.Accuracy() as the evaluation metric, updated with metric.update([label], [output]), where label is the training answer and output is the softmax vector over all possible answers.

The training loop computes the loss as

cross_entropy = loss(output, label)

Here is something really strange: I tested with just 3 samples and got an accuracy of 73% after 10 epochs, even though the accuracy on my dataset can be at most 0.33. To investigate, I ran the model on the training data, and it gives really strange answers.

Here is my training data

what is in front of the chair,mirror,pool,shelf,
what is the color of the person's clothes in video,blue,dark blue,black blue,
what is the person doing in video,cleaning up,wiping mirror,washing cup,
where is the person in video,indoor,washroom,residence,
is the person sitting or standing in the video,standing,standing,standing

And here are my predictions (each training question has 3 answers, and I predict only the one with the maximum softmax value)

what is in front of the chair,shelf,
what is the color of the person's clothes in video,cleaning up,
what is the person doing in video,washroom,
where is the person in video,kissing,
is the person sitting or standing in the video,light white

I use np.argmax to get the answer from the softmax layer. Printing the softmax value of each predicted answer gives

answer is shelf with softmax [15.491705] <NDArray 1 @cpu(0)>
answer is cleaning up with softmax [8.109538] <NDArray 1 @cpu(0)>
answer is washroom with softmax [8.194625] <NDArray 1 @cpu(0)>
answer is kissing with softmax [7.8190136] <NDArray 1 @cpu(0)>
answer is light white with softmax [6.411439] <NDArray 1 @cpu(0)>

So my 2 questions are: 1) The accuracy is obviously not as high as 73%, so how does metric.update() evaluate the accuracy? 2) How can a softmax value be greater than 1 or negative? Isn't it normalized? The official Accuracy documentation describes the predictions as "Prediction values for samples. Each prediction value can either be the class index, or a vector of likelihoods for all classes." and the metric only considers the class with the maximum likelihood. How can that work if the likelihoods are above 1?

I know this is a lot to go through, so if anyone could explain the question in bold first, I may be able to debug the rest of the code from there. Thank you!

  1. I probably would not use the Accuracy metric at all, since a metric should show how well your model works. In your case all 3 answers are valid. By splitting the data into one sample per answer, you effectively make the term "accuracy" meaningless. I recommend calculating accuracy separately on the original dataset (before the split) by checking whether the predicted answer is in the list of accepted answers. That doesn't change the way you calculate the loss function.

  2. What does your model output? Is it a softmax over your whole vocabulary? Softmax outputs are always non-negative and sum to 1. Check this out:

import mxnet as mx

a = mx.nd.array([-1, 15, 0.4])
b = a.softmax()  # b is [1.12535112e-07 9.99999404e-01 4.56352183e-07]
c = b.sum()      # c is 1 (up to floating-point rounding)

So I am curious: how exactly do you get the softmax values? You aren't treating SoftmaxCrossEntropyLoss as if it output the softmax, right?
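To make point 1 concrete, here is a minimal sketch in plain Python (no MXNet) that contrasts the two ways of counting. The helper names and the toy scores are made up for illustration. The "split" version mimics comparing the argmax of each output with a single label, one sample per (question, answer) pair; the "per question" version counts a prediction as correct if it matches any accepted answer.

```python
# Hypothetical helpers and toy data; plain Python, not the MXNet metric itself.

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def split_accuracy(outputs, labels):
    """Accuracy over (question, answer) pairs: argmax must equal the one label."""
    hits = sum(1 for out, lab in zip(outputs, labels) if argmax(out) == lab)
    return hits / len(labels)

def question_accuracy(outputs, accepted):
    """Accuracy over questions: argmax must be one of the accepted answers."""
    hits = sum(1 for out, ok in zip(outputs, accepted) if argmax(out) in ok)
    return hits / len(accepted)

# Toy data: 2 questions, 4 answer classes, 3 accepted answers per question.
scores_q1 = [0.1, 2.0, 0.3, -1.0]  # model picks class 1
scores_q2 = [1.5, 0.2, 0.1, 0.0]   # model picks class 0
accepted = [{0, 1, 2}, {1, 2, 3}]

# Split view: each question is repeated once per accepted answer.
split_outputs = [scores_q1] * 3 + [scores_q2] * 3
split_labels = [0, 1, 2, 1, 2, 3]

acc_split = split_accuracy(split_outputs, split_labels)
acc_question = question_accuracy([scores_q1, scores_q2], accepted)
print(acc_split, acc_question)
```

In the split view a question with three distinct answers can contribute at most one hit out of three, which is why that number is capped, while the per-question view is not.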

I don't think I did that. I just followed the official VQA demo and use this as the final layer

self.fc2 = nn.Dense(num_category)

and use the loss as

loss = gluon.loss.SoftmaxCrossEntropyLoss()

the update of the model is

with autograd.record():
    output = net(data)
    cross_entropy = loss(output, label)

Nothing more. Judging by the results of the official demo, the predictions are not affected even though the output has not been normalized with softmax.

So we can actually use SoftmaxCrossEntropyLoss as the loss and have something other than a softmax layer as the output, am I right? (At first I thought that if I used SoftmaxCrossEntropyLoss, the final layer would automatically be normalized with softmax.)

Thanks for providing the reference.

Yes, fc2 returns raw scores (logits), not softmax probabilities. If you want softmax values from the output, you should write output.softmax().

While that is technically more correct, it won't change the prediction. If you look at the VQA example, they use argmax to get the final results: output = np.argmax(output.asnumpy(), axis = 1). The argmax of the softmaxed result will be the same.

They don't apply softmax in the network itself precisely because they use SoftmaxCrossEntropyLoss. If you look at the documentation, it applies softmax to the predictions internally before calculating the loss.

So, answering your questions:

  1. Yes, we can use SoftmaxCrossEntropyLoss, but we shouldn't apply softmax to the output before feeding it to the loss, to avoid applying softmax twice.

  2. When calculating the final output, we can apply softmax if we want a probability-like distribution over the results. If we don't care about probabilities and just want to use nd.argmax to get the most probable prediction, we can skip the softmax, because argmax(output) produces the same result as argmax(softmax(output)).
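Both points can be checked numerically. The following is a minimal sketch in plain Python, not the MXNet implementation: the function names and the logit values are made up, and softmax_cross_entropy only illustrates the general idea of a loss that applies softmax internally to raw scores.

```python
import math

def softmax(logits):
    """Numerically stable softmax: non-negative values that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(logits, label):
    """Cross-entropy computed directly from raw scores (softmax is applied inside)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 15.5, -1.0, 0.4]  # made-up raw scores; they may exceed 1 or be negative
probs = softmax(logits)

# Softmax is monotonic, so the winning class does not change.
same_winner = argmax(logits) == argmax(probs)

# The loss from raw scores equals -log of the softmax probability of the label.
loss_from_logits = softmax_cross_entropy(logits, 0)
loss_from_probs = -math.log(probs[0])

print(sum(probs), same_winner, loss_from_logits, loss_from_probs)
```

Raw scores such as 15.49 in the printed output are therefore expected when the last layer is a plain Dense layer; only the softmaxed values are bounded by 1.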

That completely answers my questions, thank you very much!

One more question in addition to these 2: why could the original accuracy reach 73% (and actually 99% at convergence)?

First, let me explain: this was caused by a mistake in the training labels, but I am still curious why it produced this result. As you have seen, each question has 3 answers, and these 3 answers may differ. Say the 3 training samples of one question form a batch. By mistake, the labels fed to the model were the same for every question, as shown below, where ans1 in line 1 and ans1 in line 2 are the same variable

question1, ans1, question1, ans2, question1, ans3
question2, ans1, question2, ans2, question2, ans3

So the question batch is (question1, question1, question1) and the answer batch is (ans1, ans2, ans3). During evaluation, the metric compares (my_ans1, my_ans2, my_ans3) with (ans1, ans2, ans3). For the accuracy to be 99%, the model must output (ans1, ans2, ans3), and it really does output these answers.

So the training batch is (question1, question1, question1), 3 identical inputs. I expected 3 identical answers, maybe all ans1, all ans2, or all ans3, but the model gives different answers. (Given the labels this is "right", because in this data the answer is tied to the position in the batch: the first item gets ans1, the second gets ans2.) So my final question is: does training use the position of a sample in the batch as a feature? Otherwise, how could it achieve the result above?

Training doesn’t take data index in batch as an attribute. The only information training receives is the information you explicitly pass in data when calling output = net(data).

It is hard for me to explain why this happens. The only thing I can assume is that some information about the index is passed to the network when the original question is transformed into an ndarray. Is there a pattern if you provide question1, question2, … outside the training loop, during evaluation? For example, do you always receive the first of the 3 possible answers?
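One way to see why the batch position alone cannot matter: a minimal sketch in plain Python, assuming a network made only of per-row operations such as a dense layer (no batch normalization, and no dropout in training mode). The weights and the encoded question are made-up values, and the function names are hypothetical.

```python
def dense(row, weights, bias):
    """One dense layer applied to a single input row."""
    return [sum(w * x for w, x in zip(w_row, row)) + b
            for w_row, b in zip(weights, bias)]

def forward(batch, weights, bias):
    # Each row is processed independently; its position in the batch is never used.
    return [dense(row, weights, bias) for row in batch]

weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]  # made-up parameters
bias = [0.1, -0.2]
question1 = [1.0, 2.0, 3.0]                      # made-up encoded question

# A batch of three identical rows yields three identical outputs.
outputs = forward([question1, question1, question1], weights, bias)
print(outputs)
```

Under these assumptions, if identical questions in one batch receive different predictions, the rows reaching the network are probably not identical, so comparing the encoded inputs row by row is a reasonable first check.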