Shouldn't the masked softmax loss output be 2.30126 for both the first and the second sequence?
The first sequence has 4 valid elements, each with a loss of 2.30126,
and the second sequence has 2 valid elements, each with a loss of 2.30126;
dividing by their valid lengths gives 2.30126 × 4 / 4 = 2.30126 and 2.30126 × 2 / 2 = 2.30126.
It seems to me the sum is divided by 4, which includes the padding length, and that makes no sense!
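
To make the arithmetic concrete, here is a minimal sketch of the two normalizations in plain Python. It is not the book's implementation. It assumes uniform predictions over a hypothetical 10-token vocabulary (so each position costs ln 10 ≈ 2.3026, close to the value quoted above) and a padded length of 4; `masked_losses` is a name made up for this illustration.

```python
import math

# Assumption: uniform predictions over a hypothetical 10-token vocabulary,
# so the cross-entropy at every position is ln(10).
VOCAB_SIZE = 10
PADDED_LEN = 4
per_token_loss = math.log(VOCAB_SIZE)


def masked_losses(valid_len, padded_len=PADDED_LEN):
    """Return (mean over padded length, mean over valid length)."""
    # Mask: 1 for real tokens, 0 for padding positions.
    mask = [1.0 if t < valid_len else 0.0 for t in range(padded_len)]
    total = sum(per_token_loss * m for m in mask)
    mean_over_padded = total / padded_len
    mean_over_valid = total / valid_len if valid_len > 0 else 0.0
    return mean_over_padded, mean_over_valid


results = {n: masked_losses(n) for n in (4, 2)}
for n, (padded, valid) in results.items():
    print(f"valid_len={n}: /padded={padded:.4f}  /valid={valid:.4f}")
# valid_len=4: /padded=2.3026  /valid=2.3026
# valid_len=2: /padded=1.1513  /valid=2.3026
```

Under these assumptions, dividing by the padded length halves the loss of the second sequence, while dividing by the valid length gives the same per-token value for both sequences.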