Help with simple classification

I have written a simple program to test MXNet, the same one I have already written for other frameworks such as TensorFlow and PyTorch.
import time
from mxnet import autograd, gluon
from mxnet.gluon import nn

net = nn.Sequential()
with net.name_scope():


trainer = gluon.Trainer(
    net.collect_params(), 'sgd', {'learning_rate': 0.1})

def acc(output, label):
    # output: (batch, num_output) float32 ndarray
    # label: (batch, ) int32 ndarray
    # argmax over axis 1 (the class axis) gives one predicted class per sample
    return (output.argmax(axis=1) == label.astype('float32')).mean().asscalar()

for epoch in range(100):
    train_loss, train_acc, valid_acc = 0.0, 0.0, 0.0
    tic = time.time()
    for data, label in training:
        # forward pass, recorded so that autograd can compute gradients
        with autograd.record():
            output = net(data)
            loss = softmax_crossentropy(output, label)
        # backward pass and parameter update
        loss.backward()
        trainer.step(data.shape[0])
        train_loss += loss.mean().asscalar()
        train_acc += acc(output, label)
    for data, label in testing:
        valid_acc += acc(net(data), label)
    print("Epoch %d: loss %.3f, train acc %.3f, test acc %.3f, in %.1f sec" % (
            epoch, train_loss/len(train_data), train_acc/len(train_data),
            valid_acc/len(testing), time.time()-tic))
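
As a sketch of what the `acc` helper is meant to compute: for an output of shape (batch, num_output), the predicted class of each sample is the index of the largest score along the class axis (axis 1), and one-hot labels have to be turned into class indices before they can be compared with those predictions. The plain-Python version below only illustrates the idea. The helper names `argmax`, `accuracy` and `one_hot_to_index` and the sample numbers are made up for this example and are not part of the original program.

```python
def argmax(values):
    """Return the index of the largest element of a flat list."""
    return max(range(len(values)), key=lambda i: values[i])


def one_hot_to_index(one_hot_labels):
    """Convert rows such as [0, 1, 0] into class indices such as 1."""
    return [argmax(row) for row in one_hot_labels]


def accuracy(outputs, labels):
    """Fraction of samples whose highest-scoring class equals the label.

    outputs: list of per-sample score lists, shape (batch, num_output)
    labels:  list of integer class indices, shape (batch,)
    """
    predictions = [argmax(row) for row in outputs]  # one class per sample
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical scores for 4 samples and 3 classes.
outputs = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.0, 0.1, 3.0],
    [0.9, 0.8, 0.7],
]
one_hot = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]

labels = one_hot_to_index(one_hot)
print(labels)                     # [1, 0, 2, 1]
print(accuracy(outputs, labels))  # 0.75
```

In MXNet the per-sample prediction corresponds to `output.argmax(axis=1)`; taking the argmax over axis 0 would instead pick one sample index per class.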

My problem is that the loss keeps getting lower, but the accuracy never goes up. The network doesn't seem to improve over time; it just repeats the same results without any optimization.
I'm very new to MXNet, so most likely I made a big mistake somewhere.
The data is very simple, and the labels are one-hot encoded.
Based on this great help, I noticed I was calculating the accuracy and loss incorrectly when printing, but the behaviour is still rather strange:
the loss keeps getting lower, but the accuracy is still stuck around 0.1.
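
One way the printed numbers can come out wrong is sketched below. It assumes (the post does not say so explicitly) that `training` yields batches, that `train_data` is the underlying dataset, and that each accumulator adds up one mean value per batch. Under those assumptions the sum has to be divided by the number of batches, not by the number of samples. The values are invented for illustration.

```python
# Hypothetical setup: 12 samples split into 3 batches of 4.
num_samples = 12
batch_accuracies = [0.5, 0.75, 1.0]  # one mean accuracy per batch

# Accumulate the per-batch means, as the training loop does.
total = sum(batch_accuracies)

# Dividing by the number of batches gives the average accuracy.
per_batch = total / len(batch_accuracies)

# Dividing by the number of samples shrinks the result by the batch size.
per_sample = total / num_samples

print(per_batch)   # 0.75
print(per_sample)  # 0.1875
```

With a larger batch size the second figure would be even smaller, so a reported accuracy that looks stuck at a low value may partly be an artifact of the divisor.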

For other beginners like me: my impression was that the autograd-based approach is deprecated and that the MXNet Module API is the recommended way, so I converted the whole code to use Module.
It ended up like this and worked well for me:
import logging
import mxnet as mx

# The first argument of each iterator was missing from the post;
# train_data and test_data are assumed names.
train_iter = mx.io.NDArrayIter(train_data, train_label, batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(test_data, test_label, batch_size)
mdata = mx.sym.var('data')
fc1 = mx.sym.FullyConnected(data=mdata, num_hidden=128)
act1 = mx.sym.Activation(data=fc1, act_type="relu")

# The second fully-connected layer and the corresponding activation function
fc2 = mx.sym.FullyConnected(data=act1, num_hidden=64)
act2 = mx.sym.Activation(data=fc2, act_type="relu")

# MNIST has 10 classes
fc3 = mx.sym.FullyConnected(data=act2, num_hidden=10)

# Softmax with cross entropy loss
mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax')
logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout

# create a trainable module on compute context
ctx = mx.gpu() if mx.context.num_gpus() else mx.cpu()
progress = mx.callback.ProgressBar(len(train_data)/batch_size, 40)
mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
mlp_model.fit(train_iter,  # train data
              eval_data=val_iter,  # validation data
              optimizer='sgd',  # use SGD to train
              optimizer_params={'learning_rate': 0.01},  # use fixed learning rate
              eval_metric='acc',  # report accuracy during training
              # batch_end_callback=progress,  # show a progress bar during training
              num_epoch=100)  # train for at most 100 dataset passes

For future reference, and for people like me:
the last layer doesn't need to have a single output for softmax cross-entropy to work.
This one was enough to make the loss function work properly.
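
To illustrate why the output layer carries one score per class, here is a minimal plain-Python sketch of softmax cross-entropy. It also shows that a one-hot label and the equivalent class index give the same loss, as long as the loss function is told which form it receives. The function names and numbers are invented for this example. In Gluon, `SoftmaxCrossEntropyLoss` has a `sparse_label` flag for choosing between the two label forms; whether that was the fix used here is not stated in the post.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_sparse(logits, class_index):
    """Loss when the label is an integer class index."""
    return -math.log(softmax(logits)[class_index])


def cross_entropy_one_hot(logits, one_hot):
    """Loss when the label is a one-hot vector over the classes."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


# Hypothetical scores from a 3-class output layer for one sample.
logits = [2.0, 0.5, -1.0]

loss_sparse = cross_entropy_sparse(logits, 0)
loss_one_hot = cross_entropy_one_hot(logits, [1.0, 0.0, 0.0])

# Both label forms describe class 0, so the losses agree.
print(abs(loss_sparse - loss_one_hot) < 1e-12)  # True
```

Passing a one-hot label to a loss that expects class indices (or the reverse) is a mismatch that can let the loss value fall while the accuracy stays near chance.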