MXNet (Python) version of Keras MLP doesn't learn

I’m working on a binary classification problem with the Pima Indians dataset. My MLP in Keras reaches about 85% accuracy, but the MXNet version of the same network only reaches about 63% and outputs mostly constant values. I’ve tried normalising the data, scaling it, and changing the activations, the number of neurons, the batch size and the number of epochs, but nothing helps. Any suggestions on what’s causing such a large difference?

Here’s the MXNet code:

import mxnet as mx

batch_size = 10

# Wrap the arrays in NDArrayIter so the Module API can consume them
train_iter = mx.io.NDArrayIter(mx.nd.array(df_train), mx.nd.array(y_train),
                               batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(mx.nd.array(df_test), mx.nd.array(y_test), batch_size)

data = mx.sym.var('data')

# Two hidden ReLU layers with 12 and 8 units
fc1 = mx.sym.FullyConnected(data=data, num_hidden=12)
act1 = mx.sym.Activation(data=fc1, act_type='relu')

fc2 = mx.sym.FullyConnected(data=act1, num_hidden=8)
act2 = mx.sym.Activation(data=fc2, act_type='relu')

# Two output units with a softmax cross-entropy loss
fcfinal = mx.sym.FullyConnected(data=act2, num_hidden=2)
mlp = mx.sym.SoftmaxOutput(data=fcfinal, name='softmax')

mlp_model = mx.mod.Module(symbol=mlp, context=mx.cpu())
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='sgd',
              eval_metric='ce',
              num_epoch=150)

Here’s the Keras code:

from keras.models import Sequential
from keras.layers import Dense

# Same hidden architecture, but a single sigmoid output unit
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(df_train_res, y_train_res)

Hi,

The following are not a solution, but hints that may help with debugging. (I haven’t gone into the dataset to understand how it may affect the loss values; I understand you have a binary classification problem.)

The two versions are not identical:

a) The two networks have different dimensionality in the last layer and therefore compute different loss values. Have you compared the value of the loss function, in both cases, for a specific datum? In the Keras version you output a single probability of true/false; in the MXNet implementation you compute probabilities for two classes (again true/false). Off the top of my head, I don’t know whether ‘ce’ and binary_crossentropy give the same numerical output for these two cases. You can certainly modify the MXNet network to apply a sigmoid to the last layer instead; see the first sketch after this list.
b) You are using different optimizers (have you tried adam in the MXNet version?). If you do, make sure both implementations use the same default hyperparameter values; see the second sketch below. I would keep the batch size and the network architecture identical in both versions for all tests.
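
For a), here is a minimal sketch of how the MXNet head could be changed to mirror the Keras one: a single output unit passed through LogisticRegressionOutput, whose backward pass (sigmoid(x) - label) corresponds to binary cross-entropy on a sigmoid output. Keeping name='softmax' is only so NDArrayIter’s default label name ('softmax_label') still binds; the ‘ce’ eval metric may also need rethinking with this head, since it expects one probability column per class.

# Single output unit + sigmoid, matching the Keras head
fcfinal = mx.sym.FullyConnected(data=act2, num_hidden=1)
# LogisticRegressionOutput's gradient is sigmoid(x) - label,
# i.e. the gradient of binary cross-entropy on a sigmoid output
mlp = mx.sym.LogisticRegressionOutput(data=fcfinal, name='softmax')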
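
For b), a sketch of the same fit call with adam, pinning the hyperparameters to Keras’s defaults (learning rate 0.001, beta_1 0.9, beta_2 0.999) rather than relying on each framework’s own defaults:

mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='adam',
              # explicitly match Keras's Adam defaults
              optimizer_params={'learning_rate': 0.001,
                                'beta1': 0.9,
                                'beta2': 0.999},
              eval_metric='ce',
              num_epoch=150)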

Hope the above helps.