Simple float16 example not working

olivcruche · December 3, 2018, 6:52pm

Hi, I’d like to showcase the benefit of mixed-precision training. I have this simple net:

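import time

import mxnet as mx
from mxnet import autograd, gluon, nd

# num_fc, num_outputs and ctx are defined elsewhere (not shown in this post)
def BuildNet():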

    net = gluon.nn.HybridSequential()
    
    with net.name_scope():
    
        net.add(gluon.nn.Conv2D(channels=20, kernel_size=3, activation='relu'))
        net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
        net.add(gluon.nn.Conv2D(channels=50, kernel_size=3, activation='relu'))
        net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
    
        # The Flatten layer collapses all axes, except the first one, into one axis.
        net.add(gluon.nn.Flatten())
        net.add(gluon.nn.Dense(num_fc, activation="relu"))
        net.add(gluon.nn.Dropout(.3))
        net.add(gluon.nn.Dense(num_outputs))
        
    return net

This runs fine:

net = BuildNet()

# Parameter initialization
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

# Softmax cross-entropy loss
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

# Optimizer
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})

# Training loop
epochs = 3
smoothing_constant = .01

curr_loss = mx.nd.zeros((1,), ctx=ctx)
for e in range(epochs):
    tick = time.time()
    for i, (data, label) in enumerate(train_data):

        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)

        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)

        loss.backward()
        trainer.step(data.shape[0])

        ##########################
        #  Accumulate the mean loss of each batch
        ##########################
        curr_loss += nd.mean(loss)

    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)

    print("Epoch {}. Loss: {}, Train_acc {}, Test_acc {}, {:.4f}" 
          .format(e, curr_loss.asscalar()/len(train_data), train_accuracy, test_accuracy, time.time()-tick))

This version raises an error. The only changes are casting the net and the data to float16:

net = BuildNet()

# Parameter initialization
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

net.cast('float16')

# Softmax cross-entropy loss
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

# Optimizer
trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': .1,
                      'multi_precision': True})

# Training loop
epochs = 3
smoothing_constant = .01

curr_loss = mx.nd.zeros((1,), ctx=ctx)
for e in range(epochs):
    tick = time.time()
    for i, (data, label) in enumerate(train_data):

        data = data.as_in_context(ctx).astype('float16')
        label = label.as_in_context(ctx).astype('float16')

        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)

        loss.backward()
        trainer.step(data.shape[0])

        ##########################
        #  Accumulate the mean loss of each batch
        ##########################
        curr_loss += nd.mean(loss)

    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)

    print("Epoch {}. Loss: {}, Train_acc {}, Test_acc {}, {:.4f}" 
          .format(e, curr_loss.asscalar()/len(train_data), train_accuracy, test_accuracy, time.time()-tick))

I followed the guide at https://mxnet.incubator.apache.org/faq/float16.html quite carefully.
What is wrong?

The error is:

MXNetError: [18:50:32] src/operator/contrib/../elemwise_op_common.h:133: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 1-th input: expected float32, got float16

olivcruche · December 3, 2018, 7:42pm

I think the error was in my evaluate_accuracy function: it was using float32. I switched it to float16 and things now run fine. However, it is still confusing: the runtime is exactly the same as for the float32 version. Why is that?
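The evaluate_accuracy function itself was not posted, so the following is only a sketch of the kind of fix described above. It assumes the function iterates over (data, label) batches of MXNet NDArrays; the ctx and dtype parameters and the cast of both operands to int32 are assumptions made for illustration. The idea is that the input must match the dtype of the cast network, and that both sides of an elementwise comparison must share a dtype.

```python
def evaluate_accuracy(data_iterator, net, ctx, dtype="float16"):
    """Fraction of correctly classified samples.

    Sketch only: expects batches of MXNet NDArrays and a Gluon net whose
    parameters were cast to `dtype` (hypothetical signature, not the
    original poster's function).
    """
    correct = 0
    total = 0
    for data, label in data_iterator:
        # The input must have the same dtype as the network parameters.
        data = data.as_in_context(ctx).astype(dtype)
        label = label.as_in_context(ctx)
        # Index of the largest output per sample (axis 1 is the class axis).
        predictions = net(data).argmax(axis=1)
        # Elementwise ops need matching dtypes, so compare both as int32.
        matches = predictions.astype("int32") == label.astype("int32")
        correct += int(matches.sum().asscalar())
        total += label.shape[0]
    return correct / total
```

With this signature the calls in the training loop would become evaluate_accuracy(test_data, net, ctx) and evaluate_accuracy(train_data, net, ctx).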

sad · December 4, 2018, 12:12am

I’m not exactly sure what you mean by “runtime is exactly the same as float32”. Do you mean how long the model takes to execute, or something else? If it’s the former, you might want to try multiple passes so that the performance gain is easier to identify.
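One way to act on the "multiple passes" suggestion is a small timing helper like the sketch below. The names time_passes and run_once are hypothetical; run_once stands for whatever callable executes one pass (for example one epoch). Because MXNet queues operations asynchronously, run_once should end with a synchronizing call such as mx.nd.waitall(), otherwise the timer may stop before the work has finished.

```python
import statistics
import time


def time_passes(run_once, passes=5, warmup=1):
    """Time `run_once` over several passes, discarding warm-up runs.

    Returns (mean_seconds, best_seconds) over the timed passes.
    """
    for _ in range(warmup):
        run_once()  # first runs often include one-off setup cost
    durations = []
    for _ in range(passes):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), min(durations)
```

Running the same helper once on the float32 setup and once on the float16 setup gives two comparable pairs of numbers instead of a single noisy measurement.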

olivcruche · December 4, 2018, 9:12am

Yes, how long the model takes. I’ll give more epochs a try.

olivcruche · December 4, 2018, 11:11am

I think this happened because of 4 things:

1. The model was too simple (3 conv3 + 1 fc128), so the bulk of the time was spent in IO.
2. I had manually coded the train_acc and validation_acc functions, which are probably less efficient than the built-in accuracy = mx.metric.Accuracy().
3. train_acc and validation_acc were called at each epoch and took up to 50% of the epoch runtime.
4. 3 epochs may be too few to realize gains.
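A breakdown like the one above (IO vs. training vs. accuracy evaluation) can be measured rather than guessed. The sketch below is a generic, framework-independent timer; PhaseTimer and the phase names are hypothetical, and the sleeps merely stand in for real work. With MXNet, each phase would need to end with a synchronizing call such as mx.nd.waitall() for the numbers to be meaningful.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase of a training loop."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {name: 0.0 for name in self.totals}
        return {name: t / total for name, t in self.totals.items()}


timer = PhaseTimer()
with timer.phase("data"):
    time.sleep(0.01)  # stand-in for loading a batch
with timer.phase("metrics"):
    time.sleep(0.01)  # stand-in for an accuracy evaluation
print(timer.shares())
```

If the "metrics" or "data" share dominates, a faster forward/backward pass in float16 will barely change the total epoch time.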

I switched to ResNets (50 and 152), used the built-in accuracy metric, and removed the validation accuracy measurement. The difference is now more visible: 36.3s vs 50.5s over 5 epochs, a 28% improvement.
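For reference, the 28% figure can be reproduced from the two timings, assuming 36.3s is the float16 run and 50.5s the float32 run (the post does not label them explicitly). The variable names are illustrative.

```python
baseline_s = 50.5  # assumed float32 run, 5 epochs (from the post)
fp16_s = 36.3      # assumed float16 run, 5 epochs (from the post)

# Fraction of the baseline runtime that was saved.
time_saved = (baseline_s - fp16_s) / baseline_s
# Equivalent throughput ratio.
speedup = baseline_s / fp16_s

print("time saved: {:.1%}".format(time_saved))
print("speedup: {:.2f}x".format(speedup))
```

The same numbers can be read either as roughly 28% less time or as roughly a 1.39x speedup, depending on which ratio is reported.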
