How to resume training with optimizer state

I want to implement a "resume training" feature for my training program, but I don't know how to correctly restore the optimizer state.

My program is like this:

opt = mx.optimizer.create('sgd', learning_rate=lr, ...)  # mx.optimizer is a module; create() builds the optimizer

ctx = [...]
sym = get_symbol()  # this function defines the network
model = mx.mod.Module(symbol=sym, context=ctx)

model.fit(...)

Now I want to save the model after training 1k steps and then resume it from the checkpoint. Since the optimizer state also needs to be restored (e.g., the per-parameter momentum of a momentum-based optimizer), I use the mxnet.mod.Module API. The code to perform saving and loading is:

##### save #####
def batch_callback(params):
    # params is a BatchEndParam namedtuple (epoch, nbatch, eval_metric, locals);
    # nbatch is the batch count within the current epoch
    if params.nbatch == 1000:
        model.save_checkpoint(prefix, 0, save_optimizer_states=True)
        sys.exit(0)

The batch_callback is passed to model.fit() via its batch_end_callback argument, as shown below.
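For reference, the registration looks roughly like this (train_iter and num_epoch stand in for my actual data iterator and epoch count):

model.fit(train_iter, num_epoch=num_epoch,
          optimizer=opt,
          batch_end_callback=batch_callback)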

##### load #####
model = mx.mod.Module.load(prefix, 0, load_optimizer_states=True)
model.bind(...)
arg_params, aux_params = model.get_params()

model.fit(optimizer=opt,
          optimizer_params=(('learning_rate', args.lr),),  # tuple of (name, value) pairs
          arg_params=arg_params, aux_params=aux_params,
          batch_end_callback=batch_callback)

However, I find that the model is not correctly resumed: the results are quite bad. I am not sure, but it seems that the model parameters are randomly initialized rather than loaded from the checkpoint.
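One way to check that suspicion (a sketch; it relies on save_checkpoint() writing the parameters to prefix + '-0000.params' with 'arg:'/'aux:' name prefixes):

# Sketch: compare the checkpointed weights with what the module holds now.
saved = mx.nd.load(prefix + '-0000.params')
arg_params, _ = model.get_params()
for name, array in arg_params.items():
    diff = (saved['arg:' + name] - array).abs().sum().asscalar()
    print(name, diff)  # a large non-zero diff means the weights were re-initialized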

So, what is the correct way to resume training, including the optimizer state?

The code snippets look ok. Can you provide a small reproducible example so that I can debug the issue?

In general, saving and loading a model together with its optimizer states can be done the following way:

Save:

model.save_checkpoint("test", 0, save_optimizer_states=True)

Load:

model = mx.mod.Module.load("test", 0, load_optimizer_states=True)
model.bind(data_shapes=train_iter.provide_data, label_shapes=train_iter.provide_label)
model.init_optimizer(optimizer='sgd', optimizer_params=(('learning_rate', 0.1), ))
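
Here is a complete, minimal sketch of the round trip (the synthetic data, the tiny one-layer network, and SGD with momentum are illustrative assumptions; the prefix "test" matches the snippet above):

import mxnet as mx
import numpy as np

# Synthetic data and a tiny network, purely for illustration.
data = np.random.uniform(size=(100, 10))
label = np.random.randint(0, 2, size=(100,))
train_iter = mx.io.NDArrayIter(data, label, batch_size=10)

net = mx.sym.Variable('data')
net = mx.sym.FullyConnected(net, name='fc1', num_hidden=2)
net = mx.sym.SoftmaxOutput(net, name='softmax')

# Train briefly, then checkpoint together with the optimizer states.
model = mx.mod.Module(symbol=net, context=mx.cpu())
model.fit(train_iter, num_epoch=1, optimizer='sgd',
          optimizer_params=(('learning_rate', 0.1), ('momentum', 0.9)))
model.save_checkpoint("test", 0, save_optimizer_states=True)

# Resume: load parameters and optimizer states, bind, re-create the
# optimizer (it picks up the preloaded states), then keep training.
model = mx.mod.Module.load("test", 0, load_optimizer_states=True)
model.bind(data_shapes=train_iter.provide_data,
           label_shapes=train_iter.provide_label)
model.init_optimizer(optimizer='sgd',
                     optimizer_params=(('learning_rate', 0.1), ('momentum', 0.9)))
model.fit(train_iter, num_epoch=1)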