Restarting training across multiple machines if one of the machines dies

I am running ResNet training on multiple machines using the MXNet kvstore. If one of the machines dies (and gets restarted), is there a way to restart training on that machine without having to shut down training on the other machines and restart training on all machines from stored weights?

Thanks
Guru

Are you running kvstore in dist_async or dist_sync mode?
When running in dist_async mode, a node (worker) going down should not impact training on the other workers, since workers don't tightly synchronize with each other. However, if you are running in dist_sync mode, the gradients for each mini-batch are aggregated from all workers before the weights are updated (on the server). So if even one worker is down, training halts on all the other workers.
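
For reference, the mode is just the string passed when creating the kvstore; a minimal sketch (the rest of your training setup stays the same):

```python
import mxnet as mx

# dist_sync: the server aggregates gradients from every worker before each
# weight update, so a single dead worker stalls the whole job.
kv = mx.kvstore.create('dist_sync')

# dist_async: each worker pushes gradients and pulls weights independently,
# so losing one worker does not block the others (at the cost of some staleness).
# kv = mx.kvstore.create('dist_async')
```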

Thanks for your response. I am currently running in dist_sync mode. If I restart the machine and start the MXNet program again, will the worker then reconnect with the scheduler and resume processing from where it stopped? Or do I have to restart the scheduler and the other workers?

I don’t think restarting a worker will result in it resuming from where it left off. This hasn’t been tested, so it’s hard to tell whether it works or ever worked. The best course of action is to restart everything. However, by saving checkpoints periodically you can avoid restarting training from scratch: when you (re)start training, simply check for the most recent checkpoint, load it if one exists, and continue training from there, as in the sketch below.
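
Something along these lines should work. This is only a sketch assuming a Module-based training loop; `get_resnet_symbol`, `get_train_iter`, and the checkpoint prefix are placeholders for your own code:

```python
import os
import mxnet as mx

prefix = 'resnet-checkpoint'  # hypothetical checkpoint file prefix

def most_recent_epoch(prefix):
    # Look for files like 'resnet-checkpoint-0010.params' and return the
    # highest epoch number found, or None if no checkpoint exists yet.
    epochs = [int(f[len(prefix) + 1:-len('.params')])
              for f in os.listdir('.')
              if f.startswith(prefix + '-') and f.endswith('.params')]
    return max(epochs) if epochs else None

kv = mx.kvstore.create('dist_sync')

sym = get_resnet_symbol()        # placeholder: however you build your network
train_iter = get_train_iter(kv)  # placeholder: your data iterator

begin_epoch, arg_params, aux_params = 0, None, None
last = most_recent_epoch(prefix)
if last is not None:
    # Resume from the newest checkpoint instead of training from scratch.
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, last)
    begin_epoch = last

mod = mx.mod.Module(sym, context=mx.gpu())

mod.fit(train_iter,
        kvstore=kv,
        arg_params=arg_params,
        aux_params=aux_params,
        begin_epoch=begin_epoch,
        num_epoch=120,
        # Save a checkpoint at the end of every epoch.
        epoch_end_callback=mx.callback.do_checkpoint(prefix))
```

Note that each restarted worker re-reads the same checkpoint, so all workers should restart from the same epoch to keep them consistent in dist_sync mode.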
