Distributed training questions

The problem appears only when I try to use different contexts for each network. If I fit everything on the same GPU, everything works fine.

Dear all,

Does anyone have working code for restoring optimizer states with Horovod? My first test (reading the same Trainer states from all nodes) failed.

@feevos, could you be more specific with your problem? Based on your description, it sounds similar to this issue: https://github.com/apache/incubator-mxnet/issues/17357

The solution in Horovod is to always store the states using rank 0. Please let me know if that makes sense to you.
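
A minimal sketch of the save side (the toy network, the optimizer and the file name are placeholders, not taken from your code):

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()

# Toy network and trainer, only to illustrate the pattern
net = mx.gluon.nn.Dense(2, in_units=10)
net.initialize(mx.init.Xavier(), ctx=mx.cpu())
trainer = hvd.DistributedTrainer(net.collect_params(), 'adam')

# ... training loop ...

# Only rank 0 writes the optimizer states to disk, so there is a single
# consistent checkpoint to restore from later.
if hvd.rank() == 0:
    trainer.save_states('trainer.states')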


Hi @apeforest, thank you very much for your prompt reply. Extremely appreciated!

Yes, I am saving states only from the rank 0 node. So are you suggesting that it is enough to load the Trainer states only on rank 0, with no broadcasting? I will give it a try ASAP.

What I tried in the past was similar to what I do when loading saved models: I load the parameters (weights) on all workers (instead of loading on rank 0 and then broadcasting from there). On our HPC system this works well and doesn't give me any problems. From my code when loading weights:

 # If flname_load is empty, initialize from scratch, else load params.
 if self.config[C.C_FLNM_LOAD_PARAMS] is None:
     self.mynet.initialize(self.config[C.C_INITIALIZER], ctx=self.config[C.C_CTX])
 # This loads parameters on all workers
 else:
     self._load_params(self.config[C.C_FLNM_LOAD_PARAMS])

 # @@@@@@@@@@@@ HVD BROADCAST ALL PARAMETERS TO ALL WORKERS @@@@@@@@@@@@@@@
 # This takes place ONLY if the parameters were randomly initialized
 params = self.mynet.collect_params()
 if self.config[C.C_FLNM_LOAD_PARAMS] is None and params is not None:
     hvd.broadcast_parameters(params, root_rank=0)
 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
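
For context, _load_params is only a thin wrapper; roughly (this is a sketch of its shape, not the exact code) every worker reads the same parameter file directly via Gluon's load_parameters:

 def _load_params(self, flname):
     # Every worker reads the same checkpoint, so no broadcast is needed afterwards
     self.mynet.load_parameters(flname, ctx=self.config[C.C_CTX])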

When I tried the same thing with the Trainer object, I got a CUDA malloc error. This failed:

 # Initialize the Trainer object - if you want to load previously saved states.
 self.opt = mx.optimizer.create(self.config[C.C_OPTIMIZER_NAME], **self.config[C.C_TRAINER_PARAMS])
 self.trainer = hvd.DistributedTrainer(params, self.opt)
 # This routine loads trainer states on all nodes identically  *** FAILS ***
 if self.config[C.C_FLNM_LOAD_STATES] is not None:
     self.trainer.load_states(self.config[C.C_FLNM_LOAD_STATES])

I am afraid I have a limited understanding of how the trainer works. So what I tried next was to load the trainer states on rank 0 and then broadcast the Trainer's parameters, trying to draw parallels between PyTorch and MXNet. I saw this example from PyTorch but could not perform similar actions with MXNet. It seems that for MXNet there exists only the function broadcast_parameters, which accepts a ParameterDict as argument. Searching the source code of optimizer and Trainer, I couldn't find a function (or a member variable) that exposes this information, to the best of my limited knowledge.
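
For reference, the PyTorch pattern I have in mind looks roughly like this (toy model and optimizer, only to show the broadcast calls); horovod.torch offers broadcast_optimizer_state, while horovod.mxnet only seems to expose broadcast_parameters:

import torch
import horovod.torch as hvd

hvd.init()

# Toy model and optimizer, standing in for the real ones
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Horovod for PyTorch can broadcast both the weights and the optimizer state
# from rank 0 to all workers ...
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ... before wrapping the optimizer for distributed gradient averaging
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())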

Any comments/suggestions most welcome. Thank you! I’ll try your suggestion and get back on this.

All the best

So, modifying my code to load the trainer states only when hvd.rank() == 0 makes the code run fine. I will need to experiment with a real run and see the effect on performance (a 24-node job will take a while to launch…), but this works in debugging mode:

# @@@@@@@@@@@@@@@@@@@@@ RESTORING OPTIMIZER STATES @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
if self.config[C.C_FLNM_LOAD_STATES] is not None and hvd.rank() == 0:
    self.trainer.load_states(self.config[C.C_FLNM_LOAD_STATES])
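
For anyone finding this later, here is a minimal sketch of how the pieces fit together in my setup (the toy Dense network, the adam optimizer and the file name stand in for my config values):

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())

# Toy network standing in for the real model
net = mx.gluon.nn.Dense(2, in_units=10)
net.initialize(mx.init.Xavier(), ctx=ctx)
params = net.collect_params()

# Broadcast the (randomly initialized) weights from rank 0; when restoring a
# run I instead load the saved weights on every worker, as above.
hvd.broadcast_parameters(params, root_rank=0)

# Create the distributed trainer, then restore optimizer states on rank 0 only
opt = mx.optimizer.create('adam', learning_rate=1e-3)
trainer = hvd.DistributedTrainer(params, opt)
if hvd.rank() == 0:
    trainer.load_states('trainer.states')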

Again, thank you for all the help!


And the performance doesn't change: training continues from where it left off. So it's looking good, thank you!!!



Hi feevos,
how can I contact you? I have an issue running distributed training with AutoGluon. I have also posted the topic here:
Post

Hi @hariram, apologies for the late reply, I just saw your message. I am not familiar with AutoGluon, so I cannot help there, but I am happy to answer general questions about distributed training where I can. I am using Horovod on custom HPC resources (a homogeneous system).

Kind regards

Thanks @feevos
The issue related to distributed training is now resolved,
but my other main concern is here: Autogluon simple program failing with distributed training - #8 by hariram

(the last-but-one post, which I think has no implementation yet.)