Distributed training questions

The problem appears only when I try to use different contexts for each network. If I fit everything on the same GPU, everything works fine.

Dear all,

Does anyone have working code for restoring optimizer states with Horovod? My first test (reading the same Trainer states from all nodes) failed.

@feevos, could you be more specific with your problem? Based on your description, it sounds similar to this issue: https://github.com/apache/incubator-mxnet/issues/17357

The solution in Horovod is to always store the states using rank 0. Please let me know if that makes sense to you.
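
A minimal sketch of the save side (the toy network, the optimizer and the file name are placeholders, not taken from your code):

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()

# Toy network and trainer, only to illustrate the pattern
net = mx.gluon.nn.Dense(2, in_units=10)
net.initialize(mx.init.Xavier(), ctx=mx.cpu())
trainer = hvd.DistributedTrainer(net.collect_params(), 'adam')

# ... training loop ...

# Only rank 0 writes the optimizer states to disk, so there is a single
# consistent checkpoint to restore from later.
if hvd.rank() == 0:
    trainer.save_states('trainer.states')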


Hi @apeforest, thank you very much for your prompt reply. Extremely appreciated!

Yes, I am saving states only from the rank 0 node. So are you suggesting that it is enough to load the Trainer states only on rank 0, with no broadcasting? I will give it a try ASAP.

What I tried in the past was similar to what I do when loading saved models: I load the parameters (weights) on all workers (instead of loading on rank 0 and then broadcasting from there). On our HPC system this works well and doesn't give me any problems. From my code when loading weights:

 # If flname_load is empty, initialize from scratch, else load params.
 if self.config[C.C_FLNM_LOAD_PARAMS] is None:
     self.mynet.initialize(self.config[C.C_INITIALIZER], ctx=self.config[C.C_CTX])
 # This loads parameters on all workers
 else:
     self._load_params(self.config[C.C_FLNM_LOAD_PARAMS])

 # @@@@@@@@@@@@ HVD BROADCAST ALL PARAMETERS TO ALL WORKERS @@@@@@@@@@@@@@@
 # This takes place ONLY if the parameters were randomly initialized
 params = self.mynet.collect_params()
 if self.config[C.C_FLNM_LOAD_PARAMS] is None and params is not None:
     hvd.broadcast_parameters(params, root_rank=0)
 # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
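
For context, _load_params is only a thin wrapper; roughly (this is a sketch of its shape, not the exact code) every worker reads the same parameter file directly via Gluon's load_parameters:

 def _load_params(self, flname):
     # Every worker reads the same checkpoint, so no broadcast is needed afterwards
     self.mynet.load_parameters(flname, ctx=self.config[C.C_CTX])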

When I tried the same thing with the Trainer object, I got a CUDA malloc error. This failed:

 # Initialize the Trainer object - if you want to load previously saved states.
 self.opt = mx.optimizer.create(self.config[C.C_OPTIMIZER_NAME], **self.config[C.C_TRAINER_PARAMS])
 self.trainer = hvd.DistributedTrainer(params, self.opt)
 # This routine loads trainer states on all nodes identically  *** FAILS ***
 if self.config[C.C_FLNM_LOAD_STATES] is not None:
     self.trainer.load_states(self.config[C.C_FLNM_LOAD_STATES])

I am afraid I have a limited understanding of how the trainer works. So what I tried next was to load the trainer states on rank 0 and then broadcast the Trainer's parameters, trying to draw parallels between PyTorch and MXNet. I saw this example from PyTorch but could not perform similar actions with MXNet. It seems that for MXNet there exists only the function broadcast_parameters, which accepts a ParameterDict as argument. Searching the source code of optimizer and Trainer, I couldn't find a function (or a member variable) that exposes this information, to the best of my limited knowledge.
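
For reference, the PyTorch pattern I have in mind looks roughly like this (toy model and optimizer, only to show the broadcast calls); horovod.torch offers broadcast_optimizer_state, while horovod.mxnet only seems to expose broadcast_parameters:

import torch
import horovod.torch as hvd

hvd.init()

# Toy model and optimizer, standing in for the real ones
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Horovod for PyTorch can broadcast both the weights and the optimizer state
# from rank 0 to all workers ...
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ... before wrapping the optimizer for distributed gradient averaging
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())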

Any comments/suggestions most welcome. Thank you! I’ll try your suggestion and get back on this.

All the best

So, modifying my code to load the trainer states only when hvd.rank() == 0 makes the code run fine. I will need to experiment with a real run and see the effect on performance (a 24-node job will take a while to launch…), but this works in debugging mode:

# @@@@@@@@@@@@@@@@@@@@@ RESTORING OPTIMIZER STATES @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
if self.config[C.C_FLNM_LOAD_STATES] is not None and hvd.rank() == 0:
    self.trainer.load_states(self.config[C.C_FLNM_LOAD_STATES])
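
For anyone finding this later, here is a minimal sketch of how the pieces fit together in my setup (the toy Dense network, the adam optimizer and the file name stand in for my config values):

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())

# Toy network standing in for the real model
net = mx.gluon.nn.Dense(2, in_units=10)
net.initialize(mx.init.Xavier(), ctx=ctx)
params = net.collect_params()

# Broadcast the (randomly initialized) weights from rank 0; when restoring a
# run I instead load the saved weights on every worker, as above.
hvd.broadcast_parameters(params, root_rank=0)

# Create the distributed trainer, then restore optimizer states on rank 0 only
opt = mx.optimizer.create('adam', learning_rate=1e-3)
trainer = hvd.DistributedTrainer(params, opt)
if hvd.rank() == 0:
    trainer.load_states('trainer.states')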

Again, thank you for all the help!


And the performance doesn't change: training continues from where it left off. So it's looking good, thank you!!!



Hi feevos,
how can I contact you? I have an issue running distributed training with AutoGluon. I have also posted the topic here:
Post

Hi @hariram, apologies for the late reply, I just saw your message. I am not familiar with AutoGluon, so I cannot help there, but I am happy to answer general questions about distributed training where I can. I am using Horovod on custom HPC resources (a homogeneous system).

Kind regards

Thanks @feevos
The issue related to distributed training is now resolved,
but my other main concern is here: Autogluon simple program failing with distributed training - #8 by hariram

(the last-but-one post, which I think has no implementation yet.)