How to train models with multiple GPUs in C++

mx.mod.Module provides a convenient high-level API for model training in Python. But for certain reasons I need to train my models in a pure C++ environment, and I am wondering whether it is also possible to use multiple GPU devices with the interfaces in cpp-package/include/mxnet-cpp.
Currently, I can only find the Executor in cpp-package/include/mxnet-cpp/executor.h (which supports single-GPU training).
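
For reference, the single-GPU pattern I am referring to (following the cpp-package examples) looks roughly like this; net, the "data"/"label" argument names, and the surrounding setup are placeholders for my actual code:

    #include "mxnet-cpp/MxNetCpp.h"
    #include <string>
    #include <vector>
    using namespace mxnet::cpp;

    // One single-GPU training step, following the pattern used in the
    // cpp-package examples. `exec` is assumed to come from
    // net.SimpleBind(Context::gpu(0), args), and the current batch is assumed
    // to have been copied into the bound "data"/"label" arrays already.
    void TrainStepSingleGpu(Symbol net, Executor *exec, Optimizer *opt) {
      exec->Forward(true);   // forward pass with is_train = true
      exec->Backward();      // gradients are written into exec->grad_arrays

      std::vector<std::string> arg_names = net.ListArguments();
      for (size_t i = 0; i < arg_names.size(); ++i) {
        if (arg_names[i] == "data" || arg_names[i] == "label") continue;
        opt->Update(static_cast<int>(i), exec->arg_arrays[i], exec->grad_arrays[i]);
      }
    }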

I am not a C++ binding expert, but looking through the API I don't see an obvious way of doing that out of the box either. For example, if you wanted to perform data parallelism (training a copy of the same model in parallel on each GPU, effectively allowing you to increase your overall batch size), you could proceed in the following way:

  • Initializing your model identically on each GPU
  • Splitting your training data evenly and copying one slice to each GPU
  • Running the forward pass on each GPU's slice
  • Computing the gradients with a backward pass on each GPU
  • Aggregating the gradients and updating the model weights on each GPU

This is effectively what the Module API does; a rough sketch of these steps in C++ follows below.
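
Here is how those steps could be wired together manually with the existing cpp-package primitives. This is not an official multi-GPU API, just a sketch under some assumptions: two GPUs, per-device executors already created with SimpleBind on Context::gpu(0)/gpu(1) from identically initialized weights, and "data"/"label" as the input argument names.

    #include "mxnet-cpp/MxNetCpp.h"
    #include <map>
    #include <string>
    #include <vector>
    using namespace mxnet::cpp;

    // One data-parallel training step over two GPUs (manual wiring, no Module).
    // exec0/exec1 were created with net.SimpleBind(Context::gpu(0), args0) and
    // net.SimpleBind(Context::gpu(1), args1), with args1 initialized as a copy
    // of args0 so both replicas start from identical weights.
    void TrainStepTwoGpus(Symbol net, Optimizer *opt,
                          const NDArray &host_data, const NDArray &host_label,
                          Executor *exec0, Executor *exec1,
                          std::map<std::string, NDArray> &args0,
                          std::map<std::string, NDArray> &args1,
                          int per_gpu_batch) {
      // 1) Split the host batch and copy one slice to each GPU.
      host_data.Slice(0, per_gpu_batch).CopyTo(&args0["data"]);
      host_data.Slice(per_gpu_batch, 2 * per_gpu_batch).CopyTo(&args1["data"]);
      host_label.Slice(0, per_gpu_batch).CopyTo(&args0["label"]);
      host_label.Slice(per_gpu_batch, 2 * per_gpu_batch).CopyTo(&args1["label"]);

      // 2) Forward and backward pass on each replica.
      exec0->Forward(true);
      exec1->Forward(true);
      exec0->Backward();
      exec1->Backward();

      // 3) Aggregate gradients on GPU 0, update the weights there, then
      //    mirror the updated weights back to GPU 1.
      std::vector<std::string> arg_names = net.ListArguments();
      for (size_t i = 0; i < arg_names.size(); ++i) {
        if (arg_names[i] == "data" || arg_names[i] == "label") continue;
        NDArray grad_from_gpu1 = exec1->grad_arrays[i].Copy(Context::gpu(0));
        // Elementwise sum; if your loss is averaged over each per-device batch,
        // you may want to average the gradients (or scale the learning rate)
        // to match single-GPU behavior.
        NDArray summed = exec0->grad_arrays[i] + grad_from_gpu1;
        opt->Update(static_cast<int>(i), exec0->arg_arrays[i], summed);
        exec0->arg_arrays[i].CopyTo(&exec1->arg_arrays[i]);  // keep replicas in sync
      }
      NDArray::WaitAll();
    }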

@ThomasDelteil :
Thank you for this information. I wanted to ask the same question as @nicklhy.
I tried to do as you suggested: I duplicated the network with the weights initialized to the same values (so I get the same grad_arrays values when computing on the same training data batch). I then concatenated the grad_arrays values and fed them back into the parameter updater with opt->Update(i, exec1->arg_arrays[i], combinedGradArray1[i]);.

But unfortunately, this does not train properly on two GPUs, even though the base model trains fine on a single GPU. What could be going wrong?

I got a better result by summing the grad_arrays values instead of concatenating them. But that raises a question: how should different batch sizes be handled?
And still, why does concatenating the gradients not work?
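
A likely explanation, offered as a sketch of the reasoning rather than a verified diagnosis: each entry of grad_arrays already has the same shape as its corresponding weight, because the executor reduces over its own batch during Backward(). Concatenating two of them therefore produces an array whose shape no longer matches the weight that opt->Update expects, whereas summing corresponds to accumulating the gradient over the combined batch. For uneven batch sizes, one option is to weight each device's gradient by its share of the total batch before combining (assuming each per-device loss is a mean over its own slice). The helper below illustrates this idea; the executor list, per-device batch sizes, and target context are placeholder parameters, not part of any existing API:

    #include "mxnet-cpp/MxNetCpp.h"
    #include <vector>
    using namespace mxnet::cpp;

    // Combine the i-th gradient from several executors into a single gradient
    // on `target_ctx`, weighting each device by its share of the total batch.
    // This matches "averaging over the combined batch" when each per-device
    // loss is a mean over its own slice; use a plain sum if the loss is a sum.
    NDArray CombineGrad(const std::vector<Executor *> &execs,
                        const std::vector<int> &batch_sizes,  // per-device sizes
                        size_t i, const Context &target_ctx) {
      int total = 0;
      for (int b : batch_sizes) total += b;

      // Start from the first device's gradient, scaled by its batch share.
      NDArray combined = execs[0]->grad_arrays[i].Copy(target_ctx) *
                         (static_cast<float>(batch_sizes[0]) / total);

      // Accumulate the remaining devices, each scaled by its own share.
      for (size_t d = 1; d < execs.size(); ++d) {
        NDArray g = execs[d]->grad_arrays[i].Copy(target_ctx);
        combined = combined + g * (static_cast<float>(batch_sizes[d]) / total);
      }
      return combined;  // same shape as the weight, usable with opt->Update()
    }

With equal per-device batch sizes this reduces to a plain average of the grad_arrays values; either way, keep the learning rate consistent with what you would use for single-GPU training on the same effective batch size.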