Forward-backward pass being a bottleneck in multi-GPU training

Copying @ThomasDelteil’s answer here for points 2 and 3 for greater visibility.

“I wanted to take the time to run some experiments to give you more data points, but what I would recommend is trying Horovod on a single node. Horovod uses NCCL for GPU-to-GPU communication, and each GPU runs its own process. For Horovod, this discussion might be helpful for you: Distributed training questions
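
For reference, here is a rough sketch of what that single-node Horovod setup can look like with MXNet Gluon; this is my own illustration, not something from Thomas's answer. The network, the data loader (`train_loader`), and the learning-rate scaling are placeholders you would swap for your own. The key pieces are pinning each process to `mx.gpu(hvd.local_rank())`, wrapping the optimizer in `hvd.DistributedTrainer` so gradients are averaged over NCCL, and broadcasting the initial parameters so all workers start from the same weights.

```python
# Minimal single-node Horovod + MXNet Gluon sketch (one process per GPU).
# Launch with, for example:  horovodrun -np 4 python train.py
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import autograd, gluon

hvd.init()                                  # one process per GPU
ctx = mx.gpu(hvd.local_rank())              # pin this process to its own GPU

# Placeholder model; replace with your actual network.
net = gluon.nn.Dense(10)
net.initialize(mx.init.Xavier(), ctx=ctx)
params = net.collect_params()

# DistributedTrainer wraps the optimizer so gradient aggregation happens
# via Horovod's NCCL allreduce rather than through a kvstore.
trainer = hvd.DistributedTrainer(params, 'sgd',
                                 {'learning_rate': 0.01 * hvd.size()})

# Ensure every worker starts from the same initial weights.
hvd.broadcast_parameters(params, root_rank=0)

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
# train_loader is a hypothetical per-worker data iterator (each process
# should see its own shard of the data).
for data, label in train_loader:
    data, label = data.as_in_context(ctx), label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
```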