How GPUs communicate with each other

Hi, All:

I have a question about how GPUs communicate with each other.
For example, I have one CPU running as the parameter server (ps), and two GPUs (in one machine) running as workers.
Do these two GPUs communicate with each other directly, i.e. without going through the CPU?
If so, what method do they use to communicate? Which function calls, and at what level, do they use? Is it possible for me to identify those function calls, like send() in socket communication?

The same questions apply to the case where the two GPUs are installed in two different machines.


Hi @yue.yang,

Communication methods for synchronizing the parameters depend on the type of KVStore you are using for your Trainer (if using the Gluon API). When it is set to local, all gradients are copied from GPU to CPU memory, and the weights are updated in CPU memory. Setting it to device avoids this: if multiple GPUs are being used, the gradients are copied from GPU to GPU (not via the CPU).

One thing to note is that different machines have different GPU topologies, which affect the speed of GPU to GPU communication for specific GPU pairs. You can use the following command to inspect the topology of your machine and understand the connection interfaces:

nvidia-smi topo --matrix
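
If you want to read that matrix programmatically, here is a minimal Python sketch. The helper names (read_topology_text, parse_gpu_links, describe_link) and the SAMPLE text are my own illustrations, not part of MXNet or nvidia-smi, and the parser assumes the GPU columns come first in each row, which may not hold for every driver version.

```python
import shutil
import subprocess

# Hand-written sample in the general shape of `nvidia-smi topo --matrix`
# output. It is illustrative only; real output differs per machine and driver.
SAMPLE = """\
\tGPU0\tGPU1\tCPU Affinity
GPU0\t X \tNV1\t0-15
GPU1\tNV1\t X \t0-15
"""


def read_topology_text():
    """Return live nvidia-smi output if the tool is available, else SAMPLE."""
    if shutil.which("nvidia-smi") is None:
        return SAMPLE
    result = subprocess.run(
        ["nvidia-smi", "topo", "--matrix"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout if result.returncode == 0 else SAMPLE


def parse_gpu_links(text):
    """Map each GPU pair (row name, column name) to its link code."""
    # Only data rows start with "GPU"; the header row starts with whitespace.
    rows = [line.split() for line in text.splitlines() if line.startswith("GPU")]
    names = [row[0] for row in rows]
    links = {}
    for i, row in enumerate(rows):
        # Assumption: the first len(names) cells after the row name are GPU columns.
        cells = row[1:1 + len(names)]
        for j, cell in enumerate(cells):
            if i < j:  # keep each unordered pair once
                links[(names[i], names[j])] = cell
    return links


def describe_link(code):
    """Rough reading of the codes from the legend nvidia-smi prints."""
    if code.startswith("NV"):
        return "NVLink"
    if code in ("PIX", "PXB", "PHB"):
        return "PCIe"
    if code in ("NODE", "SYS"):
        return "PCIe plus CPU/NUMA interconnect"
    return "unknown"


if __name__ == "__main__":
    for pair, code in parse_gpu_links(read_topology_text()).items():
        print(pair, code, describe_link(code))
```

Pairs reported as NV# are connected by NVLink, while SYS or NODE pairs have to cross the CPU interconnect, so they are typically the slower ones.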

When using multiple machines there won’t be direct GPU to GPU communication across different machines, but you can still get GPU to GPU communication on the same machine if you set the KVStore to dist_device_sync.
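
To make the KVStore choice concrete, here is a small Python sketch. choose_kvstore and make_trainer are hypothetical helpers I made up for illustration; only the KVStore type strings and the kvstore argument of gluon.Trainer come from MXNet. The mapping for the single-GPU cases is my assumption, so adjust it to your setup.

```python
def choose_kvstore(num_machines, gpus_per_machine):
    """Pick a KVStore type string for a given cluster shape (illustrative)."""
    if num_machines < 1 or gpus_per_machine < 0:
        raise ValueError("need at least one machine and a non-negative GPU count")
    multi_gpu = gpus_per_machine > 1
    if num_machines > 1:
        # Across machines there is no direct GPU to GPU path; dist_device_sync
        # still keeps the aggregation on GPU within each machine.
        return "dist_device_sync" if multi_gpu else "dist_sync"
    # Single machine: device aggregates gradients on GPU, local on CPU.
    return "device" if multi_gpu else "local"


def make_trainer(net, kvstore, learning_rate=0.1):
    """Build a Gluon Trainer with the chosen KVStore type."""
    # Imported lazily so choose_kvstore can be used without MXNet installed.
    from mxnet import gluon
    return gluon.Trainer(
        net.collect_params(), "sgd",
        {"learning_rate": learning_rate},
        kvstore=kvstore,
    )


if __name__ == "__main__":
    print(choose_kvstore(1, 2))
    print(choose_kvstore(2, 2))
```

For the setup in the question (one machine, two GPUs) this picks device; with the two GPUs split over two machines it falls back to dist_sync, since each machine then has only one GPU.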


Hi, thomelane:

Thank you very much for your quick and clear answer.

Just to follow up:

If we set the KVStore to device, then the gradients would be copied from GPU to GPU directly. What method is used for that communication? A direct GPU memory copy? Socket communication? Something else?

Is there any API exposed for us to trace the function calls and communication among GPUs?