How GPUs communicate with each other

yue.yang · July 12, 2018, 7:08pm

Hi, All:

I have a question about how GPUs communicate with each other.
For example, I have one CPU running ps as the server, and two GPUs (in one machine) running as workers.
Do these two GPUs communicate with each other directly, i.e. without going through the CPU?
If so, what method do they use to communicate? Which function calls, and at what level, do they use? Is it possible for me to identify those calls, like send() in socket communication?

I have the same questions for the case where the two GPUs are in two different machines.

Thanks,

thomelane · July 13, 2018, 1:17am

Hi @yue.yang,

The communication method used to synchronize the parameters depends on the type of KVStore you are using for your Trainer (if using the Gluon API). When it is set to local, all gradients are copied from GPU to CPU memory, and the weights are updated in CPU memory. Setting it to device avoids this: if multiple GPUs are being used, the gradients are copied from GPU to GPU (not via CPU).
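As a rough illustration of the two settings above, here is a minimal sketch of passing the KVStore type to a Gluon Trainer. The helper name build_trainer, the AGGREGATION_DEVICE table, the sgd optimizer and the learning rate are my own choices for the example; only the kvstore argument of gluon.Trainer and the local/device behaviour come from the description above.

```python
# Where each single-machine KVStore type aggregates gradients,
# following the description above (hypothetical lookup table).
AGGREGATION_DEVICE = {"local": "cpu", "device": "gpu"}


def build_trainer(params, kvstore="device", learning_rate=0.1):
    """Create a Gluon Trainer that uses the given single-machine KVStore type.

    `params` is expected to be the result of `net.collect_params()`.
    """
    if kvstore not in AGGREGATION_DEVICE:
        raise ValueError("expected one of %s, got %r"
                         % (sorted(AGGREGATION_DEVICE), kvstore))
    # Imported lazily so the name check above works without MXNet installed.
    from mxnet import gluon
    return gluon.Trainer(params, "sgd",
                         {"learning_rate": learning_rate},
                         kvstore=kvstore)
```

With a network initialized on several GPUs, this would be called as build_trainer(net.collect_params(), kvstore="device").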

One thing to note is that different machines have different GPU topologies, which affect the speed of GPU to GPU communication for specific GPU pairs. You can use the following command to inspect the topology of your machine and understand the connection interfaces:

nvidia-smi topo --matrix
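If you want to capture that output from a script (for example to log it alongside a training run), a small sketch along these lines should work. The function name gpu_topology is hypothetical, and it simply returns None when nvidia-smi is not on the PATH or the command fails.

```python
import shutil
import subprocess


def gpu_topology():
    """Return the text printed by `nvidia-smi topo --matrix`, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # No NVIDIA driver tools available on this machine.
    try:
        result = subprocess.run(["nvidia-smi", "topo", "--matrix"],
                                capture_output=True, text=True)
    except OSError:
        return None
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    print(gpu_topology() or "nvidia-smi is not available")
```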

When using multiple machines there won’t be direct GPU to GPU communication across different machines, but you can still get GPU to GPU communication within each machine if you set the KVStore to dist_device_sync.
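Putting the cases together, the choice of KVStore type could be sketched as below. The function pick_kvstore is a hypothetical helper, not part of MXNet, and the dist_sync fallback for machines without multiple GPUs is my assumption; local, device and dist_device_sync are used as described above.

```python
def pick_kvstore(num_machines, gpus_per_machine):
    """Suggest a KVStore type name for the given cluster shape."""
    if num_machines < 1 or gpus_per_machine < 0:
        raise ValueError("need at least one machine and a non-negative GPU count")
    if num_machines > 1:
        # Across machines there is no direct GPU to GPU path, but gradients
        # can still be aggregated on GPU within each machine.
        return "dist_device_sync" if gpus_per_machine > 1 else "dist_sync"
    # Single machine: aggregate on GPU only when there are several GPUs.
    return "device" if gpus_per_machine > 1 else "local"
```

For the setup in the question (one machine, two GPUs) this returns device, and for two machines with two GPUs each it returns dist_device_sync.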

yue.yang · July 13, 2018, 1:34am

Hi, thomelane:

Thank you very much for your quick and clear answer.

Just to follow up:

If we set the KVStore to device, the gradients are copied from GPU to GPU directly. What method is used for that communication? A direct GPU memory copy? Socket communication? Or something else?

Is there any API exposed that lets us trace the function calls and the communication among GPUs?

Thanks
