Best choice of kvstore parameter in fit methods

I would like to discuss the best choice for the kvstore parameter that is passed to the fit method of the Module API.

For single machine, single GPU:
If we don’t have GPU memory constraints, is it always faster to use device instead of local?

For single machine, multiple GPUs:
Again, if we don’t have GPU memory constraints, is device always better? The documentation says “When using a large number of GPUs, e.g. >=4, we suggest using device for better performance.”

For multiple machines, multiple GPUs:
For synchronous updates, which is better: dist_sync or dist_device_sync?
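
For reference, the setups above map to kvstore strings roughly as in the sketch below. The helper pick_kvstore and its aggregate_on_gpu flag are hypothetical names used only for illustration, not part of MXNet; the flag stands for exactly the choice being asked about (summing gradients on GPU versus on CPU).

```python
def pick_kvstore(num_machines, aggregate_on_gpu):
    """Return a kvstore string for synchronous training.

    Hypothetical helper, not an MXNet API. aggregate_on_gpu is the
    open question in this thread: aggregate gradients on GPU or CPU.
    """
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    if num_machines == 1:
        # Single machine: 'device' aggregates on GPU, 'local' on CPU.
        return "device" if aggregate_on_gpu else "local"
    # Multiple machines, synchronous updates.
    return "dist_device_sync" if aggregate_on_gpu else "dist_sync"

# Usage with the Module API (mod and train_iter are assumed to exist):
# mod.fit(train_iter, kvstore=pick_kvstore(1, aggregate_on_gpu=True))
```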


In my experience training ResNet, device is usually faster. However, Mu said they found that Inception-style networks tend to be faster with local/dist_sync.
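
Since the answer seems to depend on the network, one way to settle it for a given model is to time each option directly. Below is a minimal sketch, assuming a caller-supplied run_epoch(kvstore) function (a hypothetical name) that builds the module and calls fit for one epoch with the given kvstore; time_kvstores is also a hypothetical helper, not an MXNet function.

```python
import time

def time_kvstores(run_epoch, kvstores=("local", "device"), repeats=3):
    """Time run_epoch(kvstore) for each option.

    Returns a dict mapping each kvstore string to its best (lowest)
    wall-clock time in seconds over `repeats` runs.
    run_epoch is a hypothetical callable supplied by the caller, e.g.
    one that calls mod.fit(train_iter, kvstore=kvstore, num_epoch=1).
    """
    results = {}
    for kv in kvstores:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_epoch(kv)
            best = min(best, time.perf_counter() - start)
        results[kv] = best
    return results
```

For the multi-machine case the same idea applies with kvstores=("dist_sync", "dist_device_sync"), although each run then has to be launched across all workers.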