As far as I have seen, all distributed training with MXNet is performed on AWS. However, we have a cluster of Xeon E5 machines with 100G OPA, and we would like to try MXNet on this cluster. I found the document at https://mxnet.incubator.apache.org/tutorials/python/kvstore.html
which says that
''
Run on Multiple Machines
Based on parameter server, the updater runs on the server nodes. When the distributed version is ready, we will update this section.
''
Does that mean MXNet does not currently support distributed training on a custom cluster?
If so, is there a plan for when this will be implemented, or will it not be done at all?