I am using our internally implemented parameter server. Can anyone give me an example of how to do distributed training with the Gluon API? Specifically:
- how to split and load data on different machines;
- how to compute gradients on the worker machines;
- how to gather the gradients from the workers and update the parameters on the master.
Hi @shuokay,
You can find a great tutorial for distributed training using Gluon here, and another here.
Although not Gluon specific, this video gives a good walkthrough of distributed training with MXNet, and another can be found here.
A fully working example of distributed training, used for image classification, can be found here.
You'll see the main ideas are:
- Creating a distributed key-value store with `mxnet.kv.create('dist')`
- Sampling a different batch of the data on each of the workers
- Using `split_and_load` to partition each worker's batch across that worker's devices (see the sketch below)
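Putting those three ideas together, here is a rough sketch of what a distributed Gluon training script can look like. To be clear, this is a minimal sketch and not the exact code from the tutorials above: the MNIST dataset, the single `Dense` layer, the `SplitSampler` helper, and the hyperparameters are all placeholders of my own. You would start it across machines with MXNet's `tools/launch.py` so that the scheduler, server, and worker processes exist before `mx.kv.create('dist')` is called.

```python
import random
import mxnet as mx
from mxnet import autograd, gluon

# Distributed key-value store; the parameter servers behind it hold the
# weights, and every worker pushes gradients to / pulls weights from them.
store = mx.kv.create('dist')

# Use all GPUs on this worker, or fall back to CPU.
ctx = [mx.gpu(i) for i in range(mx.context.num_gpus())] or [mx.cpu()]

per_worker_batch = 64

class SplitSampler(gluon.data.sampler.Sampler):
    """Gives each worker a disjoint, shuffled slice of the dataset."""
    def __init__(self, length, num_parts, part_index):
        self.part_len = length // num_parts
        self.start = self.part_len * part_index
    def __iter__(self):
        indices = list(range(self.start, self.start + self.part_len))
        random.shuffle(indices)
        return iter(indices)
    def __len__(self):
        return self.part_len

# Placeholder dataset: each worker only reads its own shard,
# selected by the worker's rank.
dataset = gluon.data.vision.MNIST(train=True).transform_first(
    lambda x: x.astype('float32') / 255)
train_loader = gluon.data.DataLoader(
    dataset, batch_size=per_worker_batch,
    sampler=SplitSampler(len(dataset), store.num_workers, store.rank))

net = gluon.nn.Dense(10)  # placeholder model
net.initialize(mx.init.Xavier(), ctx=ctx)

# Passing the kvstore to the Trainer is what makes the updates distributed.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.05}, kvstore=store)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

for epoch in range(5):
    for data, label in train_loader:
        # split_and_load partitions this worker's batch across its devices.
        data = gluon.utils.split_and_load(data, ctx)
        label = gluon.utils.split_and_load(label, ctx)
        with autograd.record():
            losses = [loss_fn(net(X), y) for X, y in zip(data, label)]
        for l in losses:
            l.backward()
        # step() pushes the gradients to the servers and pulls back
        # the updated parameters, so the workers stay in sync.
        trainer.step(per_worker_batch)
```

This answers your three questions in one place: the `SplitSampler` handles splitting data across machines, `autograd`/`backward` computes gradients on each worker, and `trainer.step` with a `dist` kvstore does the gather-and-update through the parameter servers, so you don't write the master-side aggregation yourself.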