Hi,
I’ve got two questions about the distributed KV-store (mx.kv.create('dist_sync')) available in MXNet.
- The documentation states that pushes return immediately and are executed asynchronously. It then states that we can use _barrier() to sync all workers. This, however, is discussed in the context of asynchronous execution.
I do not fully understand what “syncing all workers” means here. Workers are not parameter servers (schedulers or servers); instead, they are the ones executing the control flow. So, if I’m not training a neural network but just using the parameter server (which is what I want to do for testing purposes), the worker is the one calling kv.push. How can the server then sync all workers? The server cannot push directly to the worker, can it? The worker can only pull from or push to the server. I imagine that calling _barrier() force-pulls at all clients, but suppose we do something like in the documentation:
>>> # push a list of keys.
>>> # single device
>>> keys = ['4', '5', '6']
>>> kv.push(keys, [mx.nd.ones(shape)]*len(keys))
>>> b = [mx.nd.zeros(shape)]*len(keys)
>>> kv.pull(keys, out=b)
>>> print(b[1].asnumpy())
[[ 1. 1. 1.]
 [ 1. 1. 1.]]
I do not see how the kvstore knows which variable to update on all workers when _barrier() is called.
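To make the question concrete, here is how I currently picture the roles, written as a toy model in plain Python rather than MXNet code. ToyServer, run and worker are made-up names, and threading.Barrier is only a stand-in for _barrier(). The assumption being illustrated is that a barrier makes workers wait for each other, while each worker still has to pull explicitly to see new values. I do not know whether this matches what MXNet actually does.

```python
import threading


class ToyServer:
    """Hypothetical stand-in for a parameter server: it only reacts to push/pull."""

    def __init__(self):
        self._store = {}
        self._lock = threading.Lock()

    def push(self, key, value):
        # Aggregate by summation (an assumption of this toy model).
        with self._lock:
            self._store[key] = self._store.get(key, 0) + value

    def pull(self, key):
        with self._lock:
            return self._store[key]


def run(num_workers=4):
    server = ToyServer()
    barrier = threading.Barrier(num_workers)  # workers wait for each other here
    results = [None] * num_workers

    def worker(rank):
        server.push("4", 1)               # every worker contributes 1
        barrier.wait()                    # blocks until all workers have pushed
        results[rank] = server.pull("4")  # the worker itself asks for the value

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run())
```

In this toy, the server never contacts a worker. The barrier only guarantees that every push has happened before any worker pulls, so with four workers each of them reads 4.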
Secondly, I wonder about push guarantees. Is there also a barrier that forces completion of all pending operations? Because again, _barrier() seems to be about workers, not servers. I want to benchmark parameter server primitives and therefore, I need to know that an operation is finished.
I found this webpage: mxnet: mxnet::KVStore Class Reference, which mentions that in a distributed KV-store we cannot use wait_to_read but should instead use a Wait(keys) method. However, I cannot find any KVStore::Wait() method in C++ or Python.
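The measurement problem I have in mind can be sketched without MXNet. In the toy below, async_push is a made-up function that hands the work to a background thread and returns a future at once, which is how I understand the documented behaviour of push. The explicit result() call stands in for whatever completion primitive the distributed KV-store offers, and the sleep is a placeholder for network and server time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)


def _slow_store(store, key, value, delay):
    time.sleep(delay)  # pretend this is network + server time
    store[key] = value
    return key


def async_push(store, key, value, delay=0.05):
    """Hypothetical asynchronous push: returns a future immediately."""
    return _executor.submit(_slow_store, store, key, value, delay)


def timed_push(store, key, value, delay=0.05):
    start = time.perf_counter()
    future = async_push(store, key, value, delay)
    enqueue_time = time.perf_counter() - start  # only measures the call returning
    future.result()                             # block until the operation is done
    total_time = time.perf_counter() - start    # measures the completed operation
    return enqueue_time, total_time


store = {}
enqueue_time, total_time = timed_push(store, "4", 1.0)
print("returned after %.4fs, finished after %.4fs" % (enqueue_time, total_time))
```

Without the blocking call, a benchmark would only record the first number, which says nothing about how long the push really took.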
- We can use set_updater locally, but that will have no effect on the aggregation of the workers. There is also the set_optimizer function, but that is on another level of abstraction. If I want to define my own update function like in the documentation (Distributed Key-Value Store — Apache MXNet documentation):
def update(key, input, stored):
    print("update on key: %d" % key)
    stored += input * 2
I can’t figure out how to use this in the distributed KV-store context, as kv._set_updater(update) only works for local KV-stores, while set_optimizer requires built-in optimizers like SGD.
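What I would like to achieve, sketched as a toy in plain Python: the server first sums the values pushed by all workers and then applies a user-defined update to the stored value. ToyUpdateServer and push_all are hypothetical names, the summation before the update is my assumption about how dist_sync aggregates, and plain lists stand in for NDArrays.

```python
def update(key, input, stored):
    # Same rule as in the documentation example, written for plain lists.
    for i, value in enumerate(input):
        stored[i] += value * 2


class ToyUpdateServer:
    """Hypothetical server that aggregates pushes, then calls a custom updater."""

    def __init__(self, updater):
        self._updater = updater
        self._store = {}

    def init(self, key, value):
        self._store[key] = list(value)

    def push_all(self, key, worker_values):
        # Sum element-wise over workers (assumed aggregation), then update once.
        merged = [sum(column) for column in zip(*worker_values)]
        self._updater(key, merged, self._store[key])

    def pull(self, key):
        return list(self._store[key])


server = ToyUpdateServer(update)
server.init(3, [1.0, 1.0])
server.push_all(3, [[1.0, 1.0], [1.0, 1.0]])  # two workers push ones
print(server.pull(3))
```

The point is that update runs on the server side, once per aggregated push, which is the behaviour I do not know how to get from the distributed KV-store.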
Thank you very much for your help!
Best,
Maximilian