Synchronization and Update Function of Parameter Server (distributed KV-Store)

MaxiBoether · January 20, 2021, 9:08am

Hi,

I’ve got two questions about the distributed KV-store (mx.kv.create('dist_sync')) available in MXNet.

The documentation states that pushes return immediatly and are executed asynchronously. Then it states we can use _barrier() to sync all workers. This however is discussed in the context of asynchronous execution.

I do not fully understand what “syncing all workers” means here. Workers are not parameter servers (schedulers or servers); instead, they are the one executing the control flow. So, if I’m not training a neural network, but just using the parameter server (which is what I want to do for testing purposes), the worker is the one calling kv.push. How can the server then sync all workers? The server cannot push directly to the worker, can it? The worker can only pull from or push to the server. I imagine that calling _barrier() force-pulls at all clients, but imagine we do something like in the documentation:

        >>> # push a list of keys.
        >>> # single device
        >>> keys = ['4', '5', '6']
        >>> kv.push(keys, [mx.nd.ones(shape)]*len(keys))
        >>> b = [mx.nd.zeros(shape)]*len(keys)
        >>> kv.pull(keys, out=b)
        >>> print b[1].asnumpy()
        [[ 1.  1.  1.]
        [ 1.  1.  1.]]

I do not see how the kvstore knows which variable to update on all workers when _barrier() is called.

Secondly, I wonder about push guarantees. Is there also a barrier that forces completion of all pending operations? Because again, _barrier() seems to be about workers, not servers. I want to benchmark parameter server primitives and therefore, I need to know that an operation is finished.

I found this webpage: mxnet: mxnet::KVStore Class Reference

which mentions that in a distributed KV, we cannot use wait_to_read, but instead a Wait(keys) method, but I cannot find any KVStore::Wait() method in C++ or Python.

We can use set_updater locally, but that will have no effect on the aggregation of the workers. However, there is the set_optimizer function, but that is on another level of abstraction. If I want to define my own update function like in the documentation (Distributed Key-Value Store — Apache MXNet documentation):

def update(key, input, stored):
    print("update on key: %d" % key)
    stored += input * 2

I can’t figure out how to use this in the distributed KV-store context, as kv._set_updater(update) only works for local KV-stores, but set_optimizer required inbuilt optimizers like SGD.

Thank you very much for your help!

Best,
Maximilian

Topic		Replies	Views
Is there any communication between parameter servers?	1	374	September 9, 2019
Question about Distribution Training using launcher.py Discussion multi-host , unix-based	3	468	February 19, 2019
Kvstore special for distributed fc Discussion	0	350	August 4, 2018
Dist training with geographically very distant servers Performance	2	402	September 5, 2019
Issue: How to store a large distributed training model with KVStore？ MXNet Model Server	1	552	June 13, 2018

Synchronization and Update Function of Parameter Server (distributed KV-Store)

Related Topics