One node fails but other nodes hang in multi-node distributed training

joshua · November 18, 2020, 6:49am

We implemented multi-node training with MXNet on AWS Batch, using the ps-lite server. Occasionally one node fails due to a run-time error or exception, but the other nodes keep running or waiting. Is there any way to terminate the whole multi-node training job when one node fails? Specifically, is there any way for the failed node to signal the other nodes so that they terminate as well? Thanks.
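
To make the desired behavior concrete, here is a minimal sketch of one possible approach. It is not an MXNet or ps-lite feature: it assumes every node can read and write a per-job shared directory (for example a shared filesystem mount), and all names in it (`report_failure`, `watch_for_failures`, `run_node`, the `failed-rank-*` marker files) are hypothetical. A failing node writes a marker file, and a watchdog thread on every other node polls for markers and exits the process when one appears.

```python
import os
import threading
from pathlib import Path


def report_failure(shared_dir, rank, error):
    """Write a marker file so other nodes can see that this rank failed."""
    marker = Path(shared_dir) / f"failed-rank-{rank}"
    marker.write_text(repr(error))
    return marker


def watch_for_failures(shared_dir, on_failure, poll_seconds=1.0, stop_event=None):
    """Poll shared_dir in a daemon thread and call on_failure(marker) once."""
    stop_event = stop_event or threading.Event()

    def loop():
        while not stop_event.is_set():
            markers = sorted(Path(shared_dir).glob("failed-rank-*"))
            if markers:
                on_failure(markers[0])
                return
            # Sleep, but wake up early if the watcher is asked to stop.
            stop_event.wait(poll_seconds)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


def run_node(shared_dir, rank, train_fn, on_failure=None):
    """Run train_fn; if it raises, publish a marker file and re-raise."""
    # Assumed default reaction: hard-exit with a non-zero status, because a
    # node blocked in a push/pull call may never return to Python code.
    on_failure = on_failure or (lambda marker: os._exit(1))
    stop = threading.Event()
    watch_for_failures(shared_dir, on_failure, stop_event=stop)
    try:
        return train_fn()
    except Exception as error:
        stop.set()  # stop the local watcher before publishing our own marker
        report_failure(shared_dir, rank, error)
        raise
    finally:
        stop.set()
```

Each node would call `run_node(shared_dir, rank, train_fn)` around its training entry point, where `train_fn` stands for whatever function starts training. The sketch assumes the shared directory is fresh for each job, since a stale marker from an earlier run would stop the new job immediately.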
