One node fails but the other nodes hang in multi-node distributed training

We implemented multi-node training with MXNet on AWS Batch, using the ps-lite parameter server. Occasionally one node fails because of a run-time error or exception, but the other nodes keep running or waiting. Is there any way to terminate the whole multi-node training job when one node fails? In particular, is there any way for the failed node to signal or message the other nodes so that they terminate as well? Thanks.
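To make the desired behavior concrete, here is a minimal sketch of a per-node wrapper, not a working solution. It assumes each node launches its training script as a child process. The names `run_and_notify` and `notify_peers` are hypothetical and are not part of MXNet, ps-lite, or AWS Batch; `notify_peers` stands in for whatever mechanism could actually stop the other nodes, which is the part being asked about.

```python
import subprocess
import sys


def run_and_notify(cmd, notify_peers):
    """Run the training command; if it exits non-zero, call notify_peers.

    cmd          -- argument list for the training process (assumed layout)
    notify_peers -- hypothetical callback that would tell the other nodes,
                    or the job scheduler, to shut the whole job down
    """
    returncode = subprocess.call(cmd)
    if returncode != 0:
        # The training process crashed or raised an uncaught exception.
        # This is the point where the failed node would signal the others.
        notify_peers(returncode)
    return returncode


def log_failure(returncode):
    """Placeholder callback that only reports the failure on stderr."""
    print(f"training exited with code {returncode}", file=sys.stderr)
```

For example, `run_and_notify([sys.executable, "train.py"], log_failure)` would run a (hypothetical) `train.py` and invoke the callback only on failure. What the callback should do so that the remaining nodes actually stop is the open question.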