Hello everyone, I'm using MXNet 1.5.1 and CUDA 10.0 to do distributed training, with Horovod on top. The fc layer in my model is very large, so I am trying to apply model parallelism to just that layer. The previous work was done by others using the Module API. I implemented an allreduce CustomOp and inserted it into the symbol chain. The op looks like this:
```python
import mxnet as mx
import horovod.mxnet as hvd

class AllReduceOp(mx.operator.CustomOp):
    def __init__(self, average=True, name=None):
        # CustomOp kwargs arrive as strings, hence the string comparison
        self.average = bool(average == 'True' or average is True)
        self.name = name
        self._num_ranks = hvd.size()

    def forward(self, is_train, req, in_data, out_data, aux):
        x = in_data[0]
        name = self.name if self.name else 'hvd-no-name'
        # env.hvd_framework() is our own wrapper around the Horovod MXNet module
        y = env.hvd_framework().allreduce(x, average=self.average, name=name)
        self.assign(out_data[0], req[0], y.asnumpy())

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        grad = out_grad[0]
        if self.average:
            grad = grad / self._num_ranks
        self.assign(in_grad[0], req[0], grad)
```
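To be clear about the intent: the op should average across ranks in the forward pass and scale gradients by 1/num_ranks in backward. Here is a plain-Python stand-in for that math (no Horovod involved; `allreduce_average` below is just my illustration, not a Horovod API):

```python
def allreduce_average(tensors):
    """Element-wise mean across 'ranks' (stand-in for an averaging allreduce)."""
    n = len(tensors)
    return [sum(vals) / n for vals in zip(*tensors)]

# Forward: each of 2 "ranks" contributes a tensor; every rank receives the mean.
rank_inputs = [[2.0, 4.0], [6.0, 8.0]]
averaged = allreduce_average(rank_inputs)
print(averaged)  # [4.0, 6.0]

# Backward: d(mean)/d(x_i) = 1/n, so each rank scales the upstream gradient
# by 1/num_ranks -- the same division done in backward() above.
num_ranks = len(rank_inputs)
upstream = [1.0, 1.0]
local_grad = [g / num_ranks for g in upstream]
print(local_grad)  # [0.5, 0.5]
```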
When constructing the Module, I pass the GPU device as the value of the `group2ctxs` argument, then use `with mx.AttrScope(...)` to control which symbols run on the GPU device. However, once I change the last argument of the final `self.assign(...)` in `forward()` from `y.asnumpy()` to `mx.nd.array(y)`, I receive an error from Horovod:

```
what(): cudaEventSynchronize failed: an illegal memory access was encountered
```
I think this error comes from here: https://github.com/horovod/horovod/blob/v0.18.2/horovod/common/ops/cuda_operations.cc#L87
By the way, is there any way to tell whether the data for each symbol is in GPU memory when programming with the Module API? In Gluon this is easy, but for Module I cannot find a way to do it. Even though I pass the GPU context during Module construction, Nvidia Nsight Systems profiling still shows that my softmax + fc part is not computed on the GPU.