Hello everyone, I'm using MXNet 1.5.1 and CUDA 10.0 to do distributed training, with Horovod on top. The fc layer in my model is very large, so I am trying to apply model parallelism to just that layer. The previous work was done by others using the Module API. I implemented an allreduce CustomOp and inserted it into the symbol chain. The op looks like this:
```python
import mxnet as mx
import horovod.mxnet as hvd

class AllReduceOp(mx.operator.CustomOp):
    def __init__(self, average=True, name=None):
        # CustomOp kwargs arrive as strings, hence the string comparison
        self.average = bool(average == 'True' or average is True)
        self.name = name
        self._num_ranks = hvd.size()

    def forward(self, is_train, req, in_data, out_data, aux):
        x = in_data[0]
        name = self.name if self.name else 'hvd-no-name'
        # env.hvd_framework() is our own wrapper around the Horovod MXNet module
        y = env.hvd_framework().allreduce(x, average=self.average, name=name)
        self.assign(out_data[0], req[0], y.asnumpy())

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        grad = out_grad[0]
        if self.average:
            grad = grad / self._num_ranks
        self.assign(in_grad[0], req[0], grad)
```
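To be clear about the intent: the op should average across ranks in the forward pass and scale gradients by 1/num_ranks in backward. Here is a plain-Python stand-in for that math (no Horovod involved; `allreduce_average` below is just my illustration, not a Horovod API):

```python
def allreduce_average(tensors):
    """Element-wise mean across 'ranks' (stand-in for an averaging allreduce)."""
    n = len(tensors)
    return [sum(vals) / n for vals in zip(*tensors)]

# Forward: each of 2 "ranks" contributes a tensor; every rank receives the mean.
rank_inputs = [[2.0, 4.0], [6.0, 8.0]]
averaged = allreduce_average(rank_inputs)
print(averaged)  # [4.0, 6.0]

# Backward: d(mean)/d(x_i) = 1/n, so each rank scales the upstream gradient
# by 1/num_ranks -- the same division done in backward() above.
num_ranks = len(rank_inputs)
upstream = [1.0, 1.0]
local_grad = [g / num_ranks for g in upstream]
print(local_grad)  # [0.5, 0.5]
```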
When constructing the Module, I pass the GPU device as the value of the `group2ctxs` argument, then use `with mx.AttrScope(...)` to control which symbols run on the GPU device. However, once I change the last argument of the final `self.assign(...)` in `forward()` from `y.asnumpy()` to `mx.nd.array(y)`, I receive an error from Horovod:

```
what(): cudaEventSynchronize failed: an illegal memory access was encountered
```
I think this error comes from here: https://github.com/horovod/horovod/blob/v0.18.2/horovod/common/ops/cuda_operations.cc#L87
By the way, is there any way to tell whether the data for each symbol is in GPU memory when programming with the Module API? In Gluon this is easy, but for Module I cannot find a way to do it. Even though I pass the GPU context during Module construction, Nvidia Nsight Systems profiling still shows that my softmax + fc part is not computed on the GPU.