Dataloader with num_workers > 0 crashes

My DataLoader for an image-based dataset often crashes when num_workers > 0, apparently because of Python multiprocessing. I get the following error:

IOError: [Errno 104] Connection reset by peer

With num_workers = 0 (the default) I have no issues, other than training being very slow. Is this issue related to OpenCV threading? I am using Python 2.7.

What’s the version of MXNet you’re using?

I am using 1.3 (master branch)

Are you using the mxnet.image library or OpenCV directly? If you are using OpenCV directly, does the problem go away if you only use mxnet.image calls?

I am using the following call:

data = mx.image.imread(image_path, flag=1)

Will reducing num_workers resolve the issue?

Even with a reduced num_workers it hangs while receiving the data. My dataset returns a tuple of 4 NDArrays; maybe pickling is slow?
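
One way to narrow this down is to pull samples in the main process first, so that an exception raised inside the dataset shows up with a normal traceback rather than as a dead worker. A minimal sketch, assuming the dataset supports len() and integer indexing; find_bad_samples is a hypothetical helper, not part of MXNet:

```python
def find_bad_samples(dataset, limit=None):
    """Index the dataset sequentially and collect the samples that raise.

    Hypothetical helper (not part of MXNet). It runs in the calling
    process, so no worker processes or pickling are involved.
    """
    total = len(dataset) if limit is None else min(limit, len(dataset))
    failures = []
    for index in range(total):
        try:
            dataset[index]
        except Exception as exc:  # keep going so every bad sample is reported
            failures.append((index, repr(exc)))
    return failures
```

If this returns an empty list for the whole dataset, the samples themselves load fine and the problem is more likely in the worker or transfer path.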

NDArray pickling uses shared memory when num_workers > 0, so pickling doesn't copy the underlying data, which helps performance. I have, however, heard a few users claim that using NumPy to transfer data between processes is faster than using NDArrays with shared-memory pickling. I always assumed that they were doing something wrong, because NumPy doesn't support shared memory AFAIK, but maybe there is something I'm missing.
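
To test the "pickling is slow" guess, one rough check is to time an ordinary pickle round trip of a single sample. A minimal sketch using only the Python 3 standard library; time_pickle_roundtrip is a hypothetical helper, and it measures plain pickle, not the shared-memory path the DataLoader workers use, so treat the number as a rough hint only:

```python
import pickle
import time


def time_pickle_roundtrip(sample, repeats=10):
    """Return the average seconds for one pickle dumps + loads of `sample`."""
    start = time.perf_counter()
    for _ in range(repeats):
        payload = pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)
        pickle.loads(payload)
    return (time.perf_counter() - start) / repeats


# Stand-in for a tuple of 4 arrays: four 1 MiB byte buffers.
sample = tuple(bytes(1024 * 1024) for _ in range(4))
seconds = time_pickle_roundtrip(sample)
print("average round trip: %.6f s" % seconds)
```

Passing the same sample once as NDArrays and once converted with .asnumpy() would give a direct comparison of the two approaches mentioned above.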

related issues:

Hi guys,
I have a related issue when I try to put a value higher than 0 for num_workers.

mxnet.base.MXNetError:E:\pyjq\tp\opencv\opencv\modules\imgcodecs\src\loadsave.cpp:721: error: (-2) unable to remove temporary file in function cv::imdecode_

I am running the example for finetuning an object-detection network.

I am using MXNet 1.5.0 with CUDA 9.2, OpenCV 3.4.2 and Python 3.7.4 on Windows.

Hi @LauLauThom,

There’s limited support for multiple workers on Windows because Windows does not support process forking. Maybe try using thread_pool=True in your DataLoader.
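
A minimal sketch of that suggestion: pick the DataLoader keyword arguments based on the platform, so Windows uses threads while other systems keep worker processes. loader_kwargs is a hypothetical helper, and it assumes an MXNet version whose DataLoader accepts thread_pool:

```python
import sys


def loader_kwargs(num_workers, platform=None):
    """Keyword arguments for gluon.data.DataLoader (hypothetical helper).

    On Windows, ask for a thread pool instead of worker processes.
    """
    platform = sys.platform if platform is None else platform
    kwargs = {"num_workers": num_workers}
    if num_workers > 0 and platform.startswith("win"):
        kwargs["thread_pool"] = True
    return kwargs


# Intended use (requires MXNet; `dataset` and `batch_size` are placeholders):
#   loader = mx.gluon.data.DataLoader(dataset, batch_size=batch_size,
#                                     **loader_kwargs(4))
```

Threads avoid pickling between processes entirely, though Python-level work inside the dataset is still limited by the GIL.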
