Problem with multiprocessing and CPU shared storage

Hi,

I’m using Python multiprocessing to speed up the dataloader in a Gluon package. The contexts of the net, trainer, and ndarrays all have the format @cpu_shared(0). The code runs well on my local machine. However, when I run it on an AWS EC2 instance with more cores, the multiprocessing part succeeds, but the MXNet part fails with an error about CPU shared storage.
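
For reference, the setup looks roughly like this (a simplified sketch with a placeholder one-layer net, not my actual model; the point is only that everything lives on the cpu_shared(0) context):

import mxnet as mx
from mxnet import gluon

ctx = mx.Context('cpu_shared', 0)            # cpu_shared(0) context

net = gluon.nn.Dense(10, in_units=32)        # placeholder network
net.initialize(ctx=ctx)                      # parameters allocated on cpu_shared(0)
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

x = mx.nd.zeros((1, 32), ctx=ctx)            # ndarrays also on cpu_shared(0)
print(x.context)                             # prints cpu_shared(0)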

I would like to figure out what’s wrong with the EC2 instance. The error message is attached. Thanks in advance!

File “/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/metric.py”, line 1289, in update
File “/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py”, line 1998, in asscalar
File “/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py”, line 1980, in asnumpy
File “/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py”, line 252, in check_call
mxnet.base.MXNetError: [22:47:24] src/storage/./cpu_shared_storage_manager.h:183: Failed to open shared memory. shm_open failed with error Too many open files

It is quite easy to speed up the dataloader in either of the following two ways:

#train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform)
# apply the transform eagerly (once, up front) instead of lazily per sample
train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform, lazy=False)

or

train_data = mx.gluon.data.vision.MNIST(train=True)  # no .transform_first(data_xform)
# do NOT transform in the dataset; transform inside the training loop instead
...
for epoch in range(epochs):
    for data, label in train_loader:
        data = data_xform(data)

Here is an example and benchmark showing how to make the dataloader faster.

Although I cannot tell why multiprocessing fails, this workaround may help.
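
To make that concrete, here is a minimal runnable sketch that combines option 1 with the built-in num_workers of the Gluon DataLoader. The transform, batch size, and worker count are placeholder choices, not benchmarked values:

import mxnet as mx
from mxnet import nd

def data_xform(data):
    # placeholder transform: HWC uint8 -> CHW float32 in [0, 1]
    return nd.moveaxis(data, 2, 0).astype('float32') / 255

# eager transform: applied once, up front (lazy=False)
train_data = mx.gluon.data.vision.MNIST(train=True).transform_first(data_xform, lazy=False)

# let the DataLoader handle worker processes instead of hand-rolled multiprocessing
train_loader = mx.gluon.data.DataLoader(train_data, batch_size=128,
                                        shuffle=True, num_workers=4)

for data, label in train_loader:
    pass  # forward pass, loss, backward, trainer.step(...) go here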

for epoch in range(epochs):
    for data, label in train_loader:
        data = data_xform(data)

I’m using gluonts for time series forecasting. The problem is that the train_loader in their package does not support num_workers, so I use multiprocessing to speed up the “for data, label in train_loader” loop.
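
Roughly, the idea is something like this (a simplified sketch, not my exact code; epochs, train_loader, and data_xform are the names from the snippet above, and the worker count is arbitrary):

from multiprocessing import Pool
import mxnet as mx

with Pool(processes=8) as pool:
    for epoch in range(epochs):
        for data, label in train_loader:
            # transform the samples of the batch in parallel worker processes
            pieces = pool.map(data_xform, [d for d in data])
            data = mx.nd.stack(*pieces)
            # ... forward pass, loss, backward, trainer.step(...)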

This part actually succeeds with multiple cores on the EC2 machine, but MXNet gives me an error when I calculate the loss function on the NN’s output:
[22:47:24] src/storage/./cpu_shared_storage_manager.h:183: Failed to open shared memory. shm_open failed with error Too many open files

https://docs.oracle.com/cd/E19623-01/820-6168/file-descriptor-requirements.html


If you really want to use multiprocessing, the link above may help with the “too many open files” problem.
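
If raising the limit is an option, the usual fix is to bump the per-process open-file limit before training, e.g. with ulimit -n in the shell. A minimal sketch of doing the same from Python (the soft limit can only be raised up to the hard limit your system allows):

import resource

# raise the soft limit on open file descriptors up to the hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('current soft/hard limits:', soft, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))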