Gluoncv SSD working in notebook, failing in docker on same notebook

olivcruche · March 27, 2020, 2:18pm

Hi,
I’m training a GluonCV SSD script on a SageMaker p3.2xlarge notebook instance (V100 GPU).
The training runs fine in the notebook.
The exact same script, running on the same instance but inside the official AWS SageMaker Docker image for MXNet (https://github.com/aws/sagemaker-mxnet-container), fails with this error:

Worker timed out after 120 seconds. This might be caused by 

            - Slow transform. Please increase timeout to allow slower data loading in each worker.
            - Insufficient shared_memory if `timeout` is large enough.
            Please consider reduce `num_workers` or increase shared_memory in system.

I have never seen this error in 2 years. What is it, and why does it happen inside Docker but not outside it?

cbarre · April 2, 2020, 7:16pm

Is SSDDefaultTrainTransform slow to run, or are the images perhaps not loading at all?
Try setting pin_memory=True on the DataLoader, and if num_workers > 0, make sure the Docker container can actually access the data.
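
Since the error message also mentions insufficient shared memory, it may be worth checking how much of it the container actually has. Below is a minimal sketch, not a confirmed fix: `shared_memory_report` and `suggest_loader_kwargs` are hypothetical helper names, the `/dev/shm` path assumes Linux, and the 256 MiB-per-worker threshold is an arbitrary rule of thumb chosen for illustration. The returned keys (`num_workers`, `pin_memory`, `timeout`) are meant to be passed to `mxnet.gluon.data.DataLoader`.

```python
import os
import shutil


def shared_memory_report(path="/dev/shm"):
    """Return (total_mib, free_mib) for the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total / 2**20, usage.free / 2**20


def suggest_loader_kwargs(shm_total_mib, requested_workers,
                          min_mib_per_worker=256, timeout=600):
    """Cap the worker count by available shared memory (illustrative heuristic only)."""
    # Assumption: each worker process needs roughly min_mib_per_worker of shared memory.
    affordable = int(shm_total_mib // min_mib_per_worker)
    workers = max(0, min(requested_workers, affordable))
    return {
        "num_workers": workers,   # 0 means batches are loaded in the main process
        "pin_memory": True,       # as suggested in the reply above
        "timeout": timeout,       # seconds to wait for a worker before raising
    }


if __name__ == "__main__":
    report = shared_memory_report()
    if report is None:
        print("No /dev/shm mount found")
    else:
        total_mib, free_mib = report
        print(f"/dev/shm: {total_mib:.0f} MiB total, {free_mib:.0f} MiB free")
        print(suggest_loader_kwargs(total_mib, requested_workers=8))
```

Running this once in the notebook and once inside the container should show whether the two environments differ. Docker gives containers a 64 MiB /dev/shm by default, so if the container reports a value that small, starting it with a larger `--shm-size`, or lowering `num_workers`, would be the first thing to try.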
