Training freezes on EC2 P3 instances with 100% Volatile GPU utilization

I am training a 3-layer BiLSTM for sequence labeling on text.
I have tried different versions of MXNet (1.1/1.2/1.3) and CUDA (8/9).

On P3 instances, the training pipeline freezes non-deterministically with 100% volatile GPU utilization.
The same pipeline runs fine on P2 instances (with identical Python packages, CUDA, and MXNet versions).

Would you kindly provide a minimal reproducible example (code)? It would be impossible to diagnose based on the current information. Thanks!

My model is a vanilla 3-layer Bi-LSTM, as described here.

I am training a character-level sequence prediction task where I limit the sequence length to 200 characters and the batch size to 32. I have also tried smaller sequence lengths and batch sizes. The characters are simply mapped to integers. The pipeline is akin to character-level models for predicting punctuation.
Unfortunately I cannot share the exact data and dataloader I am using as it is proprietary.
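To give a rough idea of what the dataloader does, here is a simplified sketch of the preprocessing with synthetic data; all names are hypothetical and this is not the proprietary code, just the same shape of pipeline (characters mapped to integers, truncated/padded to 200, batched in 32s):

```python
# Hypothetical sketch of the preprocessing described above -- synthetic data,
# made-up names; the real dataloader is proprietary.
import numpy as np

MAX_LEN = 200    # sequence length cap
BATCH_SIZE = 32

def encode(text, vocab):
    """Map characters to integers, truncating and zero-padding to MAX_LEN."""
    ids = [vocab.get(ch, 0) for ch in text[:MAX_LEN]]
    ids += [0] * (MAX_LEN - len(ids))  # 0 reserved for padding/unknown
    return ids

def batches(texts, vocab):
    """Yield (BATCH_SIZE, MAX_LEN) int32 arrays, as fed to the BiLSTM."""
    for i in range(0, len(texts), BATCH_SIZE):
        chunk = texts[i:i + BATCH_SIZE]
        yield np.array([encode(t, vocab) for t in chunk], dtype="int32")

# Build a toy vocab from the data itself and iterate over batches
texts = ["hello world"] * 64
vocab = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(texts))))}
for batch in batches(texts, vocab):
    print(batch.shape)  # (32, 200)
```

The arrays produced here are what get wrapped into mx.nd arrays and pushed to the GPU each step.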

Are there any similar public implementations I can try to run? Or any other diagnostics I can provide which might help?