Training freezes on EC2 P3 instances with 100% Volatile GPU utilization

I am training a 3-layer BiLSTM for sequence labeling on text.
I have tried different versions of MXNet (1.1/1.2/1.3) and CUDA (8/9).

On P3 instances, the training pipeline freezes non-deterministically with 100% volatile GPU utilization.
The same pipeline runs fine on P2 instances (with identical Python packages, CUDA, and MXNet versions).

Would you kindly provide a minimal reproducible example (code)? It would be impossible to diagnose based on the current information. Thanks!

My model is a vanilla 3-layer Bi-LSTM, as described here.

I am training a character-level sequence prediction task where I limit the sequence length to 200 characters and the batch size to 32. I have also tried smaller sequence lengths and batch sizes. The characters are simply mapped to integers. The pipeline is akin to character-level models for predicting punctuation.
Unfortunately I cannot share the exact data and dataloader I am using as it is proprietary.
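To give a rough idea of what the dataloader does, here is a simplified sketch of the preprocessing with synthetic data; all names are hypothetical and this is not the proprietary code, just the same shape of pipeline (characters mapped to integers, truncated/padded to 200, batched in 32s):

```python
# Hypothetical sketch of the preprocessing described above -- synthetic data,
# made-up names; the real dataloader is proprietary.
import numpy as np

MAX_LEN = 200    # sequence length cap
BATCH_SIZE = 32

def encode(text, vocab):
    """Map characters to integers, truncating and zero-padding to MAX_LEN."""
    ids = [vocab.get(ch, 0) for ch in text[:MAX_LEN]]
    ids += [0] * (MAX_LEN - len(ids))  # 0 reserved for padding/unknown
    return ids

def batches(texts, vocab):
    """Yield (BATCH_SIZE, MAX_LEN) int32 arrays, as fed to the BiLSTM."""
    for i in range(0, len(texts), BATCH_SIZE):
        chunk = texts[i:i + BATCH_SIZE]
        yield np.array([encode(t, vocab) for t in chunk], dtype="int32")

# Build a toy vocab from the data itself and iterate over batches
texts = ["hello world"] * 64
vocab = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(texts))))}
for batch in batches(texts, vocab):
    print(batch.shape)  # (32, 200)
```

The arrays produced here are what get wrapped into mx.nd arrays and pushed to the GPU each step.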

Are there any similar public implementations I can try to run? Or any other diagnostics I can provide which might help?