Debug fine, but run OOM...HELP!

Mooonside · December 17, 2018, 4:16am

Hi! I am using MXNet 1.3 with CUDA 9.2 on Ubuntu 16.04. When I debug my code in the PyCharm IDE, training runs fine, but when I run the code directly, it reports OOM when calling xxx.asnumpy() in one of my callback functions. My questions are: 1. Why does the code behave differently in the two cases? 2. If OOM does occur, shouldn’t the error be raised in the forward & backward calls?

larroy · December 17, 2018, 2:20pm

PyCharm is likely using a lot more memory to display your variables, since it shows variable values inline in the source window. I’m not surprised by this.

Mooonside · December 17, 2018, 3:29pm

So if debugging leads to more memory consumption, why doesn’t OOM occur in debugging mode? Moreover, I believe that debugging in PyCharm consumes CPU memory, which has little to do with exhausting the GPU’s resources…

smolix · December 17, 2018, 9:46pm

@Mooonside - I think that the fact that xxx.asnumpy() is causing the problem should be a pretty good clue that you’re having a problem on the CPU side (after all, PyCharm doesn’t run on the GPU). Can you monitor memory in parallel, e.g. using top in a terminal to watch consumption from PyCharm and from python proper?
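
As a complement to watching top, host memory can also be logged from inside the training script itself. The sketch below is only an illustration, assuming a Linux system where /proc is available (as on Ubuntu); the helper names parse_vm_rss_kib, rss_mib and log_rss are made up for this example and are not part of MXNet.

```python
import os


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1])
    return None


def rss_mib(pid=None):
    """Resident memory of a process in MiB (Linux only, reads /proc)."""
    pid = os.getpid() if pid is None else pid
    with open("/proc/{}/status".format(pid)) as status_file:
        kib = parse_vm_rss_kib(status_file.read())
    return None if kib is None else kib / 1024.0


def log_rss(tag, pid=None):
    """Print the resident memory of a process with a short label."""
    value = rss_mib(pid)
    if value is None:
        print("[host mem] {}: VmRSS not reported".format(tag))
    else:
        print("[host mem] {}: {:.1f} MiB resident".format(tag, value))
```

Calling log_rss("before asnumpy") and log_rss("after asnumpy") around the callback, once under the debugger and once in a plain run, would show whether host memory differs between the two cases.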

Mooonside · December 18, 2018, 4:36am

Thank you for your reply! So the exact error message is:

mxnet.base.MXNetError: [12:02:14] src/storage/./pooled_storage_manager.h:119: cudaMalloc failed: out of memory

And I checked what you suggested; the CPU’s memory state is:

So I think it’s a GPU issue. I also found that even in debug mode it occasionally reports OOM, but not every time. What confuses me most now is whether xxx.asnumpy() occupies any GPU memory, and why OOM doesn’t occur when calling forward & backward().
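
One possible explanation, offered as an assumption rather than a diagnosis of this script: MXNet queues operations asynchronously, and asnumpy() blocks until its result is ready, so a failure in earlier queued work may only be reported at that call. A way to check is to force a synchronization right after each stage and see which stage raises. In the sketch below, run_stage and FakeAsyncEngine are hypothetical names for illustration; FakeAsyncEngine only imitates deferred errors so the pattern can be tried without a GPU.

```python
def run_stage(name, fn, sync):
    """Run fn(), then sync(), so that deferred errors surface inside this stage."""
    try:
        result = fn()
        sync()  # assumption: a blocking call such as wait_to_read() on the outputs
    except Exception as exc:
        raise RuntimeError("stage '{}' failed: {}".format(name, exc)) from exc
    return result


class FakeAsyncEngine(object):
    """Stand-in for an asynchronous engine: push() returns at once, errors come later."""

    def __init__(self):
        self._pending_error = None

    def push(self, fail=False):
        # Nothing is raised here; a failure is only recorded for later.
        if fail:
            self._pending_error = MemoryError("cudaMalloc failed: out of memory")

    def wait_all(self):
        # The recorded failure is reported at the synchronization point.
        if self._pending_error is not None:
            error, self._pending_error = self._pending_error, None
            raise error
```

In a real training loop the sync argument could be something like a call to wait_to_read() on the loss or output arrays, or mx.nd.waitall; whether waitall re-raises deferred errors may depend on the MXNet version. The extra synchronization slows training, so it is meant for debugging only.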

Mooonside · December 18, 2018, 4:42am

One more thing: if I remove all the callback functions (i.e. all the xxx.asnumpy() calls), the code runs without OOM. If I add one, running fails and debugging fails occasionally. If I add two, both running and debugging fail…
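
Since each added callback moves the run closer to failure, it may help to record GPU memory at every callback and see how near the limit the process already is. This is a rough sketch that assumes nvidia-smi is on the PATH; parse_gpu_memory, gpu_memory_mib and log_gpu_memory are illustrative names, not MXNet functions.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU as bare CSV numbers (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse rows like '1234, 11178' (one per GPU) into (used, total) tuples in MiB."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Run nvidia-smi and return a list of (used, total) MiB tuples, one per GPU."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_gpu_memory(output)


def log_gpu_memory(tag):
    """Print used and total memory for every visible GPU with a short label."""
    for index, (used, total) in enumerate(gpu_memory_mib()):
        print("[gpu {}] {}: {} / {} MiB".format(index, tag, used, total))
```

Note that the figure from nvidia-smi is what the process holds on the device, which can include memory that MXNet's pooled allocator has cached for reuse, so it is an upper bound on what live arrays actually need.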
