Why does a loop of dummy forward passes use all my GPU memory?

Hi all,

I noticed that a loop of dummy forward passes (arrays of zeros) uses up nearly all available GPU memory. Why is that? I assume it is some kind of optimization built into MXNet that uses available resources to speed things up. Is there a way to stop MXNet from doing that?

Here is some sample code. If I run this, all memory of my GPU is allocated, although I am using a very small network and the data is also tiny!

import mxnet as mx
from mxnet import gluon
import mxnet.autograd as ag

class Model(gluon.Block):
    def __init__(self, **kwargs):
        super(Model, self).__init__(**kwargs)
        with self.name_scope():
            self.dense0 = gluon.nn.Dense(20)
            self.dense1 = gluon.nn.Dense(20)
            self.mydense = gluon.nn.Dense(20, prefix='mydense_')

    def forward(self, x):
        x = mx.nd.relu(self.dense0(x))
        x = mx.nd.relu(self.dense1(x))
        return mx.nd.relu(self.mydense(x))

ctx = [mx.gpu()]
net = Model()
net.initialize(mx.init.Xavier(), ctx=ctx)

repeat_dummy = 1000000
for i in range(repeat_dummy):
    with ag.record():
        data = mx.nd.zeros((64,32,32,1), ctx[0])
        output = net(data)
    del output

Is there a way to force MXNet to free GPU memory that is no longer needed at the end of the for loop?

The problem is that I need to run several dummy forward passes. This works fine with the example code here (despite the high memory consumption), but with my own network it results in CUDA out-of-memory exceptions, even though the actual training would run without problems in the memory I have available.

Thanks for any replies!


You’re right in assuming that MXNet has optimizations that preallocate a memory pool to avoid the overhead of requesting new memory from CUDA for single operations. AFAIK you can’t switch this off entirely, but there are a number of environment variables you can set to control memory options in MXNet, in particular MXNET_GPU_MEM_POOL_RESERVE. See the following link for more on that:

However, why do you need to run several dummy forward passes?
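
For the MXNET_GPU_MEM_POOL_RESERVE setting mentioned above, here is a minimal sketch of setting it from Python instead of from the shell. The helper name reserve_gpu_memory and the value 20 are just made up for illustration. As far as I understand the docs, the value is a percentage of GPU memory that is kept out of the pool, and to be safe I would set it before MXNet is imported:

```python
import os

def reserve_gpu_memory(percent):
    """Set MXNET_GPU_MEM_POOL_RESERVE (hypothetical helper, not an MXNet API).

    Call this before `import mxnet` so the value is in place when
    MXNet sets up its GPU memory pool.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Environment variables are strings, so convert the number explicitly.
    os.environ["MXNET_GPU_MEM_POOL_RESERVE"] = str(percent)

# 20 is an arbitrary example value, not a recommendation.
reserve_gpu_memory(20)

# import mxnet as mx   # import MXNet only after the variable is set
```

The shell equivalent would be exporting the variable before starting the script. Note that this only changes how much memory stays outside the pool; it does not turn pooling off.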

@adrian the problem is that MXNet execution is asynchronous, which means operations get enqueued and are executed as soon as possible. What happens here is that you are enqueueing operations faster than they are being processed.

Add an mx.nd.waitall() call inside your loop and the memory usage will remain constant, because MXNet will wait for each execution to complete before going on to the next. Otherwise you are simply loading your GPU with all the data to be processed next.

Not every layer is executed in a single forward pass (which layers run depends on random noise), so not every parameter is initialized after a single forward pass.

Thanks for the explanation and solution!