Memory issue with Module forward function

I recently tried to implement CycleGAN with the symbolic API (together with the Module API). I noticed that during training, MXNet keeps taking more and more of my RAM until it fills the whole 32 GB. This happens not while loading the dataset, but while training is running. So I investigated different parts of my code to find the source of the problem, and narrowed it down to the forward pass through the generator network.
So I reduced it to the code below. No dataset is loaded; it is just a network that accepts a random vector and generates an image at the output. The network is the generator network from


Here we just run the forward pass 10,000,000 times. If you run this code, you can watch Python take more and more of your RAM as it runs.
Can someone please explain what's happening and how I can fix it? (I am using a Titan X (Maxwell) with CUDA 8.0 on Ubuntu 16.04 and MXNet 1.0.0.)

import mxnet as mx
class RandIter(mx.io.DataIter):
    def __init__(self, batch_size, ndim):
        self.batch_size = batch_size
        self.ndim = ndim
        self.provide_data = [('rand', (batch_size, ndim, 1, 1))]
        self.provide_label = []

    def iter_next(self):
        #This iterator never runs out: it produces random batches forever
        return True

    def getdata(self):
        #Returns random numbers from a gaussian (normal) distribution
        #with mean=0 and standard deviation = 1
        return [mx.random.normal(0, 1.0, shape=(self.batch_size, self.ndim, 1, 1))]


batch_size=16
Z = 100
rand_iter = RandIter(batch_size, Z)

no_bias = True
fix_gamma = True
epsilon = 1e-5 + 1e-12

rand = mx.sym.Variable('rand')

g1 = mx.sym.Deconvolution(rand, name='g1', kernel=(4,4), num_filter=1024, no_bias=no_bias)
gbn1 = mx.sym.BatchNorm(g1, name='gbn1', fix_gamma=fix_gamma, eps=epsilon)
gact1 = mx.sym.Activation(gbn1, name='gact1', act_type='relu')

g2 = mx.sym.Deconvolution(gact1, name='g2', kernel=(4,4), stride=(2,2), pad=(1,1), num_filter=512, no_bias=no_bias)
gbn2 = mx.sym.BatchNorm(g2, name='gbn2', fix_gamma=fix_gamma, eps=epsilon)
gact2 = mx.sym.Activation(gbn2, name='gact2', act_type='relu')

g3 = mx.sym.Deconvolution(gact2, name='g3', kernel=(4,4), stride=(2,2), pad=(1,1), num_filter=256, no_bias=no_bias)
gbn3 = mx.sym.BatchNorm(g3, name='gbn3', fix_gamma=fix_gamma, eps=epsilon)
gact3 = mx.sym.Activation(gbn3, name='gact3', act_type='relu')

g4 = mx.sym.Deconvolution(gact3, name='g4', kernel=(4,4), stride=(2,2), pad=(1,1), num_filter=128, no_bias=no_bias)
gbn4 = mx.sym.BatchNorm(g4, name='gbn4', fix_gamma=fix_gamma, eps=epsilon)
gact4 = mx.sym.Activation(gbn4, name='gact4', act_type='relu')

g5 = mx.sym.Deconvolution(gact4, name='g5', kernel=(4,4), stride=(2,2), pad=(1,1), num_filter=3, no_bias=no_bias)
generatorSymbol = mx.sym.Activation(g5, name='gact5', act_type='tanh')

#Hyper-parameters
sigma = 0.02
lr = 0.0002
beta1 = 0.5
# Define the compute context, use GPU if available
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

#=============Generator Module=============
generator = mx.mod.Module(symbol=generatorSymbol, data_names=('rand',), label_names=None, context=ctx)
generator.bind(data_shapes=rand_iter.provide_data)
generator.init_params(initializer=mx.init.Normal(sigma))

print('Training...')

for i in range(10000000):
    rbatch = rand_iter.next()
    generator.forward(rbatch, is_train=True)

MXNet has an asynchronous execution engine. When you call forward, all that happens is that the sequence of operations, along with the data they need, is scheduled for execution by the engine. Your loop is scheduling operations faster than the engine can process them, so memory keeps growing. If you use any of the blocking calls (nd.waitall(), NDArray.asnumpy(), NDArray.asscalar(), or NDArray.wait_to_read()), the symptom will stop.


Thanks for the reply. Adding mx.ndarray.waitall() after each iteration solved the memory issue.