Memory leak when running cpu inference

I’m running into a memory leak when performing inference on an mxnet model (i.e. converting an image buffer to tensor and running one forward pass through the model).

A minimal reproducible example is below:

import mxnet
from gluoncv import model_zoo
from gluoncv.data.transforms.presets import ssd

model = model_zoo.get_model('ssd_512_resnet50_v1_coco')

for _ in range(100000):
    # note: an example imgbuf string is too long to post
    # see gist or use requests etc. to obtain one
    imgbuf = 
    ndarray = mxnet.image.imdecode(imgbuf, to_rgb=1)
    tensor, orig = ssd.transform_test(ndarray, 512)
    labels, confidences, bboxs = model(tensor)

The result is a linear increase of RSS memory (from 700MB up to 10GB+).

Libraries used: gluoncv==0.3.0, mxnet-mkl==1.3.1

The problem persists with other pretrained models and with a custom model that I am trying to use. Inspecting the garbage collector (e.g. with gc.get_objects()) shows no increase in Python object counts.

This gist has the full code snippet including an example imgbuf.

This is very likely due to you enqueuing ops faster than MXNet is able to process them.
MXNet is fundamentally asynchronous: calling forward only enqueues the computation on the execution engine, effectively saying "compute this as soon as possible". The Python call returns immediately, which allows very simple and intuitive parallelism.
To properly benchmark you need to add a synchronous call.
For example mx.nd.waitall(), labels.wait_to_read(), or bboxs.asnumpy(), etc.
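This backlog effect can be illustrated with a pure-Python analogy (not MXNet itself): a one-worker thread pool stands in for the engine. Submitting work returns immediately, just like calling forward(), and without a synchronisation point the pending results pile up.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_forward(x):
    time.sleep(0.01)          # pretend the engine needs time per op
    return x * 2

executor = ThreadPoolExecutor(max_workers=1)   # stands in for the engine
futures = [executor.submit(fake_forward, i) for i in range(50)]

# Right after submitting, most work is still pending (the "leak").
print("pending after submitting:", sum(1 for f in futures if not f.done()))

# The analogue of mx.nd.waitall() / .asnumpy(): block until everything is done.
results = [f.result() for f in futures]
print("pending after waiting:", sum(1 for f in futures if not f.done()))  # 0
executor.shutdown()
```

In real MXNet code the RSS growth comes from the queued operations and their intermediate buffers, so a periodic synchronous call bounds it the same way the blocking result() calls do here.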


Hey thanks for the quick reply.

You are right, in the example above adding the synchronous call stops the memory increasing.

In my actual use case (which I tried to simplify above, but clearly not properly!) I actually already had this in place, and am still seeing constant memory increase. My program uses a queue system to feed image buffers to a function which does the tensor transformation and forward pass, then puts the result back on a different queue. If I perform this without the mxnet component (e.g. either the function returns a fake result, or the function does some ML work using a different library such as pytorch) then the memory is stable.
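Roughly, the structure is the following (a standard-library sketch, with `infer` standing in for the decode + transform + forward step; the comment marks where the synchronous MXNet call already sits in my real code):

```python
import queue
import threading

def inference_worker(in_q, out_q, infer):
    # infer() stands in for imdecode + transform_test + forward.
    # In the MXNet version, the result is materialised with .asnumpy()
    # (or mx.nd.waitall()) *inside* the worker, before it is handed
    # to the output queue.
    while True:
        item = in_q.get()
        if item is None:      # sentinel: shut down the worker
            break
        out_q.put(infer(item))

in_q, out_q = queue.Queue(), queue.Queue()
t = threading.Thread(target=inference_worker,
                     args=(in_q, out_q, lambda x: x + 1))
t.start()
for i in range(5):
    in_q.put(i)
in_q.put(None)
t.join()
results = sorted(out_q.get() for _ in range(5))
print(results)  # [1, 2, 3, 4, 5]
```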

Any ideas on what may be causing this? Or do you know if there is a way to force mxnet to release all memory?


Could you share a bigger snippet of your code?
MXNet should release the memory once an array goes out of scope and gets garbage collected.
My hunch is that you are calling nd.array somewhere and keeping a reference to that object.

Sorry for hijacking.
I have a similar problem, where I repeatedly call a function that loads a model and returns a prediction, and memory keeps increasing with the number of calls to that function.

Here is some stripped-down example code:

import mxnet as mx
import numpy as np
import cv2

CTX = mx.cpu()

def resize(img, img_dims):
    img = cv2.resize(img, (img_dims[0], img_dims[1]))
    img = np.swapaxes(img, 0, 2)   # HWC -> CWH
    img = np.swapaxes(img, 1, 2)   # CWH -> CHW
    img = img[np.newaxis, :].astype(np.float32) / 255.0
    return mx.nd.array(img)

def predict():
    img = cv2.imread('/path/to/some/image.jpg')
    small_img = resize(img.copy(), (224, 224))
    model_name = "/path/to/model.json"
    model_params = "/path/to/model.params"
    model = mx.gluon.nn.SymbolBlock.imports(model_name, ['data'], model_params, ctx=CTX)
    return model(small_img).asnumpy()

def main(repeats=3):
    for i in range(repeats):
        result = predict()

if __name__ == '__main__':
    main()

Versions: mxnet 1.3.0, python 3.6.6

The idea was to load and predict inside a function such that memory would be freed up once the function call is done and model/data are out of scope.


Hi, two things.

First, regarding the loop, it’s the same issue as above: the engine is asynchronous, so what’s happening is that you’re giving it work faster than it can complete it.

Add a synchronous call to the loop, e.g. print(i, result) or mx.nd.waitall()

Also see the discussion above by Thomas et al.

Second, you shouldn’t re-load your model on each invocation. It’s better to have a class that loads it once and re-uses it for each prediction. :slight_smile:


Hi @VishaalKapoor,

thanks for the reply.
.asnumpy() and mx.nd.waitall() do not prevent this problem from happening, unfortunately. As for the load-model-once-make-several-predictions approach: that reduces the problem to some extent, as the memory still increases continuously, but at a lower rate than when the model load also happens inside the loop.
Secondly, our use case is server-ish in nature, i.e. depending on the input/request a different model is loaded and used for prediction (which, agreed, is a debatable design decision), so keeping all models in memory at all times is not ideal from a resource point of view.
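One compromise for that server-ish case would be a small bounded cache, so only the most recently used models stay resident; here is a sketch using functools.lru_cache, with a counting stub in place of the real model load:

```python
from functools import lru_cache

load_counts = {}                 # tracks real loads, for illustration

@lru_cache(maxsize=2)            # keep at most 2 models resident
def get_model(name):
    # In the real server this body would be the SymbolBlock.imports
    # call; a counting stub keeps the sketch self-contained.
    load_counts[name] = load_counts.get(name, 0) + 1
    return lambda x: (name, x)

get_model("a"); get_model("b")   # two loads
get_model("a")                   # cache hit, no load
get_model("c")                   # loads "c", evicts least recent ("b")
get_model("b")                   # "b" must be reloaded
print(load_counts)               # {'a': 1, 'b': 2, 'c': 1}
```

get_model.cache_clear() drops all cached models if memory needs to be reclaimed explicitly.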

Could it be an issue with the nn-model itself?


Are you seeing out-of-memory errors followed by segfaults?

MXNet will re-use memory, but usage may appear to keep going up if you look at nvidia-smi. If you see an eventual OOM error then something is wrong. It’s unlikely to be a memory leak, and more likely that you’re hanging on to references to memory somehow.

If your model has a fixed number of params it shouldn’t be causing the issue you’re seeing.

I can’t say further without seeing the model. If it’s something you can attach to the post, it would be helpful to debug.

Hi, @abieler @ThomasDelteil @VishaalKapoor have you solved your problem? I encountered this memory leakage problem during inference as well. Using waitall or asnumpy does not prevent this happening.

It is very strange. I use the same dataloader during training and validation, and there is no CPU memory leak there; this only happens during inference.

Hi @eb94 @abieler were you able to fix this issue? I tried all the possible ways described above to fix this, but no luck. Please let me know if anyone can help me with this. Thanks in advance.

Hi @bryanyzhu @Sathu_Hareesh I was not able to resolve the problem. I also tried the 1.4.x versions, which did not help.
Anyway, I did some testing and found the following:
I artificially limited memory resources in a Docker setup and ran my model a couple of times. Setting the available memory to roughly 2x what the model needs on the first forward pass “fixed” the problem.
As an example, the model was using ~800 MB of RAM after one forward pass, and giving the application 1.6 GB seems to fix the problem (when allowing only 1.4 GB it would crash after a while). So there does seem to be a ceiling on how memory-hungry this becomes.

Not much, but maybe it helps.

