Reuse memory of mxnet::cpp::NDArray

How can I reuse the memory of an NDArray?
According to the example, when I want to copy OpenCV data into an NDArray, I have to create a new NDArray, which allocates a new buffer.

Is it possible to have the NDArray reuse the float pointer directly, or to copy the data behind the float pointer into the memory of an old NDArray that has already been allocated?
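For reference, here is roughly the pattern I mean (a sketch only; the helper name MatToNDArray is mine, and layout/normalization details are omitted):

#include <opencv2/core.hpp>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Sketch of the current approach: every frame creates a brand-new NDArray,
// so a fresh buffer is allocated and the pixels are copied into it.
NDArray MatToNDArray(const cv::Mat &img, const Context &ctx) {
  cv::Mat float_img;
  img.convertTo(float_img, CV_32FC3);  // the network expects float input
  Shape shape(1, float_img.rows, float_img.cols, float_img.channels());
  // This constructor allocates new NDArray memory and copies from the pointer.
  return NDArray(float_img.ptr<mx_float>(), shape, ctx);
}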

No, it is not possible. What are you trying to do that makes you concerned about the memory usage?

Because reallocating memory every time is unnecessary; this cost could easily be avoided by users if NDArray provided an API like OpenCV's cv::Mat.
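What I mean is something like this (a sketch of how cv::Mat can act as a header over an existing buffer, without allocating or copying):

#include <opencv2/core.hpp>
#include <vector>

void Example() {
  std::vector<float> buffer(480 * 640 * 3);         // memory owned elsewhere
  cv::Mat view(480, 640, CV_32FC3, buffer.data());  // header only: no allocation, no copy
}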

Premature optimization is evil, but we do not need to become pessimistic either.

MXNet has an asynchronous execution engine. In a typical training setting, your training is done on GPU and data preprocessing is done on CPU. Because of the asynchronous nature of MXNet, preprocessing of the next batch of data can happen in parallel to graph computation of the current batch and the cost of an extra memcpy during preprocessing would not impact your training performance.
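To illustrate the overlap with the C++ API (a rough sketch; exec is an already-bound Executor, "data" is the input name, and PreprocessBatch is a hypothetical CPU-side step):

#include <cstddef>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Hypothetical CPU-side preprocessing (decode, resize, normalize, ...).
std::vector<float> PreprocessBatch(size_t i);

// Forward() only enqueues work on MXNet's asynchronous engine, so the CPU is
// free to prepare batch i+1 while the device is still computing batch i.
void RunBatches(Executor *exec, size_t num_batches) {
  std::vector<float> batch = PreprocessBatch(0);
  for (size_t i = 0; i < num_batches; ++i) {
    exec->arg_dict()["data"].SyncCopyFromCPU(batch.data(), batch.size());
    exec->Forward(false);                  // non-blocking: just enqueued
    if (i + 1 < num_batches)
      batch = PreprocessBatch(i + 1);      // overlaps with device compute
    exec->outputs[0].WaitToRead();         // block only when the result is needed
  }
}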

Also keep in mind that every single neural network operator in MXNet results in a memcpy after computation. The initial memcpy from cv to mxnet is going to be very negligible compared to even the smallest convolutional network.

Not for training; I want to save some extra cost when doing inference. Even if it is negligible for performance, it is not a bad thing to avoid the cost if the API is easy enough to use.

By the way, does the C++ API of MXNet support arbitrary batch sizes when doing inference?
Unlike the memory reallocation/copy issue, I think batch size does have a big impact on performance.

Thanks for your help.

Arbitrary batch-size is supported, but every time the batch-size changes, a new allocation for the network happens, which slows down inference. You can consider having a few batch-size buckets to avoid memory allocation for each new batch-size.
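For example (a sketch; BindForBatchSize stands in for whatever binding code you already use):

#include <map>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Hypothetical: binds the network with its input shaped for the given
// batch-size (e.g. via Symbol::SimpleBind) and returns the executor.
Executor *BindForBatchSize(int batch_size);

// Create a handful of buckets once, then round each request up to the
// nearest bucket instead of binding a new executor for every new size.
std::map<int, Executor *> MakeBuckets() {
  std::map<int, Executor *> buckets;
  for (int bs : {1, 4, 16}) buckets[bs] = BindForBatchSize(bs);
  return buckets;
}

Executor *PickBucket(std::map<int, Executor *> &buckets, int actual_batch) {
  auto it = buckets.lower_bound(actual_batch);   // first bucket >= actual_batch
  return it != buckets.end() ? it->second : buckets.rbegin()->second;
}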

Any plan to avoid reallocation of the network?

Not that practical, because:

  1. GPU/CPU memory is limited
  2. it is impossible to predict how many faces/persons/etc. will appear in a frame at runtime

Is this problem hard to solve? For example, would it require a lot of code changes, have a big impact on the architecture, etc.?

The only way to avoid memory reallocation is by having the network allocate memory for the largest possible batch-size and reuse that same memory when batch-size is smaller.

If you use the Gluon API, calling HybridBlock.hybridize(static_alloc=True) will do exactly that. With the CPP API, AFAIK, there isn't a way to specify this. Perhaps @leleamol, who's working on an update to the CPP API, may be able to point you to a solution.

The mxnet::cpp API supports creating shared executors. You would need to load the model and parameters only once and create shared executors catering to different batch-sizes.
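For instance, the setup could look roughly like this (a sketch only: the input name data0, the 3x224x224 shape, and the handling of the arg:/aux: prefixes inside the .params file are simplified placeholders; the example linked below shows the complete approach):

#include <map>
#include <string>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Load the symbol and parameters once, then bind one executor per batch-size.
// The parameter NDArrays are shared by all executors; only the input/output
// buffers differ per batch-size.
std::map<int, Executor *> BindForBatchSizes(const std::string &symbol_json,
                                            const std::string &params_file,
                                            const Context &ctx,
                                            const std::vector<int> &batch_sizes) {
  Symbol net = Symbol::Load(symbol_json);
  std::map<std::string, NDArray> params = NDArray::LoadToMap(params_file);

  std::map<int, Executor *> executors;
  for (int bs : batch_sizes) {
    std::map<std::string, NDArray> args = params;                 // reuse loaded weights
    args["data0"] = NDArray(Shape(bs, 3, 224, 224), ctx, false);  // per-batch-size input
    executors[bs] = net.SimpleBind(ctx, args);
  }
  return executors;
}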

I have written an RNN inference example (the PR is out for review) to demonstrate inference with variable input size.
Here is the link https://github.com/apache/incubator-mxnet/pull/13680

Please let me know if it helps.

Thanks, this helps a lot. I will pull it from GitHub and use it after this pull request is merged.

Some questions about the example.

About the constructor

args_map["data0"] = NDArray(Shape(num_words, 1), global_ctx, false);
args_map["data1"] = NDArray(Shape(1), global_ctx, false);
  1. Is num_words analogous to the batch size of a computer vision task?
  2. If 1 is correct, is the maximum batch size the same as num_words?
  3. Is “data1” the batch size at runtime?

According to PredictSentiment

std::vector<float> index_vector(num_words, GetIndexForWord("<eos>"));
int num_words = ConverToIndexVector(input_text, &index_vector);

executor->arg_dict()["data0"].SyncCopyFromCPU(index_vector.data(), index_vector.size());
executor->arg_dict()["data1"] = num_words; 
  1. Why do you initialize index_vector if you are going to clear it in ConverToIndexVector?
  2. index_vector.size() should have the same value as num_words, so why not just use it instead of num_words?

If I want to apply this technique to a computer vision task, what changes do I need to make? Are the following procedures correct?

When constructing:

args_map["data0"] = NDArray(Shape(max_batch_size, height, width, channel), global_ctx, false);
args_map["data1"] = NDArray(Shape(1), global_ctx, false);

When predicting:

std::vector<float> image_vector;
//predict_images is a vector which contains the images already converted to the
//format required by the mxnet network
for(auto const &img : predict_images){
    std::copy(std::begin(img), std::end(img), std::back_inserter(image_vector));
}
executor->arg_dict()["data0"].SyncCopyFromCPU(image_vector.data(), image_vector.size());
executor->arg_dict()["data1"] = num_words; 

From a technical viewpoint, is it possible to support a variable input batch size without pre-allocating for the maximum batch size?

Thanks

  1. For image data, the shape is always NCHW: args_map[“data0”] = NDArray(Shape(max_batch_size, channel, height, width), global_ctx, false);
  2. I use a fixed batch_size. If the inference img_num < batch_size, fill the remaining (batch_size - img_num) slots with dummy data. For example, with batch_size = 16 and img_num = 10, the other 6 images can be any data. With a fixed batch_size, the inference time is fixed.
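A rough sketch of item 2 (exec is the executor bound with the fixed batch_size, img_size = channel * height * width, and the names are placeholders):

#include <algorithm>
#include <cstddef>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Fixed-batch inference: copy the real images into the front of the buffer,
// leave the remaining slots as zero padding, and only read back the rows
// that correspond to real images.
void PredictFixedBatch(Executor *exec,
                       const std::vector<std::vector<float>> &images,  // img_num <= batch_size
                       size_t batch_size, size_t img_size) {
  std::vector<float> batch(batch_size * img_size, 0.0f);   // padding slots stay zero
  for (size_t i = 0; i < images.size(); ++i)
    std::copy(images[i].begin(), images[i].end(), batch.begin() + i * img_size);

  exec->arg_dict()["data0"].SyncCopyFromCPU(batch.data(), batch.size());
  exec->Forward(false);
  exec->outputs[0].WaitToRead();
  // Only the first images.size() rows of exec->outputs[0] are meaningful.
}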

  1. Is this a limit of MXNet or of CUDA?
  2. If the inference time is fixed, why do we need to declare both “data0” and “data1”? Why not just declare “data0”?

Hi @stereomatchingkiss,

Could you clarify what you're referring to regarding “data1” here?

As far as I can tell, data1 is the number of words in the input line, which is different from the batch size.

@leleamol please can you confirm this? thanks.

And it's based on this model and example in Gluon (Python). See the valid_length parameter in MeanPoolingLayer and you'll see the equivalent usage of data1 in the C++ code. So in theory you could use a subset of the words for sentiment prediction, but in this case it's set to the number of words in the line so that all of them are used.

I mean: could we make the inference time scale with the input size?

Example:
Assume the max batch size is 4 (“data0” == 4).

1 input takes 1 sec, 2 inputs take 1.2 sec, 4 inputs take 1.5 sec, 8 inputs take 3 sec.

What are the benefits?

Assume the maximum batch size is 10 and your actual input size is 5. No matter what your input size is, the inference time stays the same, and you still need to feed a float vector with the fixed batch size of 10. Unless “data1” can tell the executor to handle an input size bigger than 10, I don't see the benefit of declaring “data1”.

  1. When I checked last year, MXNet did not implement a ‘reshape’ function in C++. (Caffe has a reshape function, so batch_size can be changed before forward.) You can check whether MXNet supports this now.
  2. I only use data0.
  3. Maybe data1 supports a different batch_size? Does this mean we can have 2 different batch_sizes and share one model?

Would this cause the GPU memory to be reallocated? If yes, then it is not worth using unless it is very fast.

Using mxnet 1.3.1; reshape is not implemented yet:

 // TODO(zhangchen-qinyinghua)
 // To implement reshape function
 void Reshape();

Not sure about the latest code, but when I tried to build it I encountered a lot of bugs on Windows, so I will wait until the next stable version is released.

I hope so; the examples do not mention this at all.

Edit: I checked the code on GitHub; reshape hasn't been implemented yet. From a technical view, is it possible to reshape without memory reallocation? If not, what cost does it incur? Does it only reallocate the memory related to the input data, or does it need to reallocate memory for the whole network?

Edit: OpenCV 4 is planning to add CUDA support for the dnn module, so maybe in the future we will have a better choice for inference.

  1. In my opinion, similar to Caffe, MXNet should be able to reshape without memory reallocation.
  2. Without a reshape function, a different batch_size needs to reallocate memory for the whole network, unless ‘data0’/‘data1’ support a different batch_size, as mentioned above.
  3. To me, a fixed size is fine. You can set a suitable batch_size; if the number of given samples > batch_size, run the forward pass several times, otherwise just use the fixed batch_size, which means some useless computation is performed (see the sketch below).
  4. I tested batch_size = 4, 8, 16 or more: the bigger the batch_size, the smaller the mean forward time per sample (all_time/batch_size), but the difference is very small.
  5. Compared to Caffe, MXNet runs faster and uses less memory when running the same network.
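A rough sketch of item 3 (the FillData0 helper is hypothetical and would pad the unused slots as described earlier in the thread):

#include <algorithm>
#include <cstddef>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Hypothetical: copies n samples starting at `start` into "data0",
// zero-padding the unused slots of the fixed-size batch.
void FillData0(Executor *exec, const std::vector<std::vector<float>> &samples,
               size_t start, size_t n);

// Run the forward pass in fixed-size chunks; only the last chunk is padded,
// so some useless computation happens but no reallocation is needed.
void PredictAll(Executor *exec, const std::vector<std::vector<float>> &samples,
                size_t batch_size) {
  for (size_t start = 0; start < samples.size(); start += batch_size) {
    size_t n = std::min(batch_size, samples.size() - start);
    FillData0(exec, samples, start, n);
    exec->Forward(false);
    exec->outputs[0].WaitToRead();
    // Only the first n rows of exec->outputs[0] are valid for this chunk.
  }
}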

Great to know this is not a limit of CUDA or cuDNN.

Thanks, I have already adopted this solution and designed a generic class for it.