How can I reuse the memory of an NDArray? According to the example, when I want to copy memory from OpenCV to an NDArray, I have to create a new NDArray and allocate a new buffer.
Is it possible to have the NDArray reuse the float pointer directly, or to copy the data behind the float pointer into the memory of an old NDArray that has already been allocated?
Reallocating memory every time is unnecessary; users could easily avoid this cost if NDArray provided an API like OpenCV's cv::Mat.
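The pattern I am asking for is just "allocate once, fill in place every frame". A minimal sketch of that idea in plain C++ (FrameBuffer is a hypothetical stand-in, not an MXNet type; with the actual C++ API the equivalent would be creating the NDArray once and calling SyncCopyFromCPU on it for each new frame, as shown later in this thread):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A destination buffer allocated once and refilled in place for every
// frame, instead of allocating a fresh buffer per image.
struct FrameBuffer {
    std::vector<float> data;
    explicit FrameBuffer(std::size_t n) : data(n) {}

    // Reuses the existing allocation as long as the size matches.
    void CopyFrom(const float* src, std::size_t n) {
        assert(n == data.size());
        std::copy(src, src + n, data.begin());
    }
};
```

Repeated calls to CopyFrom touch the same allocation, which is exactly the reuse I want from NDArray.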
Premature optimization is evil, but we do not need to become pessimistic either.
MXNet has an asynchronous execution engine. In a typical training setting, your training is done on GPU and data preprocessing is done on CPU. Because of the asynchronous nature of MXNet, preprocessing of the next batch of data can happen in parallel to graph computation of the current batch and the cost of an extra memcpy during preprocessing would not impact your training performance.
Also keep in mind that every single neural network operator in MXNet results in a memcpy after computation. The initial memcpy from cv to mxnet is negligible compared to even the smallest convolutional network.
Not for training, but I want to save some extra cost when doing inference. Even if it is negligible for performance, it is not a bad thing to avoid the cost if the API is easy enough to use.
By the way, does the C++ API of MXNet support arbitrary batch sizes when doing inference?
Unlike the issue of reallocating/copying memory, I think batch size does have a big impact on performance.
Arbitrary batch-size is supported, but every time batch-size changes, a new allocation for the network happens which slows down inference. You can consider having a few batch-size buckets to avoid memory allocation for each new batch-size.
The only way to avoid memory reallocation is by having the network allocate memory for the largest possible batch-size and reuse that same memory when batch-size is smaller.
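The bucketing idea above can be sketched with a simple lookup: given the actual number of samples, pick the smallest preconfigured bucket that fits, and reuse the executor allocated for that bucket. A minimal sketch (bucket values are illustrative, and PickBucket is a hypothetical helper, not an MXNet API):

```cpp
#include <algorithm>
#include <vector>

// Picks the smallest batch-size bucket that fits n samples, so an executor
// allocated once per bucket can be reused instead of reallocating the
// network for every new batch size. buckets must be sorted ascending,
// e.g. {1, 4, 8, 16}. If n exceeds the largest bucket, the caller would
// split the input into several forward passes.
int PickBucket(const std::vector<int>& buckets, int n) {
    auto it = std::lower_bound(buckets.begin(), buckets.end(), n);
    return it == buckets.end() ? buckets.back() : *it;
}
```

With a handful of buckets you pay the allocation cost once per bucket instead of once per distinct input size.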
If you use the Gluon API, calling HybridBlock.hybridize(static_alloc=True) will do exactly that. With CPP API, AFAIK, there isn’t a way to specify this ability. Perhaps @leleamol who’s working on an update to CPP API may be able to point you to a solution.
The mxnet::cpp API supports creating shared executors. You would need to load the model and parameters only once and create shared executors catering to different batch-sizes.
std::vector<float> image_vector;
// predict_images is a vector containing the images already converted to the
// format required by the mxnet network
for (auto const &img : predict_images) {
    std::copy(std::begin(img), std::end(img), std::back_inserter(image_vector));
}
executor->arg_dict()["data0"].SyncCopyFromCPU(image_vector.data(), image_vector.size());
executor->arg_dict()["data1"] = num_words;
From a technical viewpoint, is it possible to support a variable input batch size without pre-allocating for the maximum batch size?
1. The data is always: args_map["data0"] = NDArray(Shape(max_batch_size, channel, height, width), global_ctx, false);
2. I use a fixed batch_size; if the number of inference images img_num < batch_size, the remaining (batch_size - img_num) slots are filled with dummy data. E.g. with batch_size = 16 and img_num = 10, the other 6 images can be any data. With a fixed batch_size, inference time is fixed.
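The padding scheme in point 2 can be sketched as follows: copy the img_num real images into a fixed-size batch buffer and zero-fill the rest, whose outputs are simply ignored (PadBatch is a hypothetical helper, not an MXNet API):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Pads a batch to a fixed batch_size: the first imgs.size() slots hold real
// images, the remainder is zero-filled filler whose outputs are discarded.
// image_len is the flattened size of one image (channel * height * width).
std::vector<float> PadBatch(const std::vector<std::vector<float>>& imgs,
                            std::size_t batch_size, std::size_t image_len) {
    std::vector<float> batch(batch_size * image_len, 0.0f);  // filler = zeros
    for (std::size_t i = 0; i < imgs.size() && i < batch_size; ++i)
        std::copy(imgs[i].begin(), imgs[i].end(),
                  batch.begin() + i * image_len);
    return batch;
}
```

The resulting vector is what would be fed to SyncCopyFromCPU for the fixed-size "data0".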
And it’s based on this model and example in Gluon (Python). See valid_length parameter in MeanPoolingLayer and you’ll see the equivalent usage of data1 in the C++ code. So in theory you could use a subset of the words for sentiment prediction, but in this case it’s set to be the number of words in the line to use them all.
I mean, could we make inference time scale with input size?
Example:
Assume the max batch size is 4 (“data0” == 4):
1 input takes 1 sec, 2 inputs take 1.2 sec, 4 inputs take 1.5 sec, 8 inputs take 3 sec.
What are the benefits?
Assume the maximum batch size is 10 and your actual input size is 5: no matter what your input size is, inference time stays the same, and you still need to feed a float vector sized for the fixed batch size of 10. Unless “data1” can tell the executor to handle an input size bigger than 10, I don’t see the benefit of declaring “data1”.
I checked last year; mxnet does not implement a ‘reshape’ function in C++ (ps: caffe has a reshape function, so batch_size can be changed before forward). You can check whether mxnet supports this now.
I only use data0.
Maybe data1 supports a different batch_size? Does this mean we can have 2 different batch_sizes and share one model?
Would this cause the GPU memory to be reallocated? If yes, it is not worth using unless it is very fast.
Using mxnet 1.3.1, reshape is not implemented yet:
// TODO(zhangchen-qinyinghua)
// To implement reshape function
void Reshape();
Not sure about the latest code, but when I tried to build it I ran into a lot of bugs on Windows, so I will wait until the next stable version is released.
I hope so; the examples do not mention this at all.
Edit: I checked the code on GitHub; reshape hasn’t been implemented yet. From a technical view, is it possible to reshape without memory reallocation? If not, what cost does it incur? Does it only reallocate the memory related to the input data, or does it need to reallocate memory for the whole network?
Edit: OpenCV 4 is planning to add CUDA support for the dnn module, so maybe in the future we will have a better choice for inference.
In my opinion, similar to caffe, it should be possible for mxnet to reshape without memory reallocation.
Without a reshape function, a different batch_size needs memory reallocation for the whole network, unless ‘data0’/‘data1’ support different batch_sizes, as mentioned above.
To me, a fixed size is fine. You can set a suitable batch_size: if the number of given samples > batch_size, run the forward pass several times; otherwise just use the fixed batch_size, which means some useless computation is performed.
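The "run the forward pass several times" scheme above amounts to ceil(n / batch_size) fixed-size passes, with the last pass padded with dummy samples whose results are discarded. A minimal sketch of the pass count (no MXNet involved; NumForwardPasses is a hypothetical helper):

```cpp
#include <cstddef>

// Number of fixed-size forward passes needed for n samples with a fixed
// batch_size: ceiling division. The last pass is padded with dummy samples.
std::size_t NumForwardPasses(std::size_t n, std::size_t batch_size) {
    return (n + batch_size - 1) / batch_size;  // ceil(n / batch_size)
}
```

So for batch_size = 16, anything from 1 to 16 samples costs one pass, 17 to 32 samples cost two passes, and so on, which matches the fixed per-pass timing described above.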
I tested batch_size = 4, 8, 16 and more: the bigger the batch_size, the smaller the mean forward time per sample (all_time/batch_size), but the difference is very small.
Compared to caffe, mxnet runs faster and uses less memory for the same network.