Hi,
I am trying to train a Faster R-CNN on a custom dataset using the script provided on the GluonCV site.
I am using an RTX 2070 with 8 GB of memory, in a Docker container with CUDA 10.2 and the git version of GluonCV,
since the pip version would yield an import error for the Faster R-CNN model.
The training runs normally for a seemingly random number of epochs, and then I get the following error:
mxnet.base.MXNetError: [18:13:30] src/operator/random/./../tensor/./broadcast_reduce-inl.cuh:554: Check failed: err == cudaSuccess (2 vs. 0) : Name: reduce_kernel ErrStr:out of memory
I am using batch_size=1 with disable-hybridization, and I reduced short to 600 in order to minimize memory usage.
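For reference, my invocation looks roughly like the sketch below. The flag names are from the version of the script I am running and may differ in newer revisions; the short=600 change was made in the training transform rather than via a flag:

```shell
# Rough sketch of how I launch training (flag names may vary by script revision).
# short=600 is set inside the script's train transform, not on the command line.
python train_faster_rcnn.py \
    --dataset custom \
    --gpus 0 \
    --batch-size 1 \
    --disable-hybridization
```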
This problem is really strange, because last year I was able to train Faster R-CNN on the same hardware with the same dataset. At that time I used the previous version of the script (I have noticed that it gets updated from time to time).
So, what might be the problem?