Check failed: e == CUDNN_STATUS_SUCCESS (7 vs. 0) cuDNN: CUDNN_STATUS_MAPPING_ERROR

Hi all,

I was running the insightface model from GitHub with pretrained weights for inference. I wrote my own code to detect and recognize 300 photos inside a directory, and it works fine. However, when I call the detection and recognition functions from a server script, only the first call succeeds; the second detection call raises an error and the whole process crashes.

My software configuration is CUDA 9.0, cuDNN 7, MXNet 1.3.0, Python 2.7, Ubuntu 16.04.4, kernel 4.4.0-130-generic.

Two different errors can occur. The detailed error messages are shown below; any help is appreciated!

terminate called after throwing an instance of 'dmlc::Error'
  what():  [16:30:46] src/engine/./threaded_engine.h:379: array::at: __n (which is 1852990827) >= _Nm (which is 7)
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 9 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7fc666f18dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fc666f19938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xfa9) [0x7fc669ac6849]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7fc669add31b]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7fc669add58e]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7fc669ac578a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fc67e77fc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fc69d0f16ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fc69ce2741d]


Aborted (core dumped)
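(The debugging hint in the engine message above amounts to forcing the synchronous NaiveEngine before MXNet is loaded, roughly as in this untested sketch; the environment variable has to be set before mxnet is imported:)

    import os

    # Force synchronous execution so the failing operator shows up directly
    # in the backtrace, as suggested by the engine error message above.
    os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

    import mxnet as mx  # the engine type is read when the engine is first created

    # ... run the detection / recognition code as usual, e.g. under gdb ...
    # Remember to unset MXNET_ENGINE_TYPE again after debugging.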

Another possible error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "server.py", line 118, in login
    login_res, message = face_verification(file_path, regis_path, username)
  File "server.py", line 14, in face_verification
    result, data = server_function.verify(embedding_dir, photo_dir, login_id)
  File "/home/wenbin/project/mxnet_faceID/server_function.py", line 88, in verify
    img_tmp = model.get_input(image)
  File "/home/wenbin/project/mxnet_faceID/face_model.py", line 71, in get_input
    ret = self.detector.detect_face(face_img, det_type = self.args.det)
  File "/home/wenbin/project/mxnet_faceID/mtcnn_detector.py", line 493, in detect_face
    output = self.LNet.predict(input_buf)
  File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/model.py", line 717, in predict
    o_list.append(o_nd[0:real_size].asnumpy())
  File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [09:02:17] src/operator/nn/./cudnn/cudnn_convolution-inl.h:156: Check failed: e == CUDNN_STATUS_SUCCESS (7 vs. 0) cuDNN: CUDNN_STATUS_MAPPING_ERROR

Stack trace returned 10 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f85238d2dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f85238d3938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::op::CuDNNConvolutionOp<float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x389) [0x7f8527d34829]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::op::ConvolutionCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xbfc) [0x7f8527d29bec]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x59) [0x7f8525e763f9]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(+0x317c8d3) [0x7f8525e228d3]
[bt] (6) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f8526480185]
[bt] (7) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f852649731b]
[bt] (8) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f852649758e]
[bt] (9) /home/wenbin/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f852647f78a]


192.168.205.142 - - [30/Oct/2018 09:02:17] "POST /login HTTP/1.1" 500 -
[09:02:17] src/resource.cc:262: Ignore CUDA Error [09:02:17] src/storage/./pooled_storage_manager.h:85: CUDA: an illegal memory access was encountered

Stack trace returned 10 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f85238d2dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f85238d3938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x95) [0x7f85264a3815]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFree(mxnet::Storage::Handle)+0x3d) [0x7f85264a61bd]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::StorageImpl::DirectFree(mxnet::Storage::Handle)+0x68) [0x7f852649f418]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::resource::ResourceManagerImpl::ResourceTempSpace::~ResourceTempSpace()::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0xff) [0x7f852656e90f]
[bt] (6) /home/wenbin/mxnet/lib/libmxnet.so(+0x37dfe01) [0x7f8526485e01]
[bt] (7) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f8526480185]
[bt] (8) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x65) [0x7f852649b085]
[bt] (9) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x1b0) [0x7f8526486400]

How do you call MXNet from your server script? Is each thread calling MXNet directly? If so, that could be the problem.
You can also find some more information here: https://github.com/apache/incubator-mxnet/issues/3946

Thank you for your reply.
What I have is a server script that takes an input image and a user ID. It does not contain any MXNet code at all, but it imports another Python file. That second file initializes the model at import time (so the model is initialized only once and is always ready as long as the server is running) and contains the functions that actually run inference to detect faces and generate embeddings.
I think this might be Flask's problem, because I am not very familiar with Flask: how it handles requests, and whether it creates different threads to handle multiple requests (even if the second request only arrives after the first one has finished).
I've looked into the link you provided. Basically, I have to ensure that only one thread uses the model; if multiple threads try to use the same model, there will be trouble.
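Reading it that way, one option would be to guard every call into the shared model with a lock. This is only an untested sketch under that assumption; server_function.verify is the function from the traceback above, while safe_verify and the lock wrapper are hypothetical additions:

    import threading

    import server_function  # the module that builds the model once at import time

    # One lock shared by all request handlers, so only one thread at a time
    # can reach the MXNet model.
    _model_lock = threading.Lock()

    def safe_verify(embedding_dir, photo_dir, login_id):
        # Serialize every inference call on the shared model.
        with _model_lock:
            return server_function.verify(embedding_dir, photo_dir, login_id)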
I'll get a bit more familiar with Flask and see if I can solve this problem.
Once I’ve made any progress, I’ll let you know.

You are right, it is a multithreading problem.

Flask uses multithreading by default. Once I set the threading parameter to false, the server script works properly without any problems.
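For anyone else hitting this, the change boils down to roughly the following (a minimal sketch; the /login route name comes from the traceback above, the host/port values are placeholders, and the key part is threaded=False passed to app.run()):

    from flask import Flask

    app = Flask(__name__)

    @app.route('/login', methods=['POST'])
    def login():
        # ... face detection / verification against the single shared model ...
        return 'ok'

    if __name__ == '__main__':
        # threaded=False keeps the development server single-threaded, so only
        # one request at a time can reach the shared MXNet model.
        app.run(host='0.0.0.0', port=5000, threaded=False)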

Thanks!

I have the same error, which is "mxnet.base.MXNetError: [20:27:51] src/operator/nn/./cudnn/cudnn_convolution-inl.h:159: Check failed: e == CUDNN_STATUS_SUCCESS (7 vs. 0) cuDNN: CUDNN_STATUS_MAPPING_ERROR", and I do not run my program with multithreading. I have two scripts; one works fine and the other gets the error above. What does that mean?

@hellyo can you provide a small reproducible example of your code that is crashing?