Hi all,
I am running the insightface model from GitHub with pretrained weights, for inference only. I wrote my own code that detects and recognizes faces in 300 photos in a directory, and it works fine. However, when I call the same detection and recognition functions from a Flask server script, only the first call works; the second detection call raises an error and the whole process crashes.
My software configuration is CUDA 9.0, cuDNN 7, MXNet 1.3.0, Python 2.7, Ubuntu 16.04.4, kernel 4.4.0-130-generic.
One of two different errors occurs. The detailed error messages are shown below; any help is appreciated!
terminate called after throwing an instance of 'dmlc::Error'
what(): [16:30:46] src/engine/./threaded_engine.h:379: array::at: __n (which is 1852990827) >= _Nm (which is 7)
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 9 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7fc666f18dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7fc666f19938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xfa9) [0x7fc669ac6849]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7fc669add31b]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7fc669add58e]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7fc669ac578a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fc67e77fc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fc69d0f16ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fc69ce2741d]
Aborted (core dumped)
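The message above suggests setting MXNET_ENGINE_TYPE to NaiveEngine for debugging. A minimal sketch of doing that from Python is below, assuming the variable has to be set before mxnet is first imported in the server process. The helper names enable_naive_engine and restore_engine are hypothetical and not part of MXNet.

```python
import os


def enable_naive_engine():
    """Request MXNet's synchronous NaiveEngine; return the previous setting."""
    previous = os.environ.get("MXNET_ENGINE_TYPE")
    os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return previous


def restore_engine(previous):
    """Put MXNET_ENGINE_TYPE back to what it was before debugging."""
    if previous is None:
        os.environ.pop("MXNET_ENGINE_TYPE", None)
    else:
        os.environ["MXNET_ENGINE_TYPE"] = previous


# Assumption: this runs at the very top of the server script,
# before any module imports mxnet.
previous_engine = enable_naive_engine()
# import mxnet as mx   # import only after the variable is set
```

Calling restore_engine(previous_engine) after the debugging session follows the reminder in the message to set the variable back to empty. The same effect can be had by exporting the variable in the shell before starting the server.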
The second possible error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "server.py", line 118, in login
    login_res, message = face_verification(file_path, regis_path, username)
  File "server.py", line 14, in face_verification
    result, data = server_function.verify(embedding_dir, photo_dir, login_id)
  File "/home/wenbin/project/mxnet_faceID/server_function.py", line 88, in verify
    img_tmp = model.get_input(image)
  File "/home/wenbin/project/mxnet_faceID/face_model.py", line 71, in get_input
    ret = self.detector.detect_face(face_img, det_type = self.args.det)
  File "/home/wenbin/project/mxnet_faceID/mtcnn_detector.py", line 493, in detect_face
    output = self.LNet.predict(input_buf)
  File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/model.py", line 717, in predict
    o_list.append(o_nd[0:real_size].asnumpy())
  File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/wenbin/.local/lib/python2.7/site-packages/mxnet/base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [09:02:17] src/operator/nn/./cudnn/cudnn_convolution-inl.h:156: Check failed: e == CUDNN_STATUS_SUCCESS (7 vs. 0) cuDNN: CUDNN_STATUS_MAPPING_ERROR
Stack trace returned 10 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f85238d2dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f85238d3938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::op::CuDNNConvolutionOp<float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x389) [0x7f8527d34829]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::op::ConvolutionCompute<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xbfc) [0x7f8527d29bec]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x59) [0x7f8525e763f9]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(+0x317c8d3) [0x7f8525e228d3]
[bt] (6) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f8526480185]
[bt] (7) /home/wenbin/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f852649731b]
[bt] (8) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f852649758e]
[bt] (9) /home/wenbin/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f852647f78a]
192.168.205.142 - - [30/Oct/2018 09:02:17] "POST /login HTTP/1.1" 500 -
[09:02:17] src/resource.cc:262: Ignore CUDA Error [09:02:17] src/storage/./pooled_storage_manager.h:85: CUDA: an illegal memory access was encountered
Stack trace returned 10 entries:
[bt] (0) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f85238d2dcb]
[bt] (1) /home/wenbin/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f85238d3938]
[bt] (2) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x95) [0x7f85264a3815]
[bt] (3) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFree(mxnet::Storage::Handle)+0x3d) [0x7f85264a61bd]
[bt] (4) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::StorageImpl::DirectFree(mxnet::Storage::Handle)+0x68) [0x7f852649f418]
[bt] (5) /home/wenbin/mxnet/lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::resource::ResourceManagerImpl::ResourceTempSpace::~ResourceTempSpace()::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0xff) [0x7f852656e90f]
[bt] (6) /home/wenbin/mxnet/lib/libmxnet.so(+0x37dfe01) [0x7f8526485e01]
[bt] (7) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f8526480185]
[bt] (8) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x65) [0x7f852649b085]
[bt] (9) /home/wenbin/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x1b0) [0x7f8526486400]