TensorRT test lenet5 ONNX parser error

Hi, I installed MXNet 1.5.0 with TensorRT support and ran the tests in Python under incubator-mxnet/tests/python/tensorrt/.

Platform:
Ubuntu 18.04, CUDA 10.1, MXNet 1.5.0, TensorRT 5.0, GTX 1060 Ti.

However, I hit an error when running test_tensorrt_lenet5.py. I had already run lenet5_train.py to generate the model JSON and params files. The error message is as follows:

[21:39:50] /home/username/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:686: start to execute partition graph.
E
======================================================================
ERROR: Run LeNet-5 inference comparison between MXNet and TensorRT.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/username/.local/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "test_tensorrt_lenet5.py", line 93, in test_tensorrt_inference
    batch_size=batch_size, use_tensorrt=True)
  File "test_tensorrt_lenet5.py", line 48, in run_inference
    force_rebind=True)
  File "../incubator-mxnet/python/mxnet/symbol/symbol.py", line 1629, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (128, 1, 28, 28)
softmax_label: (128,)
force_rebind: True
Cannot parse ONNX into TensorRT Engine
-------------------- >> begin captured stdout << ---------------------
LeNet-5 test
Running inference in MXNet

--------------------- >> end captured stdout << ----------------------
-------------------- >> begin captured logging << --------------------
root: INFO: train-labels-idx1-ubyte.gz exists, skipping download
root: INFO: train-images-idx3-ubyte.gz exists, skipping download
root: INFO: t10k-labels-idx1-ubyte.gz exists, skipping download
root: INFO: t10k-images-idx3-ubyte.gz exists, skipping download
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 6.489s

FAILED (errors=1)

It looks like the ONNX model cannot be parsed into a TensorRT engine. The confusing part is that test_resnet18.py in the same folder runs successfully and everything looks fine. Are there any suggestions? Thanks a lot!
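For reference, the failing bind in run_inference boils down to roughly the following (a sketch of what I believe the test does; the checkpoint prefix 'lenet5' and the init_tensorrt_params helper are assumptions on my side):

import mxnet as mx

batch_size = 128
# Load the checkpoint written by lenet5_train.py (prefix/epoch are assumptions).
sym, arg_params, aux_params = mx.model.load_checkpoint('lenet5', 0)

# Partition the graph for TensorRT and bind on GPU; this is where
# "Cannot parse ONNX into TensorRT Engine" is raised.
trt_sym = sym.get_backend_symbol('TensorRT')
mx.contrib.tensorrt.init_tensorrt_params(trt_sym, arg_params, aux_params)
executor = trt_sym.simple_bind(ctx=mx.gpu(0),
                               data=(batch_size, 1, 28, 28),
                               softmax_label=(batch_size,),
                               grad_req='null',
                               force_rebind=True)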

@kellen

Thanks for the report. You're probably OK to try running inference with your own models if resnet18 is passing. Could you give a few other models a shot and see if you hit any issues?

The error with LeNet looks like it should be fairly reproducible. I'll try to dig into the root cause when I have some free time (others should feel free to investigate).

@kellen Thanks for the reply!

Yes, there are two issues I've run into:

  1. It said that TensorRT doesn’t support FP16:
[11:23:38] /home/usrname/incubator-mxnet/src/operator/subgraph/tensorrt/onnx_to_tensorrt.cc:136: TensorRT can't use fp16 on this platform

Does the TensorRT integration support FP16 now, or is this a hardware limitation? (A snippet for forcing FP32 is sketched after point 2 below.)

  2. I've switched to GluonCV for loading models, and when I test SSD with the model name ssd_512_resnet50_v1_coco, I get the following error:
[11:24:07] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:686: start to execute partition graph.
[libprotobuf ERROR google/protobuf/io/coded_stream.cc:207] A protocol message was rejected because it was too big (more than 67108864 bytes).  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Traceback (most recent call last):
  File "../incubator-mxnet/python/mxnet/symbol/symbol.py", line 1623, in simple_bind
    ctypes.byref(exe_handle)))
  File "../incubator-mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Could not parse ONNX from string

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_resnet18.py", line 107, in <module>
    test_tensorrt_resnet18_feature_vect(model_name)
  File "test_resnet18.py", line 71, in test_tensorrt_resnet18_feature_vect
    grad_req='null', force_rebind=True)
  File "../incubator-mxnet/python/mxnet/symbol/symbol.py", line 1629, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 512, 512)
force_rebind: True
Could not parse ONNX from string

It looks like the serialized model exceeds protobuf's default message size limit. Are there any suggestions for raising that limit? Thanks in advance!
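Back to point 1: as far as I can tell, FP16 use can be forced off through the contrib helper so TensorRT falls back to FP32 (an untested sketch, assuming mx.contrib.tensorrt.set_use_fp16 is the right switch):

import mxnet as mx

# Force FP32 engines; Pascal consumer GPUs (e.g. a GTX 1060) have very limited
# FP16 throughput, which may be why the "can't use fp16" warning appears.
mx.contrib.tensorrt.set_use_fp16(False)
print(mx.contrib.tensorrt.get_use_fp16())  # should print False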

@kellen Hi, I just found a more serious issue with Mask R-CNN. The model is mask_rcnn_resnet18_v1b_coco from GluonCV. The error message is as follows:

[14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:686: start to execute partition graph. 
[14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:300: Found a cycle when BFS from node maskrcnn0_rpn0_bboxcornertocenter0__minus0. Excluding nodes maskrcnn0_rpn0_bboxcornertocenter0__plus1, and retrying
[14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:300: Found a cycle when BFS from node maskrcnn0_rpn0_bboxcornertocenter0__minus0. Excluding nodes maskrcnn0_rpn0_bboxcornertocenter0_concat0, maskrcnn0_rpn0_bboxcornertocenter0__plus1, and retrying
[14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:300: Found a cycle when BFS from node maskrcnn0_rpn0_bboxcornertocenter0__plus0. Excluding nodes maskrcnn0_rpn0_bboxcornertocenter0__plus1, and retrying
[14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:300: Found a cycle when BFS from node maskrcnn0_rpn0_bboxcornertocenter0__plus0. Excluding nodes maskrcnn0_rpn0_bboxcornertocenter0_concat0, maskrcnn0_rpn0_bboxcornertocenter0__plus1, and retrying
[14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:300: Found a cycle when BFS from node maskrcnn0_rpn0_bboxcornertocenter0__minus1. Excluding nodes maskrcnn0_rpn0_bboxcornertocenter0__plus1, and retrying
[14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:300: Found a cycle when BFS from node maskrcnn0_rpn0_bboxcornertocenter0__minus1. Excluding nodes maskrcnn0_rpn0_bboxcornertocenter0_concat0, maskrcnn0_rpn0_bboxcornertocenter0__plus1, and retrying
Traceback (most recent call last):
  File "test_resnet18.py", line 107, in <module>
    test_tensorrt_resnet18_feature_vect(model_name, batch_shape)
  File "test_resnet18.py", line 65, in test_tensorrt_resnet18_feature_vect
    trt_sym = sym.get_backend_symbol('TensorRT')
  File "../incubator-mxnet/python/mxnet/symbol/symbol.py", line 2564, in get_backend_symbol
    check_call(_LIB.MXGenBackendSubgraph(self.handle, c_str(backend), ctypes.byref(out)))
  File "../incubator-mxnet/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [14:02:47] /home/usrname/incubator-mxnet/src/operator/subgraph/build_subgraph.cc:258: Check failed: excluded_node_id != static_cast<int>(snid) (152 vs. 152) : A cycle is found in the computational graph between nodes maskrcnn0_rpn0_bboxcornertocenter0__plus1 and maskrcnn0_rpn0_bboxcornertocenter0__plus1
Stack trace:
  [bt] (0) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x57) [0x7f9485773f47]
  [bt] (1) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::sg::LabelSubgraph(nnvm::Graph const&, std::shared_ptr<mxnet::op::SubgraphSelectorV2>, int, unsigned long, std::vector<std::shared_ptr<mxnet::op::BiDirectedNode>, std::allocator<std::shared_ptr<mxnet::op::BiDirectedNode> > > const&, std::vector<mxnet::op::BiDirectedNode*, std::allocator<mxnet::op::BiDirectedNode*> >*, std::unordered_set<mxnet::op::BiDirectedNode const*, std::hash<mxnet::op::BiDirectedNode const*>, std::equal_to<mxnet::op::BiDirectedNode const*>, std::allocator<mxnet::op::BiDirectedNode const*> >*)+0x1fa3) [0x7f9486a466b3]
  [bt] (2) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::sg::PreSelectSubgraphNodes(nnvm::Graph const&, std::shared_ptr<mxnet::op::SubgraphSelectorV2>, int, unsigned long, std::vector<std::shared_ptr<mxnet::op::BiDirectedNode>, std::allocator<std::shared_ptr<mxnet::op::BiDirectedNode> > > const&, std::vector<mxnet::op::BiDirectedNode*, std::allocator<mxnet::op::BiDirectedNode*> >*)+0x18a) [0x7f9486a4798a]
  [bt] (3) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::sg::SelectSubgraphNodes(nnvm::Graph*, std::shared_ptr<mxnet::op::SubgraphSelectorV2>, std::vector<std::shared_ptr<mxnet::op::BiDirectedNode>, std::allocator<std::shared_ptr<mxnet::op::BiDirectedNode> > > const&, std::vector<std::vector<mxnet::op::BiDirectedNode*, std::allocator<mxnet::op::BiDirectedNode*> >, std::allocator<std::vector<mxnet::op::BiDirectedNode*, std::allocator<mxnet::op::BiDirectedNode*> > > >*, std::vector<std::shared_ptr<mxnet::op::SubgraphSelectorV2>, std::allocator<std::shared_ptr<mxnet::op::SubgraphSelectorV2> > >*, mxnet::op::BiDirectedNode const*, unsigned long, unsigned long*)+0x15e) [0x7f9486a4847e]
  [bt] (4) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::sg::FindSubgraphs(nnvm::Graph*, mxnet::op::SubgraphProperty const&, std::vector<std::shared_ptr<mxnet::op::BiDirectedNode>, std::allocator<std::shared_ptr<mxnet::op::BiDirectedNode> > > const&, std::vector<std::vector<mxnet::op::BiDirectedNode*, std::allocator<mxnet::op::BiDirectedNode*> >, std::allocator<std::vector<mxnet::op::BiDirectedNode*, std::allocator<mxnet::op::BiDirectedNode*> > > >*, std::vector<std::shared_ptr<mxnet::op::SubgraphSelectorV2>, std::allocator<std::shared_ptr<mxnet::op::SubgraphSelectorV2> > >*)+0x49a) [0x7f9486a48ffa]
  [bt] (5) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::BuildSubgraph(nnvm::Graph&&)+0x3c7) [0x7f9486a4c3a7]
  [bt] (6) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<nnvm::Graph (nnvm::Graph), nnvm::Graph (*)(nnvm::Graph&&)>::_M_invoke(std::_Any_data const&, nnvm::Graph&&)+0x29) [0x7f9485c4bcd9]
  [bt] (7) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(nnvm::ApplyPasses(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+0x43a) [0x7f9488018b1a]
  [bt] (8) /home/usrname/incubator-mxnet/python/mxnet/../../build/libmxnet.so(nnvm::ApplyPass(nnvm::Graph, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x150) [0x7f94857fb4b0]

There seems to be something wrong in the subgraph partitioning.
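For reference, the partitioning step that triggers the check failure can be reproduced with roughly the following (an untested sketch; the export step and the 512x512 dummy input are assumptions):

import mxnet as mx
import gluoncv

# Export the GluonCV Mask R-CNN to a symbol/params checkpoint so it can be partitioned.
net = gluoncv.model_zoo.get_model('mask_rcnn_resnet18_v1b_coco', pretrained=True)
net.hybridize(static_alloc=True)
net(mx.nd.zeros((1, 3, 512, 512)))              # one forward pass so the graph is cached
net.export('mask_rcnn_resnet18_v1b_coco')

sym, arg_params, aux_params = mx.model.load_checkpoint('mask_rcnn_resnet18_v1b_coco', 0)
trt_sym = sym.get_backend_symbol('TensorRT')    # the cycle check fails here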

I've just opened an issue on GitHub with more detailed information.