Run-time discrepancies of v0.9.5 vs. 0.11.0 on TX2

I see a large discrepancy in ResNet-18 run time on Jetson TX2 between MXNet v0.9.5 and v0.11.0 when running with batch = 1: 91 ms/frame vs. 124 ms/frame, respectively. nvvp shows that the same conv kernels are called in both versions, but they take ~2x longer in v0.11.0. Do you have any idea why that might happen?
Experimental setup:

  • Jetson TX2 with JetPack 3.1 (cuDNN 6.0, CUDA 8.0)
  • ResNet-18 based on the resnet.py symbol available from the repo
  • input resolution 640x480
  • batch = 1

Running with batch = 8 results in similar run times in both cases.

Could you please post a code snippet? That'll help us figure out what is going on. Also, did you compare with the latest version from Git?
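
For reference, something along these lines would make the numbers easy to reproduce on both versions. This is only a rough sketch, not the original poster's script: it assumes Python 3, and `benchmark`, `summarize`, `run_once`, and `sync` are hypothetical names made up for illustration. The dummy workload stands in for the actual ResNet-18 forward pass at 640x480, batch = 1. With MXNet, `sync` would presumably be something like `mx.nd.waitall`, so that asynchronous GPU work is included in each measurement.

```python
import statistics
import time


def benchmark(run_once, sync=lambda: None, warmup=10, iters=50):
    """Return per-call latencies in milliseconds for run_once().

    run_once: callable performing one forward pass (hypothetical hook).
    sync:     callable that blocks until pending work is done; for an
              asynchronous engine this must wait for the GPU (assumption:
              with MXNet this could be mx.nd.waitall).
    """
    # Warm-up calls are not timed: the first passes usually include
    # one-time setup costs that would distort the average.
    for _ in range(warmup):
        run_once()
    sync()

    timings_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        sync()  # include the time until the work has actually finished
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms):
    """Reduce a list of latencies to a few summary statistics."""
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
    }


if __name__ == "__main__":
    # Placeholder workload; replace with the real forward pass.
    def dummy_forward():
        sum(i * i for i in range(10000))

    stats = summarize(benchmark(dummy_forward))
    print("mean %.3f ms, median %.3f ms, min %.3f ms"
          % (stats["mean_ms"], stats["median_ms"], stats["min_ms"]))
```

Running the same script unchanged under v0.9.5 and v0.11.0, and reporting the median alongside the mean, should show whether the gap is consistent across frames or driven by a few slow iterations.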