We know that CUDNN_AUTOTUNE_DEFAULT is set to 1 by default. As documented here: https://mxnet.incubator.apache.org/faq/env_var.html, when it is set to 1 MXNet chooses the best/fastest algo for Convolution/Deconvolution operators by running performance tests. Once an algo is chosen, it is cached keyed on the specific input shape, output shape, and weight shape, among other factors (compute dtype, compute capability, etc.). This avoids rerunning the performance tests when the same input, output, and weight shapes are used again, but the algo selection is triggered anew for any combination of input shape, weight shape, and output shape that hasn't been seen before.
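The caching behavior described above can be sketched with a toy stand-in (this is not MXNet internals, just an illustration of a cache keyed on shapes; the `select_algo` function and the key layout are hypothetical):

```python
# Toy illustration: autotune results cached by a key built from the
# input/weight/output shapes. A shape combination seen before skips
# the expensive benchmark run; a new combination triggers it again.
algo_cache = {}

def select_algo(input_shape, weight_shape, output_shape):
    key = (input_shape, weight_shape, output_shape)
    if key in algo_cache:
        return algo_cache[key], False  # cache hit: no benchmark run
    # Stand-in for running cuDNN performance tests over candidate algos
    best = "algo_for_" + "x".join(map(str, input_shape))
    algo_cache[key] = best
    return best, True                  # cache miss: benchmark ran

# Repeated shapes hit the cache; a new input shape benchmarks again.
_, ran1 = select_algo((1, 3, 224, 224), (64, 3, 7, 7), (1, 64, 112, 112))
_, ran2 = select_algo((1, 3, 224, 224), (64, 3, 7, 7), (1, 64, 112, 112))
_, ran3 = select_algo((1, 3, 300, 300), (64, 3, 7, 7), (1, 64, 150, 150))
```

Here `ran1` and `ran3` are True (new shape combinations) while `ran2` is False (cached), which is why workloads with many distinct shapes pay the benchmark cost over and over.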

For a use case where the input, output, and weight shapes vary widely across forward calls, the cuDNN algo selection and its performance tests run too often and become a performance bottleneck. In this case, we found that performance (latency per forward call) was much better with CUDNN_AUTOTUNE_DEFAULT set to 0.
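For reference, a minimal way to disable autotuning is to set the environment variable before MXNet is imported (the variable name documented on the env_var FAQ page linked above is MXNET_CUDNN_AUTOTUNE_DEFAULT; setting it in Python only takes effect if done before the library loads):

```python
import os

# Disable cuDNN autotune benchmarking; must be set before importing mxnet.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# import mxnet as mx  # import only after the variable is set
```

Equivalently, it can be set in the shell: `MXNET_CUDNN_AUTOTUNE_DEFAULT=0 python script.py`.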

Starting this thread to capture other use cases where it was better to set CUDNN_AUTOTUNE_DEFAULT to 0.

Thanks for reporting this @anirudh2290. I have had the same experience: running the benchmark can slow down the overall execution time.