Improve performance in batch mode

In a single-GPU setup, is it possible to improve the performance of deploy mode (forward pass only) by feeding input vectors in batches? On CPU I can improve batch-mode performance by using a thread pool and/or SIMD-enabled libraries (such as Intel MKL).

You can cast your whole model and the data to “float16”, which can give almost a 1.5x performance improvement.
For more performance tips you can check out this
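
To make the two ideas concrete, here is a minimal sketch of batching the forward pass and casting to float16. The thread doesn't name a framework, so this uses NumPy as a stand-in; the single dense layer, the sizes, and the `forward` helper are all made up for illustration and are not anyone's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single dense layer standing in for a real model.
in_dim, out_dim, batch_size = 64, 16, 32
W = rng.standard_normal((in_dim, out_dim)).astype(np.float32)
b = rng.standard_normal(out_dim).astype(np.float32)

def forward(x, W, b):
    """Affine transform followed by ReLU; x is (in_dim,) or (N, in_dim)."""
    return np.maximum(x @ W + b, 0)

inputs = rng.standard_normal((batch_size, in_dim)).astype(np.float32)

# One sample at a time: batch_size separate matrix-vector products.
one_by_one = np.stack([forward(x, W, b) for x in inputs])

# Batched: a single matrix-matrix product over the whole batch.
batched = forward(inputs, W, b)

# Half precision: cast both the parameters and the data to float16.
W16, b16 = W.astype(np.float16), b.astype(np.float16)
batched_fp16 = forward(inputs.astype(np.float16), W16, b16)

# Largest absolute difference between the float32 and float16 outputs.
max_fp16_error = float(np.max(np.abs(batched - batched_fp16.astype(np.float32))))
```

In a GPU framework the same two steps apply: stack the inputs into one tensor so the device runs one large matrix product instead of many small ones, and cast both the model parameters and the inputs to half precision. How much you actually gain depends on the hardware and the model, and float16 does cost some numerical accuracy, so it is worth checking something like `max_fp16_error` on your own outputs.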

Recently @thomelane created an awesome tutorial on Tips and Tricks for Performance; you can check it out here