Inference performance with float16 (fp16) on a GTX 1080 Ti

I’m unable to achieve any performance increase when using fp16 for inference. Some details:
Model: Gluon-based ResNet FCN (see here)
GPU: GTX 1080 Ti
mxnet version: mxnet-cu92 1.3.1
CUDA version: 9.2
cuDNN version: 7

Wondering what I’m missing…

My code is:

...
net = instantiate_model(model, params, stage, class_count, checkpoint_filepath, exec_contexts)
# net is a subclass of mxnet.gluon.Block, with context == gpu(0)
net.cast(np.float16)
...

        # images is a list of PIL images.
        image_count = len(images)
        original_size = images[0].size
        batch_data = [] 
        for image in images:
            # Remove alpha channel if necessary.
            if image.mode == 'RGBA':
                r, g, b, a = image.split()
                image = Image.merge('RGB', (r, g, b))
            image = np.array(image).astype(np.float32)
            image = mx.nd.array(image, ctx=self.exec_contexts[0]) # self.exec_contexts[0] == gpu(0)
            image = mx.image.color_normalize(image, self.channelMeans, self.channelStdDevs)
            image = mx.nd.transpose(image, (2, 0, 1)) # (h, w, c) => (c, h, w)
            image = image.astype(np.float16).expand_dims(axis=0)
            batch_data.append(image)
        
        batch_data = mx.ndarray.concat(*batch_data, dim=0)
        
        pred = self.net(batch_data)
        
        # Same resample function as used in training/validation.
        pred = mx.nd.contrib.BilinearResize2D(pred, original_size[1], original_size[0])
        
        labels = []
        for idx in range(image_count):
            label = np.uint8(np.squeeze(pred[idx].asnumpy().argmax(axis=0)))
            label = Image.fromarray(label)
            label.putpalette(self.palette)
            labels.append(label)

Hi @OliverColeman,

I would suggest testing your network in isolation, with and without float16. The lack of improvement might be masking the fact that most of the time is spent elsewhere, for example in the image preprocessing.
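
Here is a minimal sketch of what I mean (the `prepare_batch` helper is hypothetical and just stands in for the per-image loop in your code above):

import time
import mxnet as mx

# Time the preprocessing on its own.
# `prepare_batch` is a hypothetical helper wrapping your per-image loop,
# `images` is your list of PIL images.
tic = time.time()
batch_data = prepare_batch(images)
mx.nd.waitall()  # flush any pending GPU work before reading the clock
print('preprocessing', time.time() - tic)

# Time the forward pass on its own.
tic = time.time()
pred = net(batch_data)
pred.wait_to_read()  # block until the forward pass has finished
print('forward pass', time.time() - tic)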

Note that only some GPU architectures have the NVIDIA “Tensor Cores” which benefit the most from fp16 optimization. Check whether your card has them.
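
A quick way to check is to print the compute capability, for example with pycuda (assuming you have it installed); Tensor Cores are only present on compute capability 7.0 and above (Volta and newer):

import pycuda.driver as drv

drv.init()
major, minor = drv.Device(0).compute_capability()
print('compute capability: %d.%d' % (major, minor))
# 7.0+ (Volta and newer) => Tensor Cores available
# 6.1 (GTX 1080 / 1080 Ti, Pascal) => no Tensor Cores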

Can you run the script below and report the times?

# fp32
import time
import mxnet as mx  # `net` is assumed to be your model, already loaded on the GPU

net.hybridize(static_alloc=True, static_shape=True)
data = mx.nd.ones((1, 3, 500, 500), ctx=mx.gpu())
net(data).wait_to_read()  # warm-up
tic = time.time()
for i in range(100): 
    out = net(data)
    out.wait_to_read()
out.asnumpy() # Check for crashes
print('fp32', time.time()-tic)

# fp16
net.cast('float16')
net.hybridize(static_alloc=True, static_shape=True)
data = mx.nd.ones((1, 3, 500, 500), ctx=mx.gpu(), dtype='float16')
net(data).wait_to_read()  # warm-up
tic = time.time()
for i in range(100): 
    out = net(data)
    out.wait_to_read()
out.asnumpy() # check for crashes
print('fp16', time.time()-tic)

Hi @ThomasDelteil,
Thanks for the rapid response. I ran the suggested script; here are the numbers I get:

fp32 7.452287197113037
fp16 6.908765554428101

I was looking at this article but didn’t read far enough to get to the part where it explains that, although the Pascal architecture used in the GTX 1080 does technically support fp16 natively, the actual Pascal chip used in the 1080 (GP104) has a severely limited number of fp16-capable cores. Pascal uses a combined 16/32-bit core that can, with the right software support, perform two fp16 ops in parallel, but GP104 has very few of these, whereas the Pascal-based GP100 used in HPC cards uses them throughout (and the GP102 in the 1080 Ti is limited in the same way). So, apologies for not investigating more thoroughly whether my hardware would even properly support fp16.
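
For completeness, plugging the timings above into a quick check confirms how marginal the gain is:

# Speedup implied by the benchmark timings above (100 forward passes each)
fp32_time = 7.452287197113037
fp16_time = 6.908765554428101
print('fp16 speedup: %.2fx' % (fp32_time / fp16_time))  # ~1.08x, i.e. essentially no acceleration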
