Inference performance with float16 (fp16) on a GTX 1080 Ti

I’m unable to achieve any performance increase when using fp16 for inference. Some details:
Model: Gluon-based ResNet FCN (see here)
GPU: GTX 1080 Ti
mxnet version: mxnet-cu92 1.3.1
CUDA version: 9.2
cuDNN version: 7

Wondering what I’m missing…

My code is:

...
net = instantiate_model(model, params, stage, class_count, checkpoint_filepath, exec_contexts)
# net is a subclass of mxnet.gluon.Block, with context == gpu(0)
net.cast(np.float16)
...

        # images is a list of PIL images.
        image_count = len(images)
        original_size = images[0].size
        batch_data = [] 
        for image in images:
            # Remove alpha channel if necessary.
            if image.mode == 'RGBA':
                r, g, b, a = image.split()
                image = Image.merge('RGB', (r, g, b))
            image = np.array(image).astype(np.float32)
            image = mx.nd.array(image, ctx=self.exec_contexts[0]) # self.exec_contexts[0] == gpu(0)
            image = mx.image.color_normalize(image, self.channelMeans, self.channelStdDevs)
            image = mx.nd.transpose(image, (2, 0, 1)) # (h, w, c) => (c, h, w)
            image = image.astype(np.float16).expand_dims(axis=0)
            batch_data.append(image)
        
        batch_data = mx.ndarray.concat(*batch_data, dim=0)
        
        pred = self.net(batch_data)
        
        # Same resample function as used in training/validation.
        pred = mx.nd.contrib.BilinearResize2D(pred, original_size[1], original_size[0])
        
        labels = []
        for idx in range(image_count):
            label = np.uint8(np.squeeze(pred[idx].asnumpy().argmax(axis=0)))
            label = Image.fromarray(label)
            label.putpalette(self.palette)
            labels.append(label)

Hi @OliverColeman,

I would suggest testing your network in isolation, with and without float16. The lack of improvement might be masking the fact that most of the time is spent elsewhere, for example in the image preprocessing.
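
Here is a minimal sketch of what I mean (the `prepare_batch` helper is hypothetical and just stands in for the per-image loop in your code above):

import time
import mxnet as mx

# Time the preprocessing on its own.
# `prepare_batch` is a hypothetical helper wrapping your per-image loop,
# `images` is your list of PIL images.
tic = time.time()
batch_data = prepare_batch(images)
mx.nd.waitall()  # flush any pending GPU work before reading the clock
print('preprocessing', time.time() - tic)

# Time the forward pass on its own.
tic = time.time()
pred = net(batch_data)
pred.wait_to_read()  # block until the forward pass has finished
print('forward pass', time.time() - tic)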

Note that only some GPU architectures have the NVIDIA “Tensor Cores” which benefit the most from fp16 optimization. Check whether your card has them.
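
A quick way to check is to print the compute capability, for example with pycuda (assuming you have it installed); Tensor Cores are only present on compute capability 7.0 and above (Volta and newer):

import pycuda.driver as drv

drv.init()
major, minor = drv.Device(0).compute_capability()
print('compute capability: %d.%d' % (major, minor))
# 7.0+ (Volta and newer) => Tensor Cores available
# 6.1 (GTX 1080 / 1080 Ti, Pascal) => no Tensor Cores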

Can you run the script below and report the times?

# fp32
import time
import mxnet as mx  # `net` is assumed to be your model, already loaded on the GPU

net.hybridize(static_alloc=True, static_shape=True)
data = mx.nd.ones((1, 3, 500, 500), ctx=mx.gpu())
net(data).wait_to_read()  # warm-up
tic = time.time()
for i in range(100): 
    out = net(data)
    out.wait_to_read()
out.asnumpy() # Check for crashes
print('fp32', time.time()-tic)

# fp16
net.cast('float16')
net.hybridize(static_alloc=True, static_shape=True)
data = mx.nd.ones((1, 3, 500, 500), ctx=mx.gpu(), dtype='float16')
net(data).wait_to_read()  # warm-up
tic = time.time()
for i in range(100): 
    out = net(data)
    out.wait_to_read()
out.asnumpy() # check for crashes
print('fp16', time.time()-tic)

Hi @ThomasDelteil,
Thanks for the rapid response. I ran the suggested script; here are the numbers I get:

fp32 7.452287197113037
fp16 6.908765554428101

I was looking at this article but didn’t read far enough to get to the part where it explains that, although the Pascal architecture used in the GTX 1080 does technically support fp16 natively, the actual Pascal chip used in the 1080 (GP104) has a severely limited number of fp16-capable cores. Pascal uses a combined 16/32-bit core that can, with the right software support, perform two fp16 ops in parallel, but GP104 has very few of these, whereas the Pascal-based GP100 used in HPC cards uses them throughout (and the GP102 in the 1080 Ti is limited in the same way). So, apologies for not investigating more thoroughly whether my hardware would even properly support fp16.
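
For completeness, plugging the timings above into a quick check confirms how marginal the gain is:

# Speedup implied by the benchmark timings above (100 forward passes each)
fp32_time = 7.452287197113037
fp16_time = 6.908765554428101
print('fp16 speedup: %.2fx' % (fp32_time / fp16_time))  # ~1.08x, i.e. essentially no acceleration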
