[solved] Network in float16


I have a working network that processes images in float32, using the C++ Symbol API. I am now trying to convert the network to run in float16 (aka half-float). I am using the GPU for the computations.
After getting errors saying that convolution or batch normalization (for instance) can’t have mixed input types, I converted every input (including the kernel weights, biases, means, etc.) to float16, using the “Cast” Symbol. However, I now get “Check failed: e.node->is_variable() Mutation target can only be Variable”. So I conclude that the kernel symbol, which is a variable Symbol mapped to an NDArray, can’t be cast to float16. And I can’t find any way to feed the data directly in float16 (I can’t find anything like that in NDArray either).
So how should I go about this?


Hi @dmidge, unfortunately there is no documented way to do inference in fp16 in CPP at the moment. I know it’s a pretty big gap, and I hope it gets closed soon. Please register your interest on this GitHub issue: https://github.com/apache/incubator-mxnet/issues/14159

I think that, to do inference in fp16, you can do the following in Python:

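import mxnet as mx
from mxnet import gluon

ctx = mx.gpu()  # fp16 is only supported on GPU
net = gluon.model_zoo.vision.resnet18_v2(pretrained=True, ctx=ctx)
net.cast('float16')  # cast the network to fp16
export_net = gluon.nn.HybridSequential()
with export_net.name_scope():
    # first layer: cast the incoming fp32 data to fp16
    export_net.add(gluon.nn.HybridLambda(lambda F, x: F.cast(x, 'float16')))
    export_net.add(net)
export_net.hybridize()  # export() needs a hybridized block that has run once
export_net(mx.nd.ones((1,3,224,224), ctx=ctx))
export_net.export('my_model', 0)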

You can then use my_model-symbol.json and my_model-0000.params in CPP and feed that network fp32 data, which will be converted to fp16 in the first layer of the network.

Note that fp16 is only supported on GPU for now.

edit: I just re-read your question. The reason you get these errors is that some layers, typically batch norm, use fp32 for accumulation, so not all parameters need to be converted to fp16.
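
To illustrate why accumulating in fp16 can be a problem, here is a minimal pure-Python sketch. It does not use MXNet and is not how batch norm is implemented there; it only emulates a float16 accumulator with the standard `struct` module (the helper names `to_half` and `sum_in_half` are made up for this example).

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_in_half(values):
    """Sum values, rounding the running total to float16 after every add."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total


ones = [1.0] * 4096
fp16_sum = sum_in_half(ones)  # accumulator kept in float16
wide_sum = sum(ones)          # accumulator kept in a wider float type

print(fp16_sum, wide_sum)  # prints: 2048.0 4096.0
```

The float16 total stalls at 2048 because 2049 is not representable in half precision, so every further `+ 1` rounds back down. Statistics such as a mean or variance over many values run into the same limit, which is one reason to keep that kind of accumulation in fp32.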


Hi @ThomasDelteil,

Thank you for this information. I indeed didn’t see any C++ tutorial about that, nor any examples in the repository, so I suspected that was the reason. But thank you for this explanation! Despite my research, I missed the GitHub feature request.
I added a comment pointing to this forum thread.


Thanks to your post, I think I managed to get it working (at least partially) in float16. In my case, I needed to keep everything in the batchnorm in float32. However, I don’t fully understand why I have no control over the type used for the batchnorm. But at least, this matter seems solved.
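
For anyone who wants to apply the same rule by hand, here is a rough sketch of the idea: decide the storage type per parameter from its name and leave the batchnorm ones in float32. This is plain Python, not an MXNet API. The suffix list and the example parameter names are assumptions; check them against the names in your own symbol and params files.

```python
# Assumed name suffixes for batchnorm scale/shift and running statistics.
# These are illustrative only; adjust them to your own network's names.
FP32_SUFFIXES = (
    "_gamma", "_beta",
    "_running_mean", "_running_var",
    "_moving_mean", "_moving_var",
)


def plan_dtypes(param_names):
    """Map each parameter name to the dtype it would be stored in."""
    return {
        name: "float32" if name.endswith(FP32_SUFFIXES) else "float16"
        for name in param_names
    }


# Hypothetical parameter names for a tiny conv + batchnorm network.
names = [
    "conv0_weight", "conv0_bias",
    "batchnorm0_gamma", "batchnorm0_beta",
    "batchnorm0_running_mean", "batchnorm0_running_var",
]
plan = plan_dtypes(names)
for name in names:
    print(name, plan[name])
```

With such a plan, only the parameters mapped to float16 would be converted before being handed to the network, while the batchnorm ones stay as they are.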


I am working on offline conversion of fp32 models to fp16, which should be built on top of AMP support: https://github.com/apache/incubator-mxnet/issues/14584 . This will also add support for the CPP API and other frontends. Stay tuned.



Perfect, thank you!