Prune a trained model and retrain

Hi all,

I have a trained network (in fp32) and I want to optimize it for mobile devices.

I tried int8 quantization with the ncnn platform, which gives a 30% speedup. That is not very impressive, though, and the first and the last layer have to stay in floating point, otherwise the performance drop is massive. (By the way, should full int8 computation hurt performance this badly? The model size is around 20MB, and I have seen similarly sized models give good full-int8 performance.) So I’m now considering pruning my model.

I’ve gone through this forum, and the only information about pruning I found is that there are several pruned ResNet models using the Gluon API. However, my model uses the Module API and is not exactly a ResNet structure. So is there any guide for pruning a trained model (using the Module API) and then retraining it?

Moreover, in what order should quantization and pruning be applied? Quantization first or pruning first?

Any help or discussion is appreciated. Thanks.


Model pruning is very much an art rather than a science, so if you want to prune a custom model of your own, it will take a bit of time and experimentation, because you can’t just port the learnings from a paper. However, here’s a paper that delves a little bit into it.

With regard to order, I would say prune first and then quantize.
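
To make the prune-then-quantize order concrete, here is a minimal NumPy sketch that is not tied to MXNet or ncnn. It assumes plain magnitude pruning and symmetric per-tensor int8 quantization, which may not match what your toolchain does. The helper names (`magnitude_mask`, `masked_sgd_step`, `quantize_int8`) and the 50% sparsity are made up for illustration, and the random gradient is only a stand-in for a real backward pass.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    # k-th smallest absolute value; everything at or below it is pruned
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

def masked_sgd_step(weights, grad, mask, lr=0.01):
    """One retraining step; re-applying the mask keeps pruned weights at zero."""
    return (weights - lr * grad) * mask

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization; returns (int8 values, scale)."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)    # hypothetical trained layer

# 1) prune, 2) retrain with the mask held fixed, 3) quantize the result
mask = magnitude_mask(w, sparsity=0.5)
w_pruned = w * mask
grad = rng.normal(size=w.shape).astype(np.float32)  # stand-in for a real gradient
w_retrained = masked_sgd_step(w_pruned, grad, mask)
q, scale = quantize_int8(w_retrained)
```

With a symmetric scale, zero maps to zero, so the sparsity pattern from pruning survives quantization. Doing it the other way round would mean retraining after pruning on weights that are already quantized, which is harder to set up.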


Thank you for your reply. I’ve read papers regarding pruning and quantization but not this one. I’ll take a close look at it:)


I now understand which operations are needed to prune a trained model and then retrain it. Now the problem becomes how to implement those operations.

The only related coding resource I can find is the DSD training example on MLP. What would you suggest? Should I treat that as a starting point, or start from somewhere else?

Many thanks:)

It seems that DSD training’s final product is also a dense network, so it is a dead end.

Any other suggestions?

A different approach is to select an architecture of your choice that is optimized for mobile devices, like:

Afterwards, make use of network distillation by transferring the knowledge of your current trained network:

This is done by training the new network against the final feature representation of your current model instead of the actual labels, which typically yields higher performance than training on the labels alone.
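
As a rough illustration of what "training against the teacher's outputs" means, here is a small NumPy sketch of two possible distillation losses. The first matches the student's final outputs to the teacher's with a mean squared error, which is closest to the description above. The second is the commonly used soft-target variant with a temperature; it is a different formulation, shown only for comparison. The function names, the temperature of 4.0, and the random logits are all assumptions for the example, not part of any MXNet API.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def feature_matching_loss(student_out, teacher_out):
    """Mean squared error between student and teacher final outputs."""
    return float(np.mean((student_out - teacher_out) ** 2))

def soft_target_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-np.mean(np.sum(p_teacher * log_p_student, axis=1)))

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 10))  # stand-in for the trained network's outputs
student_logits = rng.normal(size=(8, 10))  # stand-in for the compact network's outputs

mse = feature_matching_loss(student_logits, teacher_logits)
soft = soft_target_loss(student_logits, teacher_logits)
```

In a real setup the student logits would come from the compact network's forward pass, and one of these losses (possibly mixed with the ordinary label loss) would be minimized with respect to the student's weights only.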

Thank you for your reply.

During development, I already used compact models. But when deploying to embedded devices, a compact dense model alone is not enough. The normal approach seems to be pruning and quantization.

And distillation also seems to produce a compact dense model, so it might not be good enough.

Anyway, thank you so much for your advice:)