Issue: Big accuracy difference when fine-tuning with vgg16 and resnet-50

You can check your network architecture like this:

mx.viz.print_summary(sym)
________________________________________________________________________________________________________________________
Layer (type)                                        Output Shape            Param #     Previous Layer                  
========================================================================================================================
data(null)                                                                  0                                           
________________________________________________________________________________________________________________________
conv1_1(Convolution)                                                        64          data                            
________________________________________________________________________________________________________________________
relu1_1(Activation)                                                         0           conv1_1                         
________________________________________________________________________________________________________________________
....
________________________________________________________________________________________________________________________
fc7(FullyConnected)                                                         4096        drop6                           
________________________________________________________________________________________________________________________
relu7(Activation)                                                           0           fc7                             
________________________________________________________________________________________________________________________
drop7(Dropout)                                                              0           relu7                           
________________________________________________________________________________________________________________________
fc8(FullyConnected)                                                         1000        drop7                           
________________________________________________________________________________________________________________________
prob(SoftmaxOutput)                                                         0           fc8                             
========================================================================================================================
Total params: 13416
________________________________________________________________________________________________________________________
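For reference, here is a minimal sketch of how to get that summary, assuming the pretrained checkpoint files (e.g. vgg16-symbol.json and vgg16-0000.params) are in the working directory; passing an input shape makes print_summary fill in output shapes and full parameter counts:

import mxnet as mx

# load the pretrained symbol and weights (assumed checkpoint prefix 'vgg16', epoch 0)
sym, arg_params, aux_params = mx.model.load_checkpoint('vgg16', 0)

# with a concrete data shape the summary also shows output shapes and exact param counts
mx.viz.print_summary(sym, shape={'data': (1, 3, 224, 224)})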

You want to use relu7 for your fine-tuning layer.

(new_sym, new_args) = get_fine_tune_model(sym, arg_params, num_classes, layer_name='relu7')
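For anyone else reading, get_fine_tune_model is roughly the helper from the MXNet fine-tuning tutorial; it cuts the pretrained symbol at layer_name and puts a fresh, randomly initialized classifier on top. A sketch:

def get_fine_tune_model(symbol, arg_params, num_classes, layer_name='flatten0'):
    all_layers = symbol.get_internals()
    net = all_layers[layer_name + '_output']   # output of the chosen cut point, e.g. relu7
    # new classification head for your num_classes, trained from scratch
    net = mx.symbol.FullyConnected(data=net, num_hidden=num_classes, name='fc1')
    net = mx.symbol.SoftmaxOutput(data=net, name='softmax')
    # keep the pretrained weights, dropping anything that would clash with the new fc1
    new_args = dict({k: arg_params[k] for k in arg_params if 'fc1' not in k})
    return (net, new_args)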

Result:

2018-05-28 18:57:46,908 Epoch[7] Batch [230]	Speed: 441.92 samples/sec	accuracy=0.884375
2018-05-28 18:57:48,405 Epoch[7] Batch [240]	Speed: 427.60 samples/sec	accuracy=0.934375
2018-05-28 18:57:48,406 Epoch[7] Train-accuracy=0.934375
2018-05-28 18:57:48,406 Epoch[7] Time cost=35.405
2018-05-28 18:58:00,677 Epoch[7] Validation-accuracy=0.725211

Yes! Thanks Tom!
Using relu7 with sgd and lr=0.001 works.

But I am curious: why does this happen?
I've added a dense layer (mapping to num_classes). Let the former network be F.
I believe there are 3 dense layers: fc6, fc7 and fc8.

ReLU just adds a non-linear transformation, so relu7_output = relu(F(x).fc7_output),
and then final_output = softmax(W*relu7_output + b).

Just using fc7, final_output = softmax(W*fc7_output + b).

What I believe is: just using fc7, the final 2 layers can be seen as one BIG W, so it should work just as well.
But in fact it's relu7 that works :frowning:
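To spell out what I mean by one BIG W: without a ReLU in between, two stacked affine layers collapse into a single affine layer. A tiny numeric check (the sizes are made up just for illustration, not the real VGG dimensions):

import numpy as np

# x plays the role of the drop6/relu6 output, W1/b1 the fc7 weights,
# W2/b2 the new classification head; sizes are made up for the example
x = np.random.randn(512)
W1, b1 = np.random.randn(512, 512), np.random.randn(512)
W2, b2 = np.random.randn(10, 512), np.random.randn(10)

two_layers = W2 @ (W1 @ x + b1) + b2           # fc7 followed directly by the new head
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)     # one combined "BIG W" and bias
print(np.allclose(two_layers, one_layer))      # True: same function, so no extra capacity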

Previously you were using flatten0 which is before any dense layers.
Then you have fc6 and fc7, which are fully connected layers with 4096 hidden units each.
fc8 is actually the classification layer, which has 1000 units (the number of classes in ImageNet 1k).

If you use flatten0 and then add your own classification layer, you are not using fc6 and fc7. For example, just looking at fc7, it has ~16M parameters. So you are throwing away a lot of pre-trained information if you use flatten0 rather than relu7.
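To make the fc7 number concrete, a back-of-the-envelope count for a 4096 -> 4096 fully connected layer:

# weight matrix plus bias for a 4096 -> 4096 fully connected layer (fc7)
in_units, out_units = 4096, 4096
fc7_params = in_units * out_units + out_units
print(fc7_params)   # 16781312, i.e. roughly 16.8M parameters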

Thanks for the explanation.
So now my understanding is that the right way to fine-tune is to replace only the last classification dense layer (fc8).
If you cut at flatten0, fc6 or fc7, the newly added layer doesn't have enough capacity to transform the information from the earlier layers into the new (much smaller than 1k) set of classes. I tested with relu6 and got bad performance.
So I believe the VGG team must have tested how many dense layers and how many neurons work best (at least for ImageNet). (I haven't seen any explanation of how and why the last 3 dense layers were chosen in the original paper: https://arxiv.org/pdf/1409.1556.pdf. Maybe I just missed it.)