Multiple weight decay rates


What is the best way (and is it possible) to define two different weight decay rates for two different layers (or blocks)? I don’t want to apply regularization to the first part of the network, but I’d like to have it in the second part.


The weight decay multiplier (wd_mult) is a per-parameter setting that is applied on top of the global wd value in the optimizer. You can set it in two ways:

  1. For a custom block, set wd_mult in get() when creating a new parameter (e.g. get('weight', init=…, wd_mult=…)).
  2. Use setattr on a pre-defined block's parameter (e.g. dense.weight.wd_mult = …). You can also use the parameter dict's setattr method, which applies to every parameter in the dict, to set more than one parameter at once (e.g. rnn.collect_params('.*_i2h_weight').setattr('wd_mult', …)). Note that the selector passed to collect_params is a regular expression, hence the leading '.*'.

Great, thanks for your help!

Does it mean the weight decay multiplier will override the global wd parameter?

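No, it does not override it. The multiplier is multiplied by the global wd value, so the effective weight decay for a parameter is wd * wd_mult.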
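
To make the arithmetic concrete, here is a small plain-Python sketch (no MXNet required). The helper names effective_wd and sgd_step are hypothetical, and the update rule is the usual SGD step with an L2 weight decay term, used here only as an illustration of where the product ends up.

```python
def effective_wd(global_wd, wd_mult):
    """Per-parameter weight decay: the global value scaled by the multiplier."""
    return global_wd * wd_mult


def sgd_step(weight, grad, lr, global_wd, wd_mult):
    """One plain SGD step with weight decay applied to a scalar weight."""
    wd = effective_wd(global_wd, wd_mult)
    return weight - lr * (grad + wd * weight)


global_wd = 1e-4

# Multiplier 0.0 switches weight decay off for that parameter;
# multiplier 1.0 leaves the global value unchanged.
no_decay = effective_wd(global_wd, 0.0)
full_decay = effective_wd(global_wd, 1.0)

# With a zero gradient, only the decay term can move the weight.
w_first = sgd_step(2.0, 0.0, lr=0.1, global_wd=global_wd, wd_mult=0.0)
w_second = sgd_step(2.0, 0.0, lr=0.1, global_wd=global_wd, wd_mult=1.0)
```

Here w_first stays at its initial value because its multiplier is 0, while w_second shrinks slightly toward zero.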
