Weight Decay


In both train(lambd) and train_gluon(wd), animator.add(epoch + 1, …) should be changed to animator.add(epoch, …), because epoch starts from 1 in the for loop.
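If the loop really does start from 1, the two indexing conventions can be compared with a minimal sketch (this is illustrative code, not the book's actual training loop; num_epochs and the list names are made up):

```python
# Hedged sketch: how the x-coordinate passed to animator.add depends on
# where the epoch loop starts. Not the book's actual code.
num_epochs = 3

# Convention A: epoch runs 0, 1, 2 -> plot at epoch + 1
xs_a = [epoch + 1 for epoch in range(num_epochs)]

# Convention B: epoch runs 1, 2, 3 -> plot at epoch directly
xs_b = [epoch for epoch in range(1, num_epochs + 1)]

# Both conventions produce the same x-axis values 1..num_epochs,
# so the fix only matters if the loop's start index and the offset disagree.
print(xs_a, xs_b)
```

Either convention plots epochs at 1 through num_epochs; the bug would only appear if a loop starting at 1 is combined with the +1 offset, which shifts every point right by one.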

In 4.5.1, the stochastic gradient descent update looks a bit strange: shouldn’t the decay rate of w be controlled by \lambda alone? Why is the batch size involved here?
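The update in question can be sketched in code to see where each term comes from. In this sketch (names like eta, lam, and sgd_weight_decay_step are my own, not from the book), the data-loss gradient is averaged over the minibatch, while the penalty gradient lam * w is not, so the decay factor (1 - eta * lam) depends on \lambda alone:

```python
import numpy as np

def sgd_weight_decay_step(w, b, X_batch, y_batch, eta, lam):
    """One minibatch-SGD step with weight decay (illustrative sketch).

    Loss per minibatch: (1/|B|) * sum_i 0.5*(w.x_i + b - y_i)^2
                        + (lam/2) * ||w||^2
    """
    batch_size = X_batch.shape[0]
    residual = X_batch @ w + b - y_batch            # shape (|B|,)
    grad_w = X_batch.T @ residual / batch_size      # data gradient, averaged over |B|
    # The penalty gradient lam * w is NOT divided by the batch size,
    # so the shrink factor on w is (1 - eta * lam), independent of |B|.
    w_new = (1 - eta * lam) * w - eta * grad_w
    b_new = b - eta * residual.mean()               # bias is typically not decayed
    return w_new, b_new
```

With zero data gradient, the update reduces to w ← (1 − ηλ)w regardless of batch size, which is one way to see that |B| only scales the averaged data term, not the decay itself.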

I’d like to point out that there is a typo, circled in red in the bottom-right area of the screenshot below: