Convolutional Neural Networks (LeNet)


First of all, thank you for this great learning material!

In the chapter on the LeNet architecture you mention that your implementation matches the historical definition of LeNet-5 (Gradient-Based Learning Applied to Document Recognition) except for the last layer, but I found two other inconsistencies in subsection B. LeNet-5.

  • The LeNet paper does not describe the pooling layer as an average pooling layer, but rather as a layer that performs a summation over each 2x2 neighborhood of the input feature map, multiplies the sum by a trainable weight, adds a trainable bias, and finally passes the result through a sigmoidal function.

  • According to the LeNet paper, the activation function used in both the convolutional and fully connected layers is a scaled hyperbolic tangent, not the sigmoid used in the code. These two functions look similar but have different output ranges (the sigmoid maps to (0, 1), while the scaled tanh maps to roughly (-1.7159, 1.7159)).

If there is something I missed and your implementation of LeNet-5 is correct, please let me know.
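To make the first point concrete, here is a minimal NumPy sketch of the sub-sampling layer as the paper describes it (a single feature map only; `weight` and `bias` stand in for the trainable coefficient and bias, and the names are mine, not from the book's code):

```python
import numpy as np

def sigmoid(x):
    # Logistic squashing function
    return 1.0 / (1.0 + np.exp(-x))

def subsample(x, weight, bias):
    """LeNet-5 style sub-sampling on one feature map: sum each
    non-overlapping 2x2 neighborhood, scale by a trainable weight,
    add a trainable bias, then pass through a sigmoid."""
    h, w = x.shape
    pooled = x.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))
    return sigmoid(weight * pooled + bias)

out = subsample(np.ones((4, 4)), weight=0.25, bias=0.0)
print(out.shape)  # (2, 2)
```

Note that with `weight = 0.25` and `bias = 0` this reduces to average pooling followed by a sigmoid, which is presumably why many re-implementations simply substitute average pooling.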


Hey Martin,

Pooling was called sub-sampling in the original paper. According to page 6 of the paper:

"This can be achieved with a so-called subsampling layers which performs a local averaging and a subsampling, reducing the resolution of the feature map and reducing the sensitivity of the output to shifts and distortions"

Also, for tanh vs. sigmoid: tanh tends to converge faster than sigmoid (especially useful 20 years ago, when compute power was much more limited).

Hopefully it helps!

Just want to point out that the link to the Multilayer Perceptron on this page of the book is no longer available.

Thanks. Please refer to


Thanks for the learning material!

But I ran into a problem when using the code.

import d2l
from mxnet import autograd, gluon, init

# Save to the d2l package.
def train_ch5(net, train_iter, test_iter, num_epochs, lr, ctx=d2l.try_gpu()):
    net.initialize(force_reinit=True, ctx=ctx, init=init.Xavier())
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(),
                            'sgd', {'learning_rate': lr})
    animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer = d2l.Timer()
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)  # train_loss, train_acc, num_examples
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            # Here is the only difference compared to train_epoch_ch3
            X, y = X.as_in_context(ctx), y.as_in_context(ctx)
            with autograd.record():
                y_hat = net(X)
                l = loss(y_hat, y)
            l.backward()
            trainer.step(X.shape[0])
            metric.add(l.sum().asscalar(), d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_loss, train_acc = metric[0]/metric[2], metric[1]/metric[2]
            if (i+1) % 50 == 0:
                animator.add(epoch + i/len(train_iter),
                             (train_loss, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch+1, (None, None, test_acc))
    print('loss %.3f, train acc %.3f, test acc %.3f' % (
        train_loss, train_acc, test_acc))
    print('%.1f examples/sec on %s' % (metric[2]*num_epochs/timer.sum(), ctx))

when I try to run
train_ch5(net, train_iter, test_iter, num_epochs, lr)
there is always the traceback

Traceback (most recent call last):
  File "", line 63, in <module>
    train_ch5(net, train_iter, test_iter, num_epochs, lr)
  File "", line 50, in train_ch5
    metric.add(l.sum().asscalar(), d2l.accuracy(y_hat, y), X.shape[0])
TypeError: add() takes 2 positional arguments but 4 were given

But since the code already uses metric = d2l.Accumulator(3), how could it happen that add() only takes 2 arguments?

I just reran it and there was no error. This issue might be caused by a newer version of the MXNet operators. Did you install the numpy version of MXNet? If not, please refer to

In the implementation of the function evaluate_accuracy_gpu, can we replace
ctx = list(net.collect_params().values())[0].list_ctx()[0]
simply by
ctx = net[0].weight.list_ctx()[0] ?

@gold_piggy @mli
I think there is an error in the description about the output shape of 1st conv layer.
In the end of section 6.1.1,

The convolutional layer uses a kernel with a height and width of 5, which with only 2 pixels of padding in the first convolutional layer and none in the second convolutional layer leads to reductions in both height and width by 2 and 4 pixels, respectively.

the 1st conv layer actually has 2 pixels of padding on each side of the input, so I think there is no reduction in the 1st conv output (28 x 28).
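A quick sanity check with the standard output-size formula supports this (the helper `conv_out` is mine, not from the book's code):

```python
def conv_out(size, kernel, padding=0, stride=1):
    # Standard convolution output size:
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out(28, 5, padding=2))  # 1st conv layer: 28, no reduction
print(conv_out(14, 5, padding=0))  # 2nd conv layer: 10, reduced by 4
```

So with padding 2 and a 5x5 kernel, the first layer keeps the 28x28 spatial size; only the second (unpadded) conv layer shrinks its input by 4 pixels.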

Why does so much of this rely on the d2l package? This serves to make the lessons far less general.

Hi everyone!

First of all I’m really glad this forum exists! This is my first post and I’m looking forward to learning :slight_smile:

In this architecture, we get 10 output channels in the final layer.
I’ve checked the Fashion-MNIST dataset, and this fits the total number of classes:

  • 0 T-shirt/top
  • 1 Trouser
  • 2 Pullover
  • 3 Dress
  • 4 Coat
  • 5 Sandal
  • 6 Shirt
  • 7 Sneaker
  • 8 Bag
  • 9 Ankle boot

As one of the tips in the exercises suggests, one should try increasing the number of output channels beyond that, which I did. As a result, I actually get much better results than before. But what is actually happening here?

When I have 10 classes, each output channel is linked to one class – correct?
But if I have 20 output channels, would that mean that each class can be linked to an arbitrary number of output channels (>1)?

Thank you, and I hope my question doesn’t sound too dumb.