CNN and invariance to feature translation on the image

I’m playing with Convolutional NNs (C++) in order to apply them to some structural biology problems. I started with the “classical” MNIST digit recognition, based on the lenet_with_mxdataiter.cpp example from the mxnet package.
My understanding is that a CNN should be invariant to translation of a feature inside the image. (In MNIST, for example, it should recognize not only digits that are centered but also shifted ones.)
So I’ve created a small additional test set: I took one image from the original test set and generated variants of it by shifting the digit into the corners of the image. (I kept the same 28x28 image size for now.)
When I run the example, I consistently get very bad recognition performance on this modified set, while the centered (original) digit is always recognized correctly.

Any ideas or thoughts on what I am missing? (I can provide a complete example if needed, but as I mentioned, this is a slight modification of the standard lenet_with_mxdataiter.cpp file from the package.)


Convolutional filters are applied as a sliding window, which means that wherever your object is in the input, it will generate the same activation, just at the corresponding location of the feature map. That’s the translation-invariant part (strictly speaking, the convolution is translation equivariant: the activation moves along with the object). However, if you keep spatial resolution by having feature maps bigger than 1x1 at the last layer before the fully connected layer, the fully connected layer will encode that spatial information in its weights.
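To make that concrete, here is a minimal pure-Python sketch (not taken from the mxnet example; the 3x3 pattern, the 8x8 canvas and all helper names are made up for illustration). It slides one filter over the same pattern placed at two positions: the peak activation is identical, but it sits at a different location, so the flattened vector that a fully connected layer would see is different.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation of two 2D lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def place(pattern, size, top, left):
    """Put a small pattern on an otherwise empty size x size canvas."""
    canvas = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            canvas[top + i][left + j] = value
    return canvas


def argmax2d(fmap):
    """Return (peak value, (row, col)) of the first maximum in a 2D list."""
    best = max(max(row) for row in fmap)
    for r, row in enumerate(fmap):
        for c, value in enumerate(row):
            if value == best:
                return best, (r, c)


# Hypothetical toy "feature" and a filter that matches it exactly.
pattern = [[1, 0, 1],
           [0, 1, 0],
           [1, 0, 1]]
kernel = pattern

centered = conv2d_valid(place(pattern, 8, 2, 2), kernel)
shifted = conv2d_valid(place(pattern, 8, 5, 4), kernel)

peak_c, pos_c = argmax2d(centered)
peak_s, pos_s = argmax2d(shifted)

# What a fully connected layer would receive after flattening.
flat_c = [v for row in centered for v in row]
flat_s = [v for row in shifted for v in row]

print(peak_c, pos_c)     # same peak value ...
print(peak_s, pos_s)     # ... at a different location
print(flat_c == flat_s)  # the flattened inputs differ
```

The filter response itself does not care where the pattern is; it is the position-dependent flattening that makes the fully connected layer sensitive to the shift.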

If you look at the LeNet architecture, the last convolutional layer produces 5x5 feature maps (that is for the classic 32x32 input; with 28x28 MNIST images and unpadded 5x5 convolutions they come out as 4x4). This means that if your digit has been heavily shifted to one side, the resulting activations will be on that side too. The weights learned by your fully connected layers, which were trained on centered digits, will not recognize this kind of activation pattern.
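The size of those final feature maps can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: two unpadded 5x5 convolutions with stride 1, each followed by 2x2 pooling with stride 2, which is my reading of the usual LeNet layout; the helper name `out_size` is made up.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n + 2 * pad - kernel) // stride + 1


def lenet_feature_map_size(input_size):
    """Assumed layout: conv 5x5 -> pool 2x2/2 -> conv 5x5 -> pool 2x2/2."""
    n = out_size(input_size, kernel=5)   # first convolution
    n = out_size(n, kernel=2, stride=2)  # first pooling
    n = out_size(n, kernel=5)            # second convolution
    n = out_size(n, kernel=2, stride=2)  # second pooling
    return n


size_32 = lenet_feature_map_size(32)  # classic LeNet-5 input
size_28 = lenet_feature_map_size(28)  # raw MNIST images
print(size_32, size_28)
```

Either way the maps going into the fully connected layer are larger than 1x1, so position information is still there for the dense weights to latch onto.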

When you know that your data carries some spatial information, for example that objects are centered, it is good not to go all the way down to 1x1 feature maps, so that your network can capture and use this information. But if you want a truly position-independent network, go for the smallest feature maps possible before the fully connected layer, down to 1x1 feature maps.
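One common way to get down to 1x1 is a global pooling step right before the fully connected layer. The sketch below is plain Python with made-up activation values and helper names, not mxnet code. It pools two channels to a single value each and shows that the result is the same for a centered and a shifted activation pattern, while the flattened 4x4 maps are not.

```python
def global_max_pool(feature_maps):
    """Collapse each HxW channel to a single value (a 1x1 feature map)."""
    return [max(max(row) for row in channel) for channel in feature_maps]


def flatten(feature_maps):
    """Concatenate all channels row by row, as a dense layer would see them."""
    return [v for channel in feature_maps for row in channel for v in row]


def shift(channel, dy, dx):
    """Shift a 2D channel by (dy, dx), filling vacated cells with zeros."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if 0 <= r + dy < h and 0 <= c + dx < w:
                out[r + dy][c + dx] = channel[r][c]
    return out


# Two hypothetical 4x4 channels with activations near the center.
centered = [
    [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.9, 0.2, 0.0],
     [0.0, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.3, 0.0, 0.0],
     [0.0, 0.0, 0.7, 0.0],
     [0.0, 0.0, 0.0, 0.0]],
]
# The same activations moved one cell down and to the right.
shifted = [shift(channel, 1, 1) for channel in centered]

pooled_centered = global_max_pool(centered)
pooled_shifted = global_max_pool(shifted)

print(pooled_centered == pooled_shifted)        # pooled descriptors match
print(flatten(centered) == flatten(shifted))    # flattened maps do not
```

The trade-off is the one described above: the pooled descriptor no longer tells the classifier where something was, only that it was present.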

Alternatively, if you add random translations to your training data as a data augmentation, your network will learn to be more robust to such spatial transformations.
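A minimal sketch of such an augmentation, in plain Python with hypothetical helper names (`translate`, `random_translate`) and a synthetic 20x20 blob standing in for a digit. In a real pipeline you would apply this to each training image before it is fed to the network, drawing a fresh offset every time.

```python
import random


def translate(image, dy, dx, fill=0):
    """Shift a 2D image by (dy, dx) pixels, filling the vacated border."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = image[r][c]
    return out


def random_translate(image, max_shift, rng):
    """Apply a random shift of up to max_shift pixels in each direction."""
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    return translate(image, dy, dx), (dy, dx)


# Synthetic stand-in for a digit: a 20x20 blob centered in a 28x28 frame.
image = [[0] * 28 for _ in range(28)]
for r in range(4, 24):
    for c in range(4, 24):
        image[r][c] = 1

rng = random.Random(0)  # seeded only to make the sketch repeatable
augmented, (dy, dx) = random_translate(image, max_shift=4, rng=rng)

# With a 4-pixel margin and max_shift=4 the blob never leaves the frame.
print(sum(map(sum, augmented)) == sum(map(sum, image)))
```

Choosing `max_shift` no larger than the empty margin around the digit keeps the whole digit inside the frame; larger shifts would crop it, which may or may not be what you want.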