Zero-padded Shortcut Connection in Gluon/MXNet


In the paper "Deep Pyramidal Residual Networks", zero-padding of the feature channels is used on the shortcut connection in order to increase the dimensionality.

There are two sample code sources for PyramidNet:

        # [...]
        batch_size = out.size()[0]
        residual_channel = out.size()[1]
        shortcut_channel = shortcut.size()[1]

        if residual_channel != shortcut_channel:
            padding = torch.autograd.Variable(torch.cuda.FloatTensor(batch_size, residual_channel - shortcut_channel, featuremap_size[0], featuremap_size[1]).fill_(0)) 
            out += torch.cat((shortcut, padding), 1)
        else:
            out += shortcut

I’m wondering how you can do this in Gluon in an efficient way.
Should you create a custom operator which pads additional channels to the data or should one apply the element-wise addition on only a subset of the channels?

This question was also asked on GitHub 4 years ago:

@QueensGambit, I can’t think of an efficient, hybridizable way of doing it, apart from indeed concatenating with zeros that have been sized appropriately ahead of time, as mentioned in the GitHub issue.

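import mxnet as mx
from mxnet import nd

ctx = mx.cpu()  # or mx.gpu(0) if a GPU is available
res = nd.ones((64,64,50,50), ctx=ctx)
shortcut = nd.ones((64,32,50,50), ctx=ctx)
padding = nd.zeros((64,32,50,50), ctx=ctx)  # zeros for the 32 missing channels
out = res + nd.concat(shortcut, padding, dim=1)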

Another solution that works, and is maybe simpler to test out, but is a bit slower:
The padding operator does not support padding on the batch and channel axes (the first two), so you need to transpose the channel axis into a later position, apply the padding, and transpose back.

res = nd.ones((5,10,20,20))
shortcut = nd.ones((5,8,20,20))
out = res + mx.nd.pad(shortcut.transpose((0,2,1,3)), mode='constant', constant_value=0, pad_width=(0,0,0,0,0,2,0,0)).transpose((0,2,1,3))

If you don’t need a hybridizable network, then it is easy: you can simply add the shortcut in place to the leading channels of the residual:

res = nd.ones((5,10,20,20))
shortcut = nd.ones((5,8,20,20))
res[:,:shortcut.shape[1], :,:] += shortcut 
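
All three variants above should produce the same result. Here is a minimal sketch that checks this on the small shapes from the examples. It uses NumPy as a stand-in for `mx.nd` (an assumption, so that it runs without MXNet); note that `np.pad`, unlike the MXNet pad operator, can pad the channel axis directly, so no transpose is needed in this sketch:

```python
import numpy as np

# NumPy stand-ins for the NDArrays above (assumption: same NCHW layout).
res = np.ones((5, 10, 20, 20), dtype=np.float32)
shortcut = np.ones((5, 8, 20, 20), dtype=np.float32)
extra = res.shape[1] - shortcut.shape[1]  # channels missing in the shortcut

# Variant 1: concatenate explicit zeros along the channel axis.
zeros = np.zeros((shortcut.shape[0], extra) + shortcut.shape[2:],
                 dtype=shortcut.dtype)
out_concat = res + np.concatenate([shortcut, zeros], axis=1)

# Variant 2: zero-pad the end of the channel axis.
out_pad = res + np.pad(shortcut, ((0, 0), (0, extra), (0, 0), (0, 0)),
                       mode="constant")

# Variant 3: add the shortcut to the leading channels of a copy of res.
out_slice = res.copy()
out_slice[:, :shortcut.shape[1]] += shortcut

# All three agree element-wise.
assert np.array_equal(out_concat, out_pad)
assert np.array_equal(out_concat, out_slice)
```

The first `shortcut.shape[1]` channels receive the shortcut; the remaining channels of the residual pass through unchanged.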

@ThomasDelteil Thank you very much for your reply.

If I’m not mistaken, concatenating NDArrays to my output doesn’t work, since I’m dealing with the outputs of neural network blocks inside a hybrid block.
So I would need to define an additional parameter to make this work:

self.zeroes_padding = self.params.get(
     'zeroes_padding', shape=zeroes_padding.shape,
     init=mx.init.Constant(zeroes_padding.asnumpy().tolist()), differentiable=False)

Inferring the current batch size in Gluon is also a bit more cumbersome:

I agree that designing this with MXNet symbols is more principled in this case, especially if you consider applying Stochastic Depth, too:

@QueensGambit correct for your first point, and your second point too. Having an operator do that would be the most efficient way of doing it, though it’s a bit trickier. Hopefully we get dynamic shape operators soon (it’s in the works) that will make a lot of these dances with the shapes easier.