Multiscale Object Detection


In display_anchors, shouldn't it be:

fmap = nd.zeros((1, 10, fmap_h, fmap_w))

so fmap_w and fmap_h are flipped?

This confused me quite a bit. Images and feature maps are laid out as (n, c, h, w), i.e. height before width, so you index them as (y, x).
The boxes, however, are indexed (x, y), which gives (x, y, w, h).

This means the input to MultiBoxPrior is (h, w), but its output is (w, h).
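
To make the two conventions concrete, here is a minimal sketch in plain Python. It is not the MultiBoxPrior implementation; anchor_centers is a hypothetical helper, and the row-major ordering and the half-pixel offset are assumptions made only for illustration. It takes the feature map size as (h, w) and returns normalized centers as (x, y):

```python
def anchor_centers(fmap_h, fmap_w):
    """Hypothetical helper: one normalized (x, y) center per feature-map pixel.

    The feature map is assumed to have shape (n, c, fmap_h, fmap_w), so the
    row index runs along y (height) and the column index along x (width).
    """
    centers = []
    for row in range(fmap_h):          # row index -> y direction
        for col in range(fmap_w):      # column index -> x direction
            cx = (col + 0.5) / fmap_w  # x is normalized by the width
            cy = (row + 0.5) / fmap_h  # y is normalized by the height
            centers.append((cx, cy))   # box-style order: x first, then y
    return centers


fmap_h, fmap_w = 2, 4
fmap_shape = (1, 10, fmap_h, fmap_w)   # tensor layout: height before width
centers = anchor_centers(fmap_h, fmap_w)
print(fmap_shape, len(centers), centers[:2])
```

With a non-square feature map like this one, swapping fmap_h and fmap_w changes the spacing of the centers along each axis, which is why the order matters as soon as h and w differ.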

It may be worth commenting somewhere on this flipping between height and width, because it is confusing.

@mseeger I agree it’s confusing, moving between the NDArray tensor definition h,w and the point-style definition with x,y.

Looking at the MultiBoxPrior operator definition, it looks like it uses the convention of width first, then height. That is indeed quite non-standard: when you describe coordinates in terms of height and width, the usual order is height, then width.