GluonCV SSD raises mxnet.base.MXNetError: Shape inconsistent between 2 epochs

Hi,
I’m training a GluonCV SSD and a very weird thing happens: the first epoch works fine, but the first batch of the second epoch fails with:

Traceback (most recent call last):
  File "trash.py", line 308, in <module>
    sum_loss, cls_loss, box_loss = mbox_loss(cls_preds, box_preds, C, B)
  File "/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/block.py", line 548, in __call__
    out = self.forward(*args)
  File "/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/gluoncv/loss.py", line 156, in forward
    cls_loss = -nd.pick(pred, ct, axis=-1, keepdims=False)
  File "<string>", line 89, in pick
  File "/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Shape inconsistent, Provided = [10,6132], inferred shape=[10,100]

Where does that inferred shape=[10,100] come from? Why would this happen between 2 epochs when the whole first epoch went fine?

Wow, how weird: I moved the net.hybridize(static_alloc=True, static_shape=True) call inside the epoch loop (instead of calling it just once, before the epoch loop) and the error disappeared.

I think this happened because you hybridized before you got the anchors. You need to hybridize the model AFTER you request the anchors. Calling the network under the training scope of autograd triggers a different branch of the SSD model, one that returns the anchors. The anchors are then used to compute the training targets on the CPU ahead of time, since those targets are deterministic given the labels and the anchor sizes.
