Help with SSD SmoothL1 metric reporting NaN during training

Greetings everyone,

I apologize in advance for any inconvenience as this is my first post.

I am trying to train an SSD model from GluonCV on a custom dataset, created as an LST record (.lst) file.

I am referencing:

I am encountering an issue whereby the SmoothL1 metric used in [2] reports NaN; as a result, my model
is unable to detect my target object in a preliminary test.

To diagnose the issue, I tried printing out the anchor boxes generated by this snippet of code in [2]:

def get_dataloader(net, train_dataset, data_shape, batch_size, num_workers):
    from gluoncv.data.batchify import Tuple, Stack, Pad
    from gluoncv.data.transforms.presets.ssd import SSDDefaultTrainTransform
    width, height = data_shape, data_shape
    # use fake data to generate fixed anchors for target generation
    with autograd.train_mode():
        _, _, anchors = net(mx.nd.zeros((1, 3, height, width)))
    batchify_fn = Tuple(Stack(), Stack(), Stack())  # stack image, cls_targets, box_targets
    train_loader = gluon.data.DataLoader(
        train_dataset.transform(SSDDefaultTrainTransform(width, height, anchors)),
        batch_size, True, batchify_fn=batchify_fn, last_batch='rollover',
        num_workers=num_workers)
    return train_loader

train_data = get_dataloader(net, dataset, 512, 16, 0)

The anchors were reporting NaN in the last dimension of the bounding box coordinates, e.g.
[48.437504 29.437502 10.88711 nan]
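For anyone else debugging this, one quick way to locate exactly which anchors are affected is to pull the array into NumPy and scan for NaNs. This is only a sketch (the `find_nan_rows` helper is my own, not part of GluonCV); in the loader code above you would pass it `anchors.asnumpy()`:

```python
import numpy as np

def find_nan_rows(anchor_array):
    """Return indices of anchor rows (4 coordinates each) containing any NaN."""
    a = np.asarray(anchor_array, dtype=np.float64).reshape(-1, 4)
    return np.where(np.isnan(a).any(axis=1))[0]

# Small fake anchor set mimicking the symptom above; row 0 contains the NaN:
fake = [[48.437504, 29.437502, 10.88711, float('nan')],
        [10.0, 10.0, 20.0, 20.0]]
print(find_nan_rows(fake))
```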

Is anyone able to advise on a way to resolve this NaN issue?

As an interim solution, I am looking to generate the anchor boxes in the manner described in [1], but that approach lacks the OHEM sampling used by the SSDTargetGenerator inside SSDDefaultTrainTransform, which I am concerned might affect my model's performance.

Set multi_precision=True in your optimizer.
I don't know why it helps… but it did work for me.

Hi @Neutron , thanks for your reply!

Unfortunately, despite setting multi_precision=True in my optimizer by modifying [2] as:

trainer = gluon.Trainer(
    net.collect_params(), 'sgd',
    {'learning_rate': 0.001, 'multi_precision': True, 'wd': 0.0005, 'momentum': 0.9})

I was not able to resolve the issue.

Maybe using a smaller scale to initialize your parameters would help:

net.initialize(mx.init.Uniform(scale=0.01), ctx=ctx)

It is better to try different settings for each net.
For me, just using

trainer = mx.gluon.Trainer(
    net.collect_params(), 'nadam',
    optimizer_params={'beta1': 0.9, 'beta2': 0.99, 'epsilon': 1e-09,
                      'schedule_decay': 0.004, 'multi_precision': True})


net.initialize(mx.init.Xavier(), ctx=ctx)
net.collect_params('.*bias').initialize(mx.init.LSTMBias(forget_bias=1.0), ctx)  # for LSTM only; for other biases, use mx.init.Zero()

the model fits very well

Hi @Neutron, thanks for your reply!

It seems that my dataset had issues with its ground-truth labels, which was the cause of the problem.

However, I will take note of your parameter initialization suggestion, should I run into further issues
with NaN values.
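In case it helps anyone hitting the same symptom: a small sanity check over the .lst file can catch bad ground-truth boxes before training. The `check_lst_line` helper below is my own sketch, not part of GluonCV; it assumes the common detection .lst layout of `idx \t A \t B \t [A header fields] \t [B-wide object records]... \t img_path`, with B = 5 fields per object (class_id, xmin, ymin, xmax, ymax, normalized to [0, 1]):

```python
import math

def check_lst_line(line):
    """Return a list of problems found in one detection .lst line."""
    fields = line.rstrip('\n').split('\t')
    a, b = int(fields[1]), int(fields[2])  # header width A, per-object width B
    problems = []
    if b == 0:
        return problems                    # no object records on this line
    # object records sit between the A header fields and the image path
    values = [float(v) for v in fields[1 + a:-1]]
    for i in range(0, len(values), b):
        cls_id, xmin, ymin, xmax, ymax = values[i:i + b][:5]
        if any(math.isnan(v) for v in (xmin, ymin, xmax, ymax)):
            problems.append('NaN coordinate in object %d' % (i // b))
        elif not (0.0 <= xmin < xmax <= 1.0 and 0.0 <= ymin < ymax <= 1.0):
            problems.append('degenerate/out-of-range box in object %d' % (i // b))
    return problems

# a line with a zero-width box (xmin == xmax) is flagged:
print(check_lst_line('0\t4\t5\t512\t512\t0\t0.1\t0.2\t0.1\t0.4\timg.jpg'))
```

Running this over every line of the .lst file before training makes label problems like mine much easier to find.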

Thank you very much for your help, and for responding nonetheless! 🙂

Dear Lee, I have the same problem trying the fine-tuning SSD code with my custom dataset. While preparing the .lst file, I wrote lines like the following for images that have no objects:
idx 4 5 512 512 -1.0 -0.001953125 -0.001953125 -0.001953125 -0.001953125 img_path

I added the class id as -1 and the bbox as [-1, -1, -1, -1] (the values above are normalized). How did you solve the issues with the ground-truth labels, especially for images that have no objects? By the way, I also tried the following format for images with no objects:
0 4 0 512 512 img_path
However, it gives a labeling error in the SSD code. Thanks so much for any help!