SageMaker CPU Training: Gradient of Parameter `lstnet0_conv0_weight` on context cpu(1) has not been updated by backward since last `step`

Hey All,

I keep getting this error:
“UserWarning: Gradient of Parameter lstnet0_conv0_weight on context cpu(1) has not been updated by backward since last step. This could mean a bug in your model that made it only use a subset of the Parameters (Blocks) for this iteration. If you are intentionally only using a subset, call step with ignore_stale_grad=True to suppress this warning and skip updating of Parameters with stale gradient”

This only happens when I run SageMaker training jobs on 1 or 2 CPU instances, using the Gluon Trainer API launched through the SageMaker MXNet estimator. It never happens when I use a GPU instance (I have only used one GPU instance) with the same code, or when I train locally in the SageMaker notebook instance.

The error occurs on the trainer.step line, but I have no idea why it’s happening when local and GPU training both work perfectly. Is there a bug in the code, and how do I debug this error?

Some additional info:
MXNet version: 1.2


    trainer = gluon.Trainer(net.collect_params(),
        optimizer_params={'learning_rate': hyperparameters['learning_rate'], 'clip_gradient': hyperparameters['clip_gradient']})

    batch_size = hyperparameters['batch_size']
    train_data_loader =
        ts_data_train.train, batch_size=batch_size, shuffle=True, num_workers=2, last_batch='discard')
    test_data_loader =
        ts_data_test.train, batch_size=batch_size, shuffle=True, num_workers=2, last_batch='discard')

    epochs = hyperparameters['epochs']
    print("Training Start")
    metric = mx.metric.RMSE()
    tic = time.time()
    for e in range(epochs):
        epoch_start_time = time.time()
        for data, label in train_data_loader:
            l1 = gluon.loss.L1Loss()
            data = data.as_in_context(ctx[0])
            label = label.as_in_context(ctx[0])
            with autograd.record():
                z = net(data)
                loss = l1(z,label)
            #trainer.step(batch_size, ignore_stale_grad=True)

Hi @Nell,

Can you provide the code where the context (ctx) is set?

I see that your context is a list because you’re using ctx[0]. For CPU, though, I’d expect ctx = mx.cpu(). Make sure you don’t have a context list with one entry per core (e.g. [mx.cpu(0), mx.cpu(1)]), but only mx.cpu(). To use multiple CPU cores, install the MKL build of MXNet (pip install mxnet-mkl); it uses all cores even when the context is set to a single mx.cpu().
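One way to check this is to compare the contexts the parameters live on with the contexts that actually receive data in the training loop. The helper below is a minimal sketch, not part of MXNet: contexts_without_data is a made-up name, and it only assumes that each parameter object has a list_ctx() method, as Gluon Parameter objects do.

```python
def contexts_without_data(params, data_contexts):
    """Map each parameter name to the contexts it lives on that never get data.

    params: mapping of name -> object with list_ctx(),
            e.g. the result of net.collect_params()
    data_contexts: contexts the training loop copies batches to,
            e.g. [ctx[0]]
    """
    # Compare contexts by their string form, e.g. 'cpu(0)' or 'gpu(0)'.
    fed = {str(c) for c in data_contexts}
    report = {}
    for name, param in params.items():
        unfed = [str(c) for c in param.list_ctx() if str(c) not in fed]
        if unfed:
            report[name] = unfed
    return report


# Hypothetical usage inside the training script, after net.initialize(...):
#   print(contexts_without_data(net.collect_params(), [ctx[0]]))
# A non-empty result means some parameter copies never take part in a
# forward/backward pass, which is the situation the warning describes.
```

If the result is empty, every context holding parameters also receives data, and the cause of the warning is probably elsewhere.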

Hey Thom,

Here is the code that goes right before the code above:

    ctx = [mx.cpu(i) for i in range(num_cpus)]
    if num_gpus > 0:
        ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('Running on {}'.format(ctx))
    print('Hosts {}'.format(hosts))
    print('Current Host {}'.format(current_host))

    net = LSTNet(

    net.initialize(init=mx.init.Xavier(factor_type="in", magnitude=2.34), ctx=ctx)

    kvstore = 'local'
    if len(hosts) == 1:
        kvstore = 'device' if num_gpus > 0 else 'local'
    else:
        kvstore = 'dist_device_sync' if num_gpus > 0 else 'dist_sync'

    print('kvstore {}'.format(kvstore))
    store = kv.create(kvstore)
    print("Total number of workers: %d" % store.num_workers)
    print("This worker's rank: %d" % store.rank)

I think this is where your issue is. You don’t need num_cpus. Just set ctx = mx.cpu() to use the CPUs. Even better, try this:

ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()

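You didn’t have this issue on GPU instances because you had set the context correctly for that case. On CPU, your list created one context per CPU, so the parameters were initialized on cpu(0) and cpu(1), but the training loop only sends data to ctx[0]. The parameter copies on cpu(1) therefore never received a gradient, hence the warning. The line above handles both cases in one: it uses the GPU if one is available and otherwise the CPU (all cores, if you have mxnet-mkl).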
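One thing to watch when switching: with this line ctx is a single context instead of a list, so as far as I know the ctx[0] indexing in the training loop will no longer work. Either replace ctx[0] with ctx, or wrap the context in a list. Here is a small sketch of the second option; as_context_list is a hypothetical helper, not an MXNet function.

```python
def as_context_list(ctx):
    """Return ctx as a list, wrapping a single context if needed."""
    if isinstance(ctx, (list, tuple)):
        return list(ctx)
    return [ctx]


# Hypothetical usage in the training script (requires mxnet as mx):
#   ctx = as_context_list(mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu())
#   net.initialize(ctx=ctx)
#   data = data.as_in_context(ctx[0])
```

This keeps the rest of the script unchanged while still ending up with exactly one context on CPU.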

That worked!!!

Thanks Thom! You are awesome!