Use GPUs

This paragraph is very interesting: "You can lose significant performance by moving data without care. A typical mistake is as follows: computing the loss for every minibatch on the GPU and reporting it back to the user on the commandline (or logging it in a NumPy array) will trigger a global interpreter lock which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and only move larger logs."

Could it come with more practical recommendations or even code snippets illustrating optimal monitoring? For example:

  • Does metric.update(label, y_pred) (where metric is an mx.metric) also incur a costly data transfer?
  • Does anything using metric.get() suffer from this data-transfer + GIL problem?
  • Should print statements be avoided at all costs in the training loop?
  • print would definitely be fatal. It would not only invoke the wrath of the GIL (global interpreter lock) but also that of the console output. Even if you were to write C++ code, output to the console is a surefire way of killing performance :slight_smile: .
  • As for the metric, that is safer, but you shouldn’t log the metric to an array on the CPU if you can avoid it. A much better strategy is to log it into an array on the GPU (if that’s where you are) and only occasionally transfer it to the CPU.

In general, what is fatal is O(n) updates rather than O(1) updates. We should probably add more details about this. One real use case where we saw this was a differential-privacy application where a (highly talented) scientist decided to log scalar updates to the CPU after each observation, killing performance. Note that this is inherent to interpreted Python code and thus hard to avoid at the framework level.
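To make the O(1)-vs-O(n) point concrete, here is a minimal sketch of the recommended pattern: keep a running loss in a single device-side scalar and synchronize with the host only every log_every batches. A NumPy scalar stands in for a GPU-resident MXNet NDArray so the sketch is self-contained; in real MXNet code the float(acc) call is where the device-to-host copy (and the implied synchronization) would happen, e.g. via acc.asscalar().

```python
import numpy as np

def windowed_loss_log(losses, log_every=100):
    """Accumulate per-batch losses in one device-side scalar (an O(1)
    update per batch) and transfer to the host only once per window."""
    logs = []
    acc = np.zeros(())            # stand-in for a 1-element GPU NDArray
    for i, l in enumerate(losses):
        acc = acc + l             # stays "on the device": no sync here
        if (i + 1) % log_every == 0:
            # the only synchronization point: one scalar copy per window
            logs.append(float(acc) / log_every)
            acc = np.zeros(())
    return logs
```

The anti-pattern described above amounts to moving the float(acc) conversion inside the loop, which forces one host synchronization per observation instead of one per window.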


Indeed an interesting topic. Based on my understanding of the chapter, different parts of the neural network will run on different cores/GPUs depending on the setting. So should we regularly save the parameters on their respective GPUs and only collate and store them on the CPU at a lower frequency?


I have been developing a (small) framework that helps with training, and this is indeed one of the trickier things to implement.

If you train on multiple GPUs and don’t want to bring the loss and metric values back to the CPU after each batch (to avoid the performance penalty), you have to keep them on the GPU where that particular batch was processed. As a result, every GPU ends up with, for example, its own “cumulative_loss”.

And then, once you want to bring it back to the CPU, you have to iterate over all the GPUs, collect the metrics, and average them.
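A minimal sketch of that collection step, assuming each GPU has accumulated a loss sum and a sample count during the epoch. Plain Python numbers stand in for the 1-element NDArrays you would keep in each GPU’s context; in MXNet, the float(s) conversion (i.e. s.asscalar()) is what triggers the copy back to the CPU:

```python
def collect_epoch_loss(per_gpu_sums, per_gpu_counts):
    """One host-side reduction per epoch instead of one per batch:
    copy each GPU's cumulative loss back once, then average."""
    total = sum(float(s) for s in per_gpu_sums)   # one transfer per GPU
    count = sum(int(c) for c in per_gpu_counts)
    return total / count
```

For example, collect_epoch_loss([6.0, 4.0], [2, 3]) gives the sample-weighted mean 2.0. Weighting by counts matters when the GPUs did not all process the same number of samples (e.g. an uneven last batch).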

Still looking for an elegant way to tackle this (and unfortunately the MXNet metrics packages don’t cater for this).


@Shishir_Suman, the Trainer, using the kvstore, collates and aggregates the gradients on every iteration (i.e., for every batch) before the optimizer step that updates the weights. Typically, if you use GPUs, this is done on the GPU. Afterwards, all your GPUs have the same copy of the weights.

If you want to save the parameters to a file, calling .save_parameters() will copy the weights to the CPU and store them on disk.


This is a very good point. See this issue :slight_smile:

I have a repo here that details a few techniques to optimize training speed.

One of them is indeed to minimize the time spent on CPU. One solution is exactly what you mentioned, simply accumulate the metrics during the epoch on each GPU and collate them only at the end of the epoch to compute the metric.

More realistically, you usually want some feedback during the training of an epoch; one way is to have an if i % 100 == 0: print_metric() check in the loop.

Another way that works quite well in practice is to take advantage of MXNet’s asynchronous execution by issuing the load of the next batch of data onto the GPU (data.as_in_context(ctx)) before processing the metric of the previous one. That way, by the time the metric has been processed, the data is already there and ready to go through the GPU.
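The idea can be sketched as a small wrapper around the data iterator that issues the copy of the next batch before handing the current one back to the caller. Here to_device is a hypothetical stand-in for lambda data: data.as_in_context(ctx); because MXNet’s engine is asynchronous, the copy it issues overlaps with the metric computation the caller does on the previous batch:

```python
def prefetch(batches, to_device):
    """Yield device-side batches, issuing the next copy one step early."""
    it = iter(batches)
    try:
        nxt = to_device(next(it))      # copy of the very first batch
    except StopIteration:
        return                         # empty iterator: nothing to yield
    for b in it:
        cur, nxt = nxt, to_device(b)   # issue next copy before yielding
        yield cur                      # caller computes loss/metric here
    yield nxt                          # last batch, nothing left to prefetch
```

The batches come out in the original order; only the moment at which each copy is issued changes.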


Great presentation and tips, thanks. Especially the second tip, loading the next batch before processing the metric, sounds like an easy win without much refactoring required.

Does this have any memory implications (since the code would still have a handle to the previous loss tensor while already loading the new batch)?

@innerlogic Thanks. The only implication is indeed that there is an extra handle on the previous loss tensor, which means it will occupy GPU memory until it goes out of scope and is garbage collected. In practice this has no real impact, since it is tiny compared to the total GPU memory. You can also preload a few batches of input data on the GPU in advance, which is usually on the order of tens of MB compared to, say, the 16 GB of a V100.

The following function should be modified; otherwise, calling try_gpu(0) will always assume a GPU is available even if we only have a CPU.

def try_gpu(i=0):
    """Return gpu(i) if exists, otherwise return cpu()."""
    return context.gpu(i) if context.num_gpus() >= i else context.cpu()

The line

    return context.gpu(i) if context.num_gpus() >= i else context.cpu()

should be

    return context.gpu(i) if context.num_gpus() >= i + 1 else context.cpu()
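The off-by-one is easy to check by hand: with n GPUs the valid device indices are 0 … n-1, so gpu(i) exists exactly when num_gpus() >= i + 1. A tiny framework-free sketch of the corrected condition (num_gpus is passed in, so no MXNet is needed):

```python
def gpu_exists(i, num_gpus):
    """True iff device index i is valid when num_gpus devices exist."""
    return num_gpus >= i + 1

# With the original ">= i" condition, gpu_exists(0, 0) would wrongly
# return True, which is exactly the try_gpu(0) bug reported above.
```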


Thanks for pointing it out.