Best practices for prediction on a machine with multiple GPUs

MXNet offers great out-of-the-box features for accelerating training on machines with multiple GPUs. Because training is so fast, I'm finding that on my problem I'm spending a large percentage of the total job time computing evaluation metrics. In case it's pertinent, I'm using the symbolic API with a custom training loop. I compute metrics inside the training loop on a small amount of data for early stopping, and then at the end of the training loop on the test data. In total this accounts for >50% of the job's runtime.
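
Roughly, the evaluation step inside my loop looks like this (a minimal sketch; `mod`, `metric`, and `val_iter` are placeholder names, not my actual code):

```python
# Sketch of the per-epoch evaluation step described above.
metric.reset()
for batch in val_iter:
    mod.forward(batch, is_train=False)              # inference pass
    metric.update(batch.label, mod.get_outputs())   # update metric on host
print(metric.get())
```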

So I’d like to know—are there best practices for speeding up prediction when multiple devices are present? I believe prediction occurs on just a single GPU even if many are present, which feels like poor resource utilization if most of the GPUs are idle for long periods. Any examples would be super useful.

Some specifics, in case they're pertinent: I'm solving a prediction problem with ~100K outputs and a final softmax layer. My two target metrics are perplexity and recall-at-k (I think MXNet calls this TopKAccuracy). Since the recall metric is very expensive to compute (it requires the rank of the true label), I'm early stopping on perplexity only, and I compute the recall metric only on the test data.
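
For reference, this is how I'm constructing the two metrics (a sketch; the `top_k` value here is illustrative, not my actual setting):

```python
import mxnet as mx

perplexity = mx.metric.Perplexity(ignore_label=None)  # no label ignored
recall_at_k = mx.metric.TopKAccuracy(top_k=10)        # illustrative k
```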

Could you post the code snippet? If you have profiling results to share, that would be very helpful too. In the meantime, setting the right environment variables may speed up computation: https://mxnet.incubator.apache.org/how_to/env_var.html
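
For example (a sketch with illustrative values only; the linked page documents each variable, and they generally need to be set before mxnet is imported):

```python
import os

# Illustrative values, not tuned recommendations.
os.environ['MXNET_GPU_WORKER_NTHREADS'] = '2'  # concurrent worker threads per GPU
os.environ['MXNET_CPU_WORKER_NTHREADS'] = '4'  # threads for CPU-side operators
os.environ['OMP_NUM_THREADS'] = '4'

import mxnet as mx  # import only after the variables are set
```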

The metrics in MXNet convert NDArrays to numpy arrays, which is slow for a large prediction result due to the high device-to-host communication cost. @szha is working on removing those conversions.
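
Until that lands, one workaround is to do the heavy reduction on the GPU and only copy a scalar back to the host. A sketch (`topk_hits` is a hypothetical helper, not part of MXNet):

```python
import mxnet as mx

def topk_hits(preds, labels, k):
    """preds: (batch, num_classes) NDArray; labels: (batch,) NDArray,
    both on the GPU. Returns how many labels appear in the top k."""
    top = mx.nd.topk(preds, axis=1, k=k)                       # stays on the device
    hits = (top == labels.reshape((labels.shape[0], 1))).sum() # reduce on the device
    return hits.asscalar()                                     # copy one number, not ~100K columns
```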

So I found two problems by reading through my code and the source, and thinking a bit.

  1. As @eric-haibin-lin notes, some of the metrics call `asnumpy()` on the NDArrays very early. If you have many outputs this is extremely slow.
  2. I had written my own metric and was updating it with the sequence `mod.forward(data)`, `metric.update(labels, mod.get_outputs())`. Through reading the source I realised that this is quite different from calling `mod.update_metric(metric, labels)`; see the sketch below.
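
For concreteness, the two patterns side by side (a sketch; `batch`, `mod`, and `metric` stand in for my actual objects):

```python
# What I had: get_outputs() first merges the per-GPU output slices
# onto a single device, and the metric then sees one big array.
mod.forward(batch, is_train=False)
metric.update(batch.label, mod.get_outputs())

# What the source suggests instead: update_metric feeds each GPU's
# output slice to the metric together with the matching label slice,
# so the cross-device merge of a (batch, ~100K) array is skipped.
mod.forward(batch, is_train=False)
mod.update_metric(metric, batch.label)
```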