Using NVIDIA Profiling tools: Visual Profiler and Nsight Compute

thomelane · January 18, 2019, 11:42pm

Just putting this on the forum since it could be of use for some people.

I haven’t spent long using these tools, but I think they offer a little more insight than the MXNet Profiler, if optimising CUDA kernels and GPU performance is your thing!

Anyone used these tools before? If so, I’d be really interested to hear from you. What have you used them for and what metrics do you use the most?

Using NVIDIA’s CUDA Profiling Tools

MXNet’s Profiler is definitely the recommended starting point for profiling MXNet code, but NVIDIA also provides a couple of tools for low level profiling of CUDA code: Visual Profiler and Nsight Compute. You can use these tools to profile all kinds of executables, so they can be used for profiling Python scripts running MXNet.

Visual Profiler is available in the CUDA 9 and CUDA 10 toolkits. You get a timeline view of CUDA kernel executions, and can also analyse the profiling results to get automated recommendations. It seems to be the more useful of the two for profiling end-to-end training, but I found the interface can get slow and unresponsive.

Nsight Compute is available in the CUDA 10 toolkit, but can be used to profile code running CUDA 9. You don’t get a timeline view, but you get many low level statistics about each individual kernel executed and can compare multiple runs (i.e. create a baseline). It doesn’t seem to be that useful for profiling end-to-end model training though.

Start by profiling a small section of code (e.g. a few batches), otherwise the visualizations and analysis will take much longer.
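One way to keep a profiling run short is to cap the number of batches the training loop processes. The snippet below is only a sketch: `first_batches` and `fake_loader` are made-up names, and the generator stands in for a real data loader so that the example runs without MXNet installed.

```python
from itertools import islice


def first_batches(data_iter, max_batches):
    """Return an iterator over at most max_batches items of data_iter."""
    return islice(data_iter, max_batches)


# Stand-in for a real data loader (hypothetical): 1000 fake (data, label) batches.
fake_loader = (("data_%d" % i, "label_%d" % i) for i in range(1000))

processed = []
for data, label in first_batches(fake_loader, 5):
    # The forward pass, backward pass and optimizer step would go here.
    processed.append(data)

print(len(processed))
```

The same wrapper should work around any iterable of batches, so the script being profiled can stay otherwise unchanged.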

Setup

On local machine, download and install CUDA toolkit.

Go to https://developer.nvidia.com/cuda-toolkit
You only need the ‘toolkit’ and not the CUDA drivers, etc.
On CUDA versions: it seems to be possible to use the profilers from the CUDA 10 toolkit on remote code running CUDA 9.
CUDA 10 toolkit is required for
Start AWS EC2 instance (with GPU) using the DLAMI (CUDA drivers and toolkit pre-installed).
Allow password-based login via SSH.

This seems to be the only way to connect to a remote machine with the NVIDIA profilers.
Use a very strong password, and use a security group with minimal source addresses (i.e. not open to the world).
See https://aws.amazon.com/premiumsupport/knowledge-center/ec2-password-login/

Using Visual Profiler

Open nvvp on local machine

You start the program from the terminal rather than the ‘Applications’ folder.
On macOS: /Developer/NVIDIA/CUDA-10.0/bin/nvvp
Select a workspace

Choose default: e.g. /Users/username/nvvp_workspace

Create New Session

File -> New Session

Executable Properties

Connection: Manage connections…
- Add
- Host name: IP address of the AWS EC2 instance. e.g. 52.12.34.567
- Username: Username on AWS EC2 instance. e.g. ubuntu or ec2-user
- Label: Any string will do e.g. ubuntu@52.12.34.567
- System Type: SSH, Port number: 22
- Finish
Toolkit/Script: Manage
- Toolkit path: Browse…
- Should be path to CUDA toolkit on remote instance
- e.g. /usr/local/cuda-9.2/bin
- Update library path with defaults? Yes.
- Finish
File:
- Should be path to Python (in correct conda environment)
- e.g. /home/ubuntu/anaconda3/envs/mxnet_p36/bin/python
Working directory:
- Optional
Arguments:
- Should be the script you want to profile, and its arguments.
- e.g. /home/ubuntu/mxnet/example/gluon/mnist/mnist.py --cuda --batch-size 100 --epochs 1
- Select ‘Profile child processes’
Next >

Profiling Options

Optionally select what you need to be profiled.
- e.g. Enable CPU thread tracing
Finish
If a dialog warns that the nvprof version is different and asks whether you want to proceed, choose Yes.

Capture

Should start job straight away
“Generating Timeline: Running application to generate timeline.”
Seems to keep running >10s after script has completed.

Analysis

Use analysis tab to run through diagnostics and get automated recommendations.

Using NVIDIA Nsight Compute

I couldn’t get remote execution to work with Nsight Compute the way it does with Visual Profiler: it can’t find the remote Python executable.

See in known issues: “Launching applications on remote targets/platforms is not yet supported.”

Collecting data

Download CUDA 10 toolkit on remote machine
Install toolkit

You can skip the installation of the drivers, and just install the toolkit.
By default the toolkit will be installed to /usr/local/cuda-10.0/

Use nv-nsight-cu-cli to collect data

Can be found at /usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli

And some useful arguments are:

-f to force overwrite of profiling file
-c to specify number of kernels to collect
-s to specify number of kernels to skip before collecting

/usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli -f -c 10 /home/ubuntu/anaconda3/envs/mxnet_p36/bin/python /home/ubuntu/mxnet/example/gluon/mnist/mnist.py --cuda --batch-size 500 --epochs 1

You should now have a profile.nsight-cuprof-report file in the current working directory.
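If you want to try different -c and -s values, it can help to build the command in a small script. The snippet below is only a sketch: `build_nsight_command` is a made-up helper, and it uses just the three flags listed above (-f, -c, -s) together with the example paths from this post.

```python
NSIGHT_CLI = "/usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli"


def build_nsight_command(target, force=True, count=None, skip=None,
                         cli=NSIGHT_CLI):
    """Assemble an nv-nsight-cu-cli command line as a list of arguments."""
    cmd = [cli]
    if force:
        cmd.append("-f")            # overwrite an existing profiling file
    if count is not None:
        cmd += ["-c", str(count)]   # number of kernels to collect
    if skip is not None:
        cmd += ["-s", str(skip)]    # number of kernels to skip first
    return cmd + list(target)


target = [
    "/home/ubuntu/anaconda3/envs/mxnet_p36/bin/python",
    "/home/ubuntu/mxnet/example/gluon/mnist/mnist.py",
    "--cuda", "--batch-size", "500", "--epochs", "1",
]

# Skip the first 100 kernels (an arbitrary number), then collect 10.
cmd = build_nsight_command(target, count=10, skip=100)
print(" ".join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd)` on the remote machine. Skipping some kernels at the start may help avoid profiling only warm-up work, though the right number depends on the script.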

Copy data to local machine

Use scp, Jupyter download feature, or otherwise to get the file to the local machine.
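For the scp route, the copy can be scripted from the local machine. The snippet below is a sketch under assumptions: `build_scp_command` is a made-up helper, the address 203.0.113.10 is a placeholder, and the remote path assumes the report was written to the home directory of the ubuntu user.

```python
def build_scp_command(user, host, remote_path, local_dir="."):
    """Assemble an scp command that copies a remote file to local_dir."""
    source = "{}@{}:{}".format(user, host, remote_path)
    return ["scp", source, local_dir]


# Placeholder host and assumed remote location of the report file.
cmd = build_scp_command(
    "ubuntu",
    "203.0.113.10",
    "/home/ubuntu/profile.nsight-cuprof-report",
)
print(" ".join(cmd))
```

Running the printed command in a local terminal should prompt for the password set up earlier.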

Visualizing/Analysing data

Open ‘NVIDIA Nsight Compute’ from /Developer/NVIDIA/CUDA-10.0/NsightCompute-1.0.
File -> Open file…
Select the file you just transferred to the local machine.
Compare to GPU ‘Speed of Light’ (SOL).
