Using NVIDIA Profiling tools: Visual Profiler and Nsight Compute

Just putting this on the forum since it could be of use for some people.

I haven’t spent long using these tools, but I think they offer a little more insight than the MXNet Profiler provides, if optimising CUDA kernels and GPU performance is your thing!

Anyone used these tools before? If so, I’d be really interested to hear from you. What have you used them for and what metrics do you use the most?

Using NVIDIA’s CUDA Profiling Tools

MXNet’s Profiler is definitely the recommended starting point for profiling MXNet code, but NVIDIA also provides a couple of tools for low level profiling of CUDA code: Visual Profiler and Nsight Compute. You can use these tools to profile all kinds of executables, so they can be used for profiling Python scripts running MXNet.

Visual Profiler is available in the CUDA 9 and CUDA 10 toolkits. You can get a timeline view of CUDA kernel executions, and also analyse the profiling results to get automated recommendations. It seems to be the most useful for profiling end-to-end training, but I found the interface can get slow and unresponsive.

Nsight Compute is available in the CUDA 10 toolkit, but can be used to profile code running CUDA 9. You don’t get a timeline view, but you get many low-level statistics about each individual kernel executed and can compare multiple runs (i.e. create a baseline). It doesn’t seem to be that useful for profiling end-to-end model training though.

Start by profiling a small section of code (e.g. a few batches); otherwise the visualizations and analysis will take much longer.
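One way to keep a profiling run short is to cap the number of batches the training loop consumes. A minimal sketch, assuming the data loader is an ordinary Python iterable (a Gluon DataLoader should behave this way); `fake_loader` and `limited` are made-up names so the snippet runs without MXNet:

```python
from itertools import islice


def limited(batches, max_batches):
    """Yield at most max_batches items from any iterable of batches."""
    return islice(batches, max_batches)


# Stand-in for a real data loader (hypothetical); any iterable works here.
fake_loader = ((f"data{i}", f"label{i}") for i in range(1000))

seen = []
for data, label in limited(fake_loader, 3):
    # The real forward/backward/update step would go here.
    seen.append((data, label))

print(f"profiled {len(seen)} batches")
```

In a real script the same wrapper would go around the training DataLoader, so only those few batches reach the GPU while the profiler is attached.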


  1. On local machine, download and install CUDA toolkit.

    Go to
    You only need the ‘toolkit’ and not the CUDA drivers, etc.
    With CUDA versions, it seems to be possible to use profilers from CUDA 10 toolkit on remote code running CUDA 9.
    CUDA 10 toolkit is required for

  2. Start AWS EC2 instance (with GPU) using the DLAMI (CUDA drivers and toolkit pre-installed).

  3. Allow password-based login via SSH.

    Seems to be the only method to connect to remote machine with NVIDIA profilers.
    Use a very strong password, and use a security group with minimal source addresses (i.e. not open to the world).
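On EC2 Ubuntu images, password login is typically switched off in /etc/ssh/sshd_config. A rough sketch of the edit involved, assuming the standard OpenSSH `PasswordAuthentication` keyword; the helper name `enable_password_auth` is made up for illustration, and it only transforms text rather than touching the real file:

```python
import re

# Matches an active or commented-out "PasswordAuthentication yes|no" line.
_PATTERN = re.compile(
    r"^[ \t]*#?[ \t]*PasswordAuthentication[ \t]+(yes|no)[ \t]*$",
    re.MULTILINE,
)


def enable_password_auth(config_text):
    """Return sshd_config text with PasswordAuthentication set to yes."""
    line = "PasswordAuthentication yes"
    if _PATTERN.search(config_text):
        # Rewrite every matching line so no later "no" survives.
        return _PATTERN.sub(line, config_text)
    # Keyword absent: append it, keeping a trailing newline.
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + line + "\n"


sample = "Port 22\n#PasswordAuthentication yes\nPasswordAuthentication no\n"
print(enable_password_auth(sample))
```

To apply it for real you would write the result back to /etc/ssh/sshd_config with sudo, set a password for the login user (e.g. sudo passwd ubuntu), and restart the SSH service.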

Using Visual Profiler

  • Open nvvp on local machine

    You start the program from the terminal rather than the ‘Applications’ folder.
    On macOS: /Developer/NVIDIA/CUDA-10.0/bin/nvvp

  • Select a workspace

    Choose default: e.g. /Users/username/nvvp_workspace

Create New Session

  • File -> New Session

Executable Properties

  • Connection: Manage connections…
    • Add
    • Host name: IP address of the AWS EC2 instance. e.g.
    • Username: Username on AWS EC2 instance. e.g. ubuntu or ec2-user
    • Label: Any string will do e.g. ubuntu@
    • System Type: SSH, Port number: 22
    • Finish
  • Toolkit/Script: Manage
    • Toolkit path: Browse…
    • Should be path to CUDA toolkit on remote instance
    • e.g. /usr/local/cuda-9.2/bin
    • Update library path with defaults? Yes.
    • Finish
  • File:
    • Should be path to Python (in correct conda environment)
    • e.g. /home/ubuntu/anaconda3/envs/mxnet_p36/bin/python
  • Working directory:
    • Optional
  • Arguments:
    • Should be the script you want to profile, and its arguments.
    • e.g. /home/ubuntu/mxnet/example/gluon/mnist/ --cuda --batch-size 100 --epochs 1
    • Select ‘Profile child processes’
  • Next >

Profiling Options

  • Optionally select what you need to be profiled.
    • e.g. Enable CPU thread tracing
  • Finish
  • If warned that the nvprof version is different and asked whether to proceed, choose Yes.


  • Should start job straight away
  • “Generating Timeline: Running application to generate timeline.”
  • Seems to keep running >10s after script has completed.


Use analysis tab to run through diagnostics and get automated recommendations.
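If the remote session misbehaves, an alternative I believe works is to run nvprof (the command-line profiler that Visual Profiler drives) directly on the instance and load the output with File -> Import in nvvp. A hedged sketch that only builds and prints the command; `nvprof_command`, the output name, and `train.py` with its arguments are placeholders, not taken from the steps above:

```python
import shlex


def nvprof_command(python_exe, script_args, output="timeline_%p.nvvp"):
    """Build an nvprof command line for a Python script and its child processes."""
    return [
        "nvprof",
        "--profile-child-processes",  # counterpart of 'Profile child processes' in the GUI
        "-f",                         # overwrite an existing output file
        "-o", output,                 # nvprof replaces %p with the process ID
        python_exe,
    ] + list(script_args)


cmd = nvprof_command(
    "/home/ubuntu/anaconda3/envs/mxnet_p36/bin/python",
    ["train.py", "--epochs", "1"],  # hypothetical script and arguments
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell on the instance; the resulting .nvvp files then need copying to the local machine before importing.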

Using NVIDIA Nsight Compute

Couldn’t get remote execution to work with Nsight Compute (unlike with Visual Profiler): it can’t find the remote Python executable.

See in known issues: “Launching applications on remote targets/platforms is not yet supported.”

Collecting data

  1. Download CUDA 10 toolkit on remote machine

  2. Install toolkit

You can skip the installation of the drivers, and just install the toolkit.
By default the toolkit will be installed to /usr/local/cuda-10.0/

  3. Use nv-nsight-cu-cli to collect data

Can be found at /usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli

And some useful arguments are:

  • -o to specify the name of the output report file (e.g. profile)
  • -f to force overwrite of profiling file
  • -c to specify number of kernels to collect
  • -s to specify number of kernels to skip before collecting

/usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli -o profile -f -c 10 /home/ubuntu/anaconda3/envs/mxnet_p36/bin/python /home/ubuntu/mxnet/example/gluon/mnist/ --cuda --batch-size 500 --epochs 1

You should now have a profile.nsight-cuprof-report file in the current working directory.
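Since Nsight Compute can compare runs against a baseline, it can help to generate the command lines from one place with different report names. A small sketch under some assumptions: `-o` is used to name the report file, `nsight_command` is a made-up helper, and `train.py` is a placeholder for the script being profiled:

```python
import shlex

NSIGHT_CLI = "/usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli"


def nsight_command(target, output, count=10, skip=0):
    """Build an nv-nsight-cu-cli command that writes a report named `output`."""
    cmd = [NSIGHT_CLI, "-f", "-o", output, "-c", str(count)]
    if skip:
        # Skip early kernels (e.g. initialisation) before collecting.
        cmd += ["-s", str(skip)]
    return cmd + list(target)


# Interpreter path from the DLAMI conda environment; script name is hypothetical.
TARGET = [
    "/home/ubuntu/anaconda3/envs/mxnet_p36/bin/python",
    "train.py", "--batch-size", "500",
]

baseline = nsight_command(TARGET, "baseline")
later = nsight_command(TARGET, "after_warmup", skip=100)

for cmd in (baseline, later):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each printed command should leave its own report file on the instance, which can then be opened side by side locally.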

Copy data to local machine

Use scp, Jupyter download feature, or otherwise to get the file to the local machine.

Visualizing/Analysing data

  • Open ‘NVIDIA Nsight Compute’ from /Developer/NVIDIA/CUDA-10.0/NsightCompute-1.0.
  • File -> Open file…
  • Select the file you just transferred to the local machine.
  • Compare to GPU ‘Speed of Light’ (SOL).