GPU Tools

NVIDIA's CUDA toolkit comes with a variety of software that you can use to monitor the status of GPUs and analyze the performance of GPU code. These tools can greatly improve how effectively you use the GPUs.

Find Available GPUs

NVIDIA's System Management Interface (nvidia-smi) shows information about GPU utilization. This is particularly useful on GPU development nodes, where the GPUs are shared between all users on the node. The nvidia-smi utility is available without loading a CUDA module.

Running nvidia-smi with no arguments will show a table of information about all the GPUs on that node and a table of all running processes. The GPUs are indexed with an integer. Once you identify one with low utilization, set the CUDA_VISIBLE_DEVICES environment variable to control which GPU(s) your application will use. For example,

export CUDA_VISIBLE_DEVICES=1

will make your application use the second GPU on the node, since GPU indices start from zero. You can also tell your application to use multiple GPUs (here, the third and fourth); for example,

export CUDA_VISIBLE_DEVICES=2,3

Note

If you don't set CUDA_VISIBLE_DEVICES, your program will default to using the GPUs in order. If your program only uses one GPU, this will be device 0.

See NVIDIA's nvidia-smi documentation for additional options.
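For example, to print a one-line summary of each GPU's utilization and memory use (a minimal sketch; many other fields can be queried):

nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv

A GPU with low utilization and plenty of free memory is a good candidate for CUDA_VISIBLE_DEVICES.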

Multi-GPU Communication

If your software uses multiple GPUs, it ideally makes use of peer-to-peer communication. Peer-to-peer GPU communication eliminates the CPU as a data transfer "middleman." When data must first be sent to the CPU before passing on to the destination GPU, this adds to the overall time the transfer takes and may also delay the CPU from communicating additional data and instructions to the GPUs.

Nodes that support NVLink support peer-to-peer communication. Other multi-GPU nodes may still support peer-to-peer communication depending on their network topology; this can be checked with nvidia-smi topo -m. This command will produce a matrix showing the various connections between the GPUs.
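On recent drivers, nvidia-smi can also report peer-to-peer capability directly (see nvidia-smi topo -h for the exact syntax on your system); for example,

nvidia-smi topo -p2p r

shows which GPU pairs support peer-to-peer reads.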

Software will make use of peer-to-peer communication if it uses CUDA API calls like cudaDeviceEnablePeerAccess or copies data with cudaMemcpyDeviceToDevice. See NVIDIA's simpleP2P or mergeSort CUDA samples for examples.
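As a minimal sketch of what this looks like in code (assuming two visible devices and omitting error checking for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Ask the runtime whether device 0 can directly access device 1's memory.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) {
        std::printf("Peer access between devices 0 and 1 is not supported.\n");
        return 0;
    }

    const size_t bytes = 1 << 20;  // 1 MiB per buffer
    float *buf0 = nullptr, *buf1 = nullptr;

    // Allocate a buffer on device 0 and enable direct access to device 1.
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0

    // Allocate a buffer on device 1.
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Copy device 0 -> device 1 without staging through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}

Here cudaDeviceEnablePeerAccess is called while device 0 is current, granting it direct access to device 1's memory, and cudaMemcpyPeer then moves the data GPU-to-GPU.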

Debugging Tools

There are three suggested options for debugging CUDA code on the HPCC:

* If you're already familiar with GDB, CUDA-GDB is available through the CUDA modules (see the example after this list). Documentation is available from NVIDIA.
* If you use VS Code for writing software, NVIDIA has the Nsight Visual Studio Code extension.
* ICER pays for the TotalView debugger. This debugger is best launched from the command line through an Interactive Desktop OnDemand app. There is documentation for both its modern and classic interfaces. Some advanced tools may only be available in the classic interface.
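For example, a minimal CUDA-GDB session might look like the following (my_app.cu is a hypothetical stand-in for your own source file):

module load CUDA/12.6.0
nvcc -g -G my_app.cu -o my_app
cuda-gdb ./my_app

The -g and -G flags include host and device debugging information, respectively, so that CUDA-GDB can step through both CPU and GPU code.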

Profiling Tools

NVIDIA's Nsight Systems and Nsight Compute are profiling and performance analysis tools that can be used to identify performance bottlenecks and other optimization opportunities. They are the modern replacements for nvprof and the NVIDIA Visual Profiler.

Each version of the CUDA toolkit ships with different versions of Nsight Systems and Compute as laid out in the table below. You should use a CUDA version that is compatible with your desired GPU.

CUDA Toolkit Version | Nsight Systems Version | Nsight Compute Version
-------------------- | ---------------------- | ----------------------
12.6.0               | 2024.4.2               | 2024.3.0
12.4.0               | 2023.4.4               | 2024.1.0
12.3.0               | 2023.3.3               | 2023.3.0
12.1.1               | 2023.1.2               | 2023.1.1
11.7.0               | 2022.1.3               | 2022.2.0

Nsight Compute can be used to profile CUDA kernels (functions that run on GPUs). On the other hand, Nsight Systems analyzes everything involved in running your code: CPU parallelization, CPU-GPU communication, network communications, OS interactions, and more. If you want a holistic picture of how your software is running, Nsight Systems is likely the tool you want to use. If you want to look closely at the details of GPU performance, you might consider Nsight Compute instead.

After loading one of the CUDA modules (e.g., module load CUDA/12.6.0), Nsight Systems can be run on the command line with the nsys command and Compute with the ncu command. See NVIDIA's documentation for Systems and Compute. There are also videos for Systems and tutorials for both Systems and Compute.
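As a minimal sketch (my_app is a hypothetical stand-in for your own executable; both tools accept many more options):

nsys profile -o my_report ./my_app
ncu -o my_profile ./my_app

The first command records a system-wide trace to a .nsys-rep report file; the second collects per-kernel metrics in a .ncu-rep report file. Either report can then be opened in the corresponding GUI for analysis.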

Python Support

Nsight Systems and Compute can be used with Python, including support for popular data science and machine learning libraries like Dask and PyTorch. If you use JupyterLab, NVIDIA has created an extension that allows you to use Systems and Compute within JupyterLab.
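Outside of JupyterLab, a Python script can be traced from the command line just like a compiled program (a sketch; my_script.py is a stand-in for your own code):

nsys profile -o python_report python my_script.py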

To use the extension, it's best to have your desired CUDA module loaded alongside your JupyterLab instance so that the Nsight executables are on your PATH. Options for launching a JupyterLab server with a CUDA module loaded include: