Compiling for GPUs
There are a few different ways for code to access GPUs. This page will focus only on compiling code that uses those techniques rather than explaining the techniques themselves.
For more general information on compilers on the HPCC, see our reference page.
GPU compatibility
Because GPU hardware can change dramatically from device to device, it's important to keep in mind the GPUs your code will run on when compiling. In practical terms on the HPCC, this means selecting a version of CUDA that is best suited for your target GPU.
See the table below for the CUDA compute capability and associated CUDA versions of the various GPUs available on the HPCC (additionally, see this CUDA version table and the compute capabilities for GPUs for more information). Using the maximum version or lower will ensure that code compiled with that version of CUDA will be able to run on the associated GPU by default.
GPU | Compute capability | Minimum CUDA version | Maximum CUDA version* |
---|---|---|---|
k80 | 3.7 | 4.2 | 10.2 |
v100 | 7.0 | 9.0 | |
v100s | 7.0 | 9.0 | |
a100 | 8.0 | 11.0 | |
l40s | 8.9 | 11.8 | |
h200 | 9.0 | 11.8 | |
*Support beyond the maximum version
CUDA versions greater than the maximum CUDA version listed above may still work for certain GPUs (for example, CUDA versions up to 11.4 can work with compute capability 3.7), but will not compile for them by default. In these situations, you should explicitly state the minimum compute capability when compiling as discussed below.
Compiling for maximum compatibility
The minimum version listed in the table above is necessary to support all capabilities of a GPU. However, the drivers for NVIDIA GPUs are generally backwards compatible with earlier versions of PTX code (see below). This means you can use an earlier CUDA version to compile code, and it will be able to run on any newer GPUs. For more information, see Compiling for specific GPUs below.
CUDA code
CUDA Fortran (that is, Fortran code with CUDA-specific extensions) is compiled using the same nvfortran compiler described in the NVHPC section of our compiler reference.
CUDA C/C++ code (that is, C/C++ code with CUDA-specific extensions to run kernels on a GPU device) is compiled using the nvcc compiler. The recommended way to access this compiler is to load the NVHPC module as described in our compiler reference, or a foss or GCC toolchain with a CUDA module.
Behind the scenes, nvcc uses gcc/g++ to compile the C/C++ code itself. The version used is whatever is on your path. This means that if you load CUDA by itself, you will be using the system version of gcc/g++, which is much older than the versions available in the module system. For this reason, we recommend either loading the NVHPC modules (which will use the version of gcc/g++ indicated by the module version suffix) or loading a foss or GCC module in addition to CUDA.
module load NVHPC/23.7-CUDA-12.1.1
nvcc foo.cu -o foo
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.cuf -o foo
./foo
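For reference, here is a minimal sketch of what the foo.cu compiled by the nvcc example above might contain. This particular vector-addition example is illustrative only and not specific to the HPCC:
#include <cstdio>

// Element-wise vector addition kernel
__global__ void add(const float *x, const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *z;
    // Managed (unified) memory keeps the example short; explicit
    // cudaMalloc/cudaMemcpy calls would also work
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&z, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    add<<<(n + 255) / 256, 256>>>(x, y, z, n);  // Launch one thread per element
    cudaDeviceSynchronize();
    printf("z[0] = %f\n", z[0]);                // Expect 3.0
    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}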
Compiling for specific GPUs
GPU code is compiled in two stages:
- Compiling into a virtual instruction set like assembly code (called PTX)
- Compiling the virtual instructions into binary code (called a cubin) that actually runs on the GPU
These stages are controlled by the compute capability specified to nvcc (in the previous examples, this is set implicitly to 5.2), and nvcc can embed the results of stage 1, stage 2, or both for various compute capabilities in the final executable. If the stage 2 output is missing for the compute capability of the GPU that the code is executed on, the NVIDIA driver will just-in-time (JIT) compile any stage 1 code it finds at runtime into stage 2 code appropriate for that GPU.
In general, you should use the lowest compute capability your code supports in stage 1 (for the widest compatibility with future JIT compilation) and the compute capability of the target GPU in stage 2 (for the best optimization).
To specify compute capability x.y for stage 1, use the -arch=compute_xy flag; for stage 2, use the -code=sm_xy flag. You can also specify -code=compute_xy to embed the output of stage 1 into the final binary for JIT compilation. Multiple compute_xy and sm_xy values can be supplied to -code in a comma-separated list.
See NVIDIA's documentation on GPU compilation for more information and examples.
Compiling for k80 GPUs with CUDA 11.7
As discussed above, CUDA 11.7 will not compile for k80 GPUs by default. However, we can specify the corresponding compute capabilities explicitly:
module load NVHPC/22.11-CUDA-11.7.0 # Any CUDA >=12 will not work
nvcc foo.cu -o foo \
-arch=compute_35 -code=compute_37,sm_37,sm_70
./foo
The resulting executable will be able to run on all (double precision) GPUs at ICER:
- The k80 by compiling compute_37 (and higher)-compatible PTX into an sm_37-compatible cubin.
- The v100 by compiling compute_37 (and higher)-compatible PTX into an sm_70-compatible cubin.
- The a100 by JIT compiling the embedded compute_35 (and higher)-compatible PTX at runtime.
- The h200 by JIT compiling the embedded compute_35 (and higher)-compatible PTX at runtime.
These flags can be modified to reduce the binary size (by removing sm_* targets), improve run time by removing the need for JIT compilation (by adding sm_* targets), or use optimized instructions (by adding compute_* targets).
CUDA libraries
The NVHPC
and
CUDA
modules offer many CUDA
accelerated math
libraries, like
cuBLAS, cuSOLVER and cuFFT.
For C/C++ code, since using many these libraries do not require writing CUDA
code, using the nvcc
compiler is optional. We refer to the documentation for
the specific libraries for how to link them, but give examples of linking
cuBLAS
and cuFFT
with nvcc
and the GNU compilers directly.
Take care to link to libraries that are all distributed in the same version of CUDA, to use a version of CUDA compatible with the desired GPUs, and (if using shared libraries) to load the same version of CUDA when running the executable.
Please note that not all versions of NVHPC and CUDA are available on all types of nodes in the HPCC. Please see our documentation on Using different cluster architectures for details, and reference the NVHPC and CUDA module listings for availability.
The examples below will work on all nodes other than those with k80 GPUs. Please see the example in Compiling for specific GPUs for steps to use k80 GPUs.
module load NVHPC/23.7-CUDA-12.1.1
nvcc foo.c -o foo \
-lcublas \ # For cuBLAS
-lcufft # For cuFFT
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.f90 -o foo \
-cudalib=cublas,cufft # For cuBLAS and cuFFT
./foo
Fortran support for linking to the CUDA libraries is limited to NVIDIA's compilers. See NVIDIA's Fortran CUDA interfaces for more information.
module load foss/2023a CUDA/12.1.1
gcc foo.c -o foo \
-lcudart \ # For CUDA runtime routines like memory management
-lcublas \ # For cuBLAS
-lcufft # For cuFFT
./foo
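As a point of reference, here is a minimal sketch of a foo.c that calls cuBLAS. This saxpy example is illustrative only (not part of ICER's documentation); it can be built with either the nvcc or gcc commands above, where the gcc version needs -lcudart for the memory-management calls and both need -lcublas:
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 4;
    float x[] = {1, 2, 3, 4}, y[] = {1, 1, 1, 1};
    float *dx, *dy;

    /* Allocate device memory and copy the inputs over (CUDA runtime, -lcudart) */
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Compute y = 2*x + y on the GPU (cuBLAS, -lcublas) */
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%g ", y[i]);  /* Expect 3 5 7 9 */
    printf("\n");
    cudaFree(dx); cudaFree(dy);
    return 0;
}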
cuDNN
A popular set of CUDA libraries not included in the CUDA toolkit is cuDNN. On the HPCC, cuDNN is available as a module. See the available versions in the cuDNN module listing or search for available versions with
module spider cuDNN
and choose one which uses a version of CUDA compatible with any other GPU-based work you may be doing. See NVIDIA's API reference for which libraries to link against.
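For a quick check that cuDNN is set up correctly, a minimal sketch along these lines can be compiled by linking against -lcudnn (with a cuDNN module and a matching CUDA module loaded). This example is illustrative only, not taken from NVIDIA's documentation:
#include <stdio.h>
#include <cudnn.h>

int main(void) {
    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);  /* Initializes cuDNN on the current GPU */
    if (status != CUDNN_STATUS_SUCCESS) {
        printf("cuDNN error: %s\n", cudnnGetErrorString(status));
        return 1;
    }
    printf("cuDNN version: %zu\n", cudnnGetVersion());
    cudnnDestroy(handle);
    return 0;
}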
GPU offloading
Parts of code can be offloaded onto the GPUs using directive-based APIs like OpenMP and OpenACC. Currently, the recommended approach is to use OpenACC with the NVIDIA HPC SDK compilers.
OpenACC
Offloading with OpenACC is primarily supported by the NVIDIA's compilers in the
NVHPC
modules. Using the -acc
option will activate OpenACC and run kernels
by default on the GPU.
The specific compute capabilities of the desired target GPUs can also be passed to compile compatible binaries for the respective GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run (so that embedded PTX code can be JIT compiled to an appropriate binary).
The examples below will work on all nodes other than those with k80 GPUs. Please see the example in Compiling for specific GPUs for steps to use k80 GPUs.
module load NVHPC/23.7-CUDA-12.1.1
nvc foo.c -o foo \
-acc \ # To use OpenACC (default is on GPU)
-gpu=cc70,cc80,cc90 # To embed GPU code for various compute capabilities
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.f90 -o foo \
-acc \ # To use OpenACC (default is on GPU)
-gpu=cc70,cc80,cc90 # To embed GPU code for various compute capabilities
./foo
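For reference, a minimal sketch of what an OpenACC-annotated foo.c might look like is shown below. This is illustrative only; the same file also works with the GCC commands further down:
#include <stdio.h>
#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    /* This loop runs on the GPU when compiled with -acc (or GCC's -fopenacc) */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] + y[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}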
The GCC modules also provide support for offloading OpenACC code to GPUs. Support for these compilers is very limited, and simple tests indicate that they can suffer from reduced performance in comparison to NVIDIA's compilers or running multi-threaded (or, in extreme cases, single-threaded) on the CPU. Experiment with these versions at your own risk.
module load GCC/12.3.0
gcc foo.c -o foo \
-fopenacc \ # To activate OpenACC instructions
-foffload=nvptx-none="sm_70 -lm"
# Previous option offloads code to NVIDIA GPUs, specifies compute
# capability to embed gpu code for, and ensure that offloaded code has
# access to math libraries
./foo
module load GCC/12.3.0
gfortran foo.f90 -o foo \
-fopenacc \ # To activate OpenACC instructions
-foffload=nvptx-none="-march=sm_70 -lm -lgfortran"
# Previous option offloads code to NVIDIA GPUs, specifies the compute
# capability to embed GPU code for, and ensures that offloaded code has
# access to math and Fortran libraries
./foo
OpenMP
Offloading with OpenMP is primarily supported by NVIDIA's compilers in the NVHPC modules. Using the -mp=gpu option will set OpenMP code to use a GPU as a target device.
The specific compute capabilities of the desired target GPUs can also be passed to compile compatible binaries for the respective GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run (so that embedded PTX code can be JIT compiled to an appropriate binary).
The examples below will work on all nodes other than those with k80 GPUs. Please see the example in Compiling for specific GPUs for steps to use k80 GPUs.
module load NVHPC/23.7-CUDA-12.1.1
nvc foo.c -o foo \
-mp=gpu \ # To use OpenMP on GPU
-gpu=cc70,cc80,cc90 # Embed GPU code for compute capabilities
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.f90 -o foo \
-mp=gpu \ # To use OpenMP on GPU
-gpu=cc70,cc80,cc90 # Embed GPU code for compute capabilities
./foo
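For reference, a minimal sketch of what an OpenMP-offloaded foo.c might look like is shown below. This is illustrative only; the same file also works with the GCC commands further down:
#include <stdio.h>
#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    /* This loop runs on the GPU when compiled with -mp=gpu (or GCC's -fopenmp plus -foffload) */
    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] + y[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}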
The GCC modules also provide support for offloading OpenMP code to GPUs. Support for these compilers is very limited, and simple tests indicate that they can suffer from reduced performance in comparison to NVIDIA's compilers or running multi-threaded (or, in extreme cases, single-threaded) on the CPU. Experiment with these versions at your own risk.
module load GCC/12.3.0
gcc foo.c -o foo \
-fopenmp \ # To activate OpenMP instructions
-foffload=nvptx-none="-march=sm_70 -lm"
# Previous option offloads code to NVIDIA GPUs, specifies compute
# capability to embed GPU code for, and ensures that offloaded code has
# access to math libraries
./foo
module load GCC/12.3.0
gfortran foo.f90 -o foo \
-fopenmp \ # To activate OpenMP instructions
-foffload=nvptx-none="-march=sm_70 -lm -lgfortran"
# Previous option offloads code to NVIDIA GPUs, specifies the compute
# capability to embed GPU code for, and ensures that offloaded code has
# access to math and Fortran libraries
./foo
Other GPU-specific compilation options can be passed in the quotes following -foffload=nvptx-none=, e.g., -foffload=nvptx-none="-march=sm_70 -lm -latomic -O3".