Compiling for GPUs
There are a few different ways for code to access GPUs. This page will focus only on compiling code that uses those techniques rather than explaining the techniques themselves.
For more general information on compilers on the HPCC, see our reference page.
GPU compatibility
Because GPU hardware can change dramatically from device to device, it's important to keep in mind the GPUs your code will run on when compiling. In practical terms on the HPCC, this means selecting a version of CUDA that is best suited for your target GPU.
See the table below for the CUDA compute capability and associated CUDA versions of the various GPUs available on the HPCC (additionally, see this CUDA version table and the compute capabilities for GPUs for more information). Using the maximum version or lower will ensure that code compiled with that version of CUDA will be able to run on the associated GPU by default.
GPU | Compute capability | Minimum CUDA version | Maximum CUDA version* |
---|---|---|---|
k80 | 3.7 | 4.2 | 10.2 |
v100 | 7.0 | 9.0 | |
v100s | 7.0 | 9.0 | |
a100 | 8.0 | 11.0 | |
l40s | 8.9 | 11.8 | |
h200 | 9.0 | 11.8 | |
*Support beyond the maximum version
CUDA versions greater than the maximum CUDA version listed above may still work for certain GPUs (for example, CUDA versions up to 11.4 can work with compute capability 3.7), but will not compile for them by default. In these situations, you should explicitly state the minimum compute capability when compiling as discussed below.
Compiling for maximum compatibility
The minimum version listed in the table above is necessary to support all capabilities of a GPU. However, the drivers for NVIDIA GPUs are generally backwards compatible with earlier versions of PTX code (see below). This means you can use an earlier CUDA version to compile code, and it will be able to run on any newer GPUs. For more information, see Compiling for specific GPUs below.
CUDA code
CUDA Fortran (that is, Fortran code with CUDA-specific extensions) is compiled using the same nvfortran compiler described in the NVHPC section of our compiler reference.
CUDA C/C++ code (that is, C/C++ code with CUDA-specific extensions to run kernels on a GPU device) is compiled using the nvcc compiler. The recommended way to access this compiler is to load the NVHPC module as described in our compiler reference, or a foss or GCC toolchain with a CUDA module.
Behind the scenes, nvcc uses gcc/g++ to compile the C/C++ code itself. The version used is whatever is on your path. This means that if you load CUDA by itself, you will be using the system version of gcc/g++, which is much older than the versions available in the module system. For this reason, we recommend either loading the NVHPC modules (which will use the version of gcc/g++ indicated by the module version suffix) or loading a foss or GCC module in addition to CUDA.
module load NVHPC/23.7-CUDA-12.1.1
nvcc foo.cu -o foo
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.cuf -o foo
./foo
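For reference, here is a minimal sketch of what the foo.cu compiled by the nvcc example above might contain. This particular vector-addition example is illustrative only and not specific to the HPCC:
#include <cstdio>

// Element-wise vector addition kernel
__global__ void add(const float *x, const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *z;
    // Managed (unified) memory keeps the example short; explicit
    // cudaMalloc/cudaMemcpy calls would also work
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&z, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    add<<<(n + 255) / 256, 256>>>(x, y, z, n);  // Launch one thread per element
    cudaDeviceSynchronize();
    printf("z[0] = %f\n", z[0]);                // Expect 3.0
    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}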
Compiling for specific GPUs
GPU code is compiled in two stages:
- Compiling into a virtual instruction set like assembly code (called PTX)
- Compiling the virtual instructions into binary code (called a cubin) that actually runs on the GPU
These stages are controlled by the compute capability specified to nvcc (in the previous examples, this is set implicitly to 5.2), and nvcc can embed the results of stage 1, stage 2, or both for various compute capabilities in the final executable. If the stage 2 output is missing for the compute capability of the GPU that the code is executed on, the NVIDIA driver will just-in-time (JIT) compile any stage 1 code it finds at runtime into stage 2 code appropriate for that GPU.
In general, you should use the lowest compute capability your code supports in stage 1 (for the widest compatibility with future JIT compilation) and the compute capability of the target GPU in stage 2 (for the best optimization).
To specify compute capability x.y for stage 1, use the -arch=compute_xy flag; for stage 2, use the -code=sm_xy flag. You can also specify -code=compute_xy to embed the output of stage 1 into the final binary for JIT compilation. Multiple compute_xy and sm_xy values can be supplied to -code in a comma-separated list.
See NVIDIA's documentation on GPU compilation for more information and examples.
Compiling for k80 GPUs with CUDA 11.7
As discussed above, CUDA 11.7 will not compile for k80 GPUs by default. However, we can specify the corresponding compute capabilities explicitly:
module load NVHPC/22.11-CUDA-11.7.0 # Any CUDA >=12 will not work
nvcc foo.cu -o foo \
-arch=compute_35 -code=compute_37,sm_37,sm_70
./foo
The resulting executable will be able to run on all (double precision) GPUs at ICER:
- The k80 by compiling compute_37 (and higher)-compatible PTX into an sm_37-compatible cubin.
- The v100 by compiling compute_37 (and higher)-compatible PTX into an sm_70-compatible cubin.
- The a100 by JIT compiling the embedded compute_35 (and higher)-compatible PTX at runtime.
- The h200 by JIT compiling the embedded compute_35 (and higher)-compatible PTX at runtime.
These flags can be modified to reduce the binary size (by removing sm_* targets), improve run time by removing the need for JIT compilation (by adding sm_* targets), or use optimized instructions (by adding compute_* targets).
CUDA libraries
The NVHPC
and
CUDA
modules offer many CUDA
accelerated math
libraries, like
cuBLAS, cuSOLVER and cuFFT.
For C/C++ code, since using many these libraries do not require writing CUDA
code, using the nvcc
compiler is optional. We refer to the documentation for
the specific libraries for how to link them, but give examples of linking
cuBLAS
and cuFFT
with nvcc
and the GNU compilers directly.
Take care to link to libraries that are all distributed in the same version of CUDA, to use a version of CUDA compatible with the desired GPUs, and (if using shared libraries) to load the same version of CUDA when running the executable.
Please note that not all versions of NVHPC and CUDA are available on all types of nodes in the HPCC. Please see our documentation on Using different cluster architectures for details, and reference the NVHPC and CUDA module listings for availability.
The examples below will work on all nodes other than those with k80 GPUs. Please see the example in Compiling for specific GPUs for steps to use k80 GPUs.
module load NVHPC/23.7-CUDA-12.1.1
nvcc foo.c -o foo \
-lcublas \ # For cuBLAS
-lcufft # For cuFFT
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.f90 -o foo \
-cudalib=cublas,cufft # For cuBLAS and cuFFT
./foo
Fortran support for linking to the CUDA libraries is limited to NVIDIA's compilers. See NVIDIA's Fortran CUDA interfaces for more information.
module load foss/2023a CUDA/12.1.1
gcc foo.c -o foo \
-lcudart \ # For CUDA runtime routines like memory management
-lcublas \ # For cuBLAS
-lcufft # For cuFFT
./foo
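As a point of reference, here is a minimal sketch of a foo.c that calls cuBLAS. This saxpy example is illustrative only (not part of ICER's documentation); it can be built with either the nvcc or gcc commands above, where the gcc version needs -lcudart for the memory-management calls and both need -lcublas:
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 4;
    float x[] = {1, 2, 3, 4}, y[] = {1, 1, 1, 1};
    float *dx, *dy;

    /* Allocate device memory and copy the inputs over (CUDA runtime, -lcudart) */
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Compute y = 2*x + y on the GPU (cuBLAS, -lcublas) */
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%g ", y[i]);  /* Expect 3 5 7 9 */
    printf("\n");
    cudaFree(dx); cudaFree(dy);
    return 0;
}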
cuDNN
A popular set of CUDA libraries not included in the CUDA toolkit is cuDNN. On the HPCC, cuDNN is available as a module. See the available versions in the cuDNN module listing or search for available versions with
module spider cuDNN
and choose one which uses a version of CUDA compatible with any other GPU-based work you may be doing. See NVIDIA's API reference for which libraries to link against.
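For a quick check that cuDNN is set up correctly, a minimal sketch along these lines can be compiled by linking against -lcudnn (with a cuDNN module and a matching CUDA module loaded). This example is illustrative only, not taken from NVIDIA's documentation:
#include <stdio.h>
#include <cudnn.h>

int main(void) {
    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);  /* Initializes cuDNN on the current GPU */
    if (status != CUDNN_STATUS_SUCCESS) {
        printf("cuDNN error: %s\n", cudnnGetErrorString(status));
        return 1;
    }
    printf("cuDNN version: %zu\n", cudnnGetVersion());
    cudnnDestroy(handle);
    return 0;
}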
GPU offloading
Parts of code can be offloaded onto the GPUs using directive-based APIs like OpenMP and OpenACC. Currently, the recommended approach is to use OpenACC with the NVIDIA HPC SDK compilers.
OpenACC
Offloading with OpenACC is primarily supported by the NVIDIA's compilers in the
NVHPC
modules. Using the -acc
option will activate OpenACC and run kernels
by default on the GPU.
The specific compute capabilities of the desired target GPUs can also be passed to compile compatible binaries for the respective GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run (so that embedded PTX code can be JIT compiled to an appropriate binary).
The examples below will work on all nodes other than those with k80 GPUs. Please see the example in Compiling for specific GPUs for steps to use k80 GPUs.
module load NVHPC/23.7-CUDA-12.1.1
nvc foo.c -o foo \
-acc \ # To use OpenACC (default is on GPU)
-gpu=cc70,cc80,cc90 # To embed GPU code for various compute capabilities
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.f90 -o foo \
-acc \ # To use OpenACC (default is on GPU)
-gpu=cc70,cc80,cc90 # To embed GPU code for various compute capabilities
./foo
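For reference, a minimal sketch of what an OpenACC-annotated foo.c might look like is shown below. This is illustrative only; the same file also works with the GCC commands further down:
#include <stdio.h>
#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    /* This loop runs on the GPU when compiled with -acc (or GCC's -fopenacc) */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] + y[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}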
The GCC modules also provide support for offloading OpenACC code to GPUs. Support for these compilers is very limited, and simple tests indicate that they can suffer from reduced performance in comparison to NVIDIA's compilers or running multi-threaded (or, in extreme cases, single-threaded) on the CPU. Experiment with these versions at your own risk.
module load GCC/12.3.0
gcc foo.c -o foo \
-fopenacc \ # To activate OpenACC instructions
-foffload=nvptx-none="sm_70 -lm"
# Previous option offloads code to NVIDIA GPUs, specifies compute
# capability to embed gpu code for, and ensure that offloaded code has
# access to math libraries
./foo
module load GCC/12.3.0
gfortran foo.f90 -o foo \
-fopenacc \ # To activate OpenACC instructions
-foffload=nvptx-none="-march=sm_70 -lm -lgfortran"
# Previous option offloads code to NVIDIA GPUs, specifies the compute
# capability to embed GPU code for, and ensures that offloaded code has
# access to math and Fortran libraries
./foo
OpenMP
Offloading with OpenMP is primarily supported by NVIDIA's compilers in the NVHPC modules. Using the -mp=gpu option will set OpenMP code to use a GPU as a target device.
The specific compute capabilities of the desired target GPUs can also be passed to compile compatible binaries for the respective GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run (so that embedded PTX code can be JIT compiled to an appropriate binary).
The examples below will work on all nodes other than those with k80 GPUs. Please see the example in Compiling for specific GPUs for steps to use k80 GPUs.
module load NVHPC/23.7-CUDA-12.1.1
nvc foo.c -o foo \
-mp=gpu \ # To use OpenMP on GPU
-gpu=cc70,cc80,cc90 # Embed GPU code for compute capabilities
./foo
module load NVHPC/23.7-CUDA-12.1.1
nvfortran foo.f90 -o foo \
-mp=gpu \ # To use OpenMP on GPU
-gpu=cc70,cc80,cc90 # Embed GPU code for compute capabilities
./foo
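For reference, a minimal sketch of what an OpenMP-offloaded foo.c might look like is shown below. This is illustrative only; the same file also works with the GCC commands further down:
#include <stdio.h>
#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    /* This loop runs on the GPU when compiled with -mp=gpu (or GCC's -fopenmp plus -foffload) */
    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] + y[i];

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}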
The GCC modules also provide support for offloading OpenMP code to GPUs. Support for these compilers is very limited, and simple tests indicate that they can suffer from reduced performance in comparison to NVIDIA's compilers or running multi-threaded (or, in extreme cases, single-threaded) on the CPU. Experiment with these versions at your own risk.
module load GCC/12.3.0
gcc foo.c -o foo \
-fopenmp \ # To activate OpenMP instructions
-foffload=nvptx-none="-march=sm_70 -lm"
# Previous option offloads code to NVIDIA GPUs, specifies compute
# capability to embed GPU code for, and ensures that offloaded code has
# access to math libraries
./foo
module load GCC/12.3.0
gfortran foo.f90 -o foo \
-fopenmp \ # To activate OpenMP instructions
-foffload=nvptx-none="-march=sm_70 -lm -lgfortran"
# Previous option offloads code to NVIDIA GPUs, specifies the compute
# capability to embed GPU code for, and ensures that offloaded code has
# access to math and Fortran libraries
./foo
Other GPU-specific compilation options can be passed in the quotes following -foffload=nvptx-none=, e.g., -foffload=nvptx-none="-march=sm_70 -lm -latomic -O3".