Compiling for GPUs

There are a few different ways for code to access GPUs. This page will focus only on compiling code that uses those techniques rather than explaining the techniques themselves.

For more general information on compilers on the HPCC, see our reference page.

GPU compatibility

Because GPU hardware can change dramatically from device to device, it's important to keep in mind the GPUs your code will run on when compiling. In practical terms on the HPCC, this means selecting the version of CUDA that is best suited for your target GPU.

See the table below for the CUDA compute capability and associated CUDA versions of the various GPUs available on the HPCC (additionally, see this CUDA version table and the compute capabilities for GPUs for more information). Using the maximum version or lower will ensure that code compiled with that version of CUDA will be able to run on the associated GPU by default.

GPU  | Compute capability | Minimum CUDA version | Maximum CUDA version*
---- | ------------------ | -------------------- | ---------------------
k20  | 3.5                | 4.2                  | 10.2
k80  | 3.7                | 4.2                  | 10.2
v100 | 7.0                | 9.0                  |
a100 | 8.0                | 11.0                 |

*Support beyond the maximum version

CUDA versions greater than the maximum CUDA version listed above may still work for certain GPUs (for example, CUDA versions up to 11.4 can work with compute capabilities 3.5 and 3.7), but will not compile for them by default. In these situations, you should explicitly state the minimum compute capability when compiling as discussed below.
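
If you are unsure which compute capabilities a particular CUDA toolkit can still target, recent versions of nvcc can list them directly. A quick check, using the CUDA 11.4 module that appears elsewhere on this page:

$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvcc --list-gpu-arch  # virtual architectures (PTX targets) this nvcc supports
$ nvcc --list-gpu-code  # real architectures (cubin targets) this nvcc supports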

Compiling for maximum compatibility

The minimum version listed in the table above is necessary to support all capabilities of a GPU. However, the drivers for NVIDIA GPUs are generally backwards compatible with earlier versions of PTX code (see below). This means you can use an earlier CUDA version to compile code, and it will be able to run on any newer GPUs. For more information, see Compiling for specific GPUs below.

For a good compromise between features and compatibility with all GPUs at ICER, CUDA 9.x or 10.x will work well.
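
To see which CUDA toolkit versions are installed before picking one, you can search the module system (the same module spider approach used elsewhere on this page):

$ module spider CUDA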

CUDA code

CUDA Fortran (that is, Fortran code with CUDA-specific extensions) is compiled using the same nvfortran compiler described in the NVHPC section of our compiler reference.

CUDA C/C++ code (that is, C/C++ code with CUDA-specific extensions to run kernels on a GPU device) is compiled using the nvcc compiler. The recommended way to access this compiler is to load an NVHPC module as described in our compiler reference, or a fosscuda or gcccuda toolchain. Alternatively, nvcc is also included in any of the CUDA modules.

Behind the scenes, nvcc uses gcc/g++ to compile the C/C++ code itself, using whichever version is on your path. This means that if you load a CUDA module by itself, you will be using the system version of gcc/g++, which is much older than the versions available in the module system. For this reason, we recommend loading either an NVHPC module (which will use the version of gcc/g++ given as a suffix in the module version) or a fosscuda or gcccuda module, which will load an appropriate version of CUDA along with the corresponding foss or GCC toolchain.
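
To double-check which host compiler nvcc will pick up, you can inspect your path after loading the module. A quick sanity check (assuming the NVHPC module used below); gcc should resolve to the GCCcore module rather than the system compiler:

$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ which gcc      # should point into the GCCcore module, not /usr/bin
$ gcc --version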

CUDA C/C++ with NVHPC
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvcc foo.cu -o foo
$ ./foo
CUDA Fortran with NVHPC
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvfortran foo.cuf -o foo
$ ./foo

Compiling for specific GPUs

GPU code is compiled in two stages:

  1. Compiling into a virtual instruction set like assembly code (called PTX)
  2. Compiling the virtual instructions into binary code (called a cubin) that actually runs on the GPU

These stages are controlled by the compute capability specified to nvcc (in the previous examples, this is set implicitly to 5.2) and nvcc can embed the results of stage 1, stage 2, or both for various compute capabilities in the final executable. If the stage 2 output is missing for the compute capability of the GPU that the code is executed on, the NVIDIA driver will just-in-time (JIT) compile any stage 1 code it finds at runtime into stage 2 code appropriate for that GPU.

In general, you should use the lowest compute capability your code supports in stage 1 (for the widest compatibility with future JIT compilation) and the compute capability of the target GPU in stage 2 (for the best optimization).

To specify compute capability x.y for stage 1, use the -arch=compute_xy flag, and for stage 2, use the -code=sm_xy flag. You can also specify -code=compute_xy to embed the output of stage 1 into the final binary for JIT compilation. Multiple compute_xy and sm_xy values can be supplied to -code in a comma separated list.

See NVIDIA's documentation on GPU compilation for more information and examples.
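
For example, to target only the v100 GPUs (compute capability 7.0) while also embedding the PTX so that newer GPUs can JIT compile it, the compile line might look like this (a sketch using the same CUDA 11.4 module as above):

$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvcc foo.cu -o foo \
    -arch=compute_70 -code=compute_70,sm_70
$ ./foo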

Compiling for k20 and k80 GPUs with CUDA 11.4

As discussed above, CUDA 11.4 will not compile for k20 and k80 GPUs by default. However, we can specify the corresponding compute capabilities explicitly:

$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvcc foo.cu -o foo \
    -arch=compute_35 -code=compute_35,sm_35,sm_37,sm_70
$ ./foo

The resulting executable will be able to run on all GPUs at ICER:

  • The k20 by compiling compute_35 (and higher)-compatible PTX into an sm_35-compatible cubin.
  • The k80 by compiling compute_35 (and higher)-compatible PTX into an sm_37-compatible cubin.
  • The v100 by compiling compute_35 (and higher)-compatible PTX into an sm_70-compatible cubin.
  • The a100 by JIT compiling the embedded compute_35 (and higher)-compatible PTX at runtime.
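
To confirm which PTX and cubin architectures actually ended up in the executable, you can inspect it with cuobjdump, which ships with the CUDA toolkit (a quick check, assuming the foo executable built above):

$ cuobjdump --list-ptx foo  # embedded PTX (stage 1 output)
$ cuobjdump --list-elf foo  # embedded cubins (stage 2 output)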

CUDA libraries

The NVHPC and CUDA modules offer many CUDA-accelerated math libraries, like cuBLAS, cuSOLVER, and cuFFT.

For C/C++ code, since many of these libraries do not require writing CUDA code, using the nvcc compiler is optional. We refer you to the documentation for the specific libraries for full details on linking, but give examples below of linking cuBLAS and cuFFT with nvcc and with the GNU compilers directly.

Take care to link against libraries that are all distributed with the same version of CUDA, to use a version of CUDA compatible with the desired GPUs, and (if using shared libraries) to load the same version of CUDA when running the executable.

Using CUDA shared libraries with NVHPC (C/C++)
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvcc foo.c -o foo \
    -lcublas \  # For cuBLAS
    -lcufft  # For cuFFT
$ ./foo
Using CUDA shared libraries with NVHPC (Fortran)
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvfortran foo.f90 -o foo \
    -cudalib=cublas,cufft  # For cuBLAS and cuFFT
$ ./foo

Fortran support for linking to the CUDA libraries is limited to NVIDIA's compilers. See NVIDIA's Fortran CUDA interfaces for more information.

Using CUDA shared libraries with fosscuda
$ module load fosscuda/2020a
$ gcc foo.c -o foo \
    -lcudart \  # For CUDA runtime routines like memory management
    -lcublas \  # For cuBLAS
    -lcufft  # For cuFFT
$ ./foo
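
To check which CUDA shared libraries (and therefore which CUDA version) a dynamically linked executable will actually pick up at run time, you can inspect it with ldd. A quick sanity check, assuming the foo executable built above and the corresponding module still loaded:

$ ldd ./foo | grep -E 'cudart|cublas|cufft'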

cuDNN

A popular set of CUDA libraries not included in the CUDA toolkit is cuDNN. On the HPCC, cuDNN is available as a module. Search for available versions with

$ module spider cuDNN

and choose one which uses a version of CUDA compatible with any other GPU-based work you may be doing. See NVIDIA's API reference for which libraries to link against.
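
As a rough sketch of what building against cuDNN might look like once you have chosen a module (the version below is a placeholder; substitute one reported by module spider), the main library to link against is libcudnn:

$ module load cuDNN/<version>  # placeholder: pick a version from module spider
$ gcc foo.c -o foo \
    -lcudnn \  # For cuDNN
    -lcudart  # For the CUDA runtime
$ ./foo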

GPU offloading

Parts of code can be offloaded onto the GPUs using directive-based APIs like OpenMP and OpenACC. Currently, the recommended approach is to use OpenACC with the NVIDIA HPC SDK compilers.

OpenACC

Offloading with OpenACC is primarily supported by NVIDIA's compilers in the NVHPC modules. Using the -acc option will activate OpenACC and, by default, run kernels on the GPU.

The compute capabilities of the desired target GPUs can also be passed (via the -gpu option) to compile compatible binaries for those GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run, so that the embedded PTX code can be JIT compiled to an appropriate binary.

GPU offloading with OpenACC with NVHPC (C/C++)
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvc foo.c -o foo \
    -acc \  # To use OpenACC (default is on GPU)
    -gpu=cc35,cc37,cc70,cc80  # To embed GPU code for various compute capabilities
$ ./foo
GPU offloading with OpenACC with NVHPC (Fortran)
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvfortran foo.f90 -o foo \
    -acc \  # To use OpenACC (default is on GPU)
    -gpu=cc35,cc37,cc70,cc80  # To embed GPU code for various compute capabilities
$ ./foo
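
To see which loops the compiler actually generated GPU kernels for, you can request accelerator feedback at compile time with NVIDIA's -Minfo=accel option (a sketch, reusing the C example above):

$ nvc foo.c -o foo -acc -Minfo=accel  # prints a report of what was offloaded and how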

The same versions of the GCC modules discussed in the OpenMP section below also support OpenACC, with the same caveats. Experiment with these versions at your own risk.

GPU offloading with OpenACC with GCC (C/C++)
$ module load GCC/11.1.0-cuda-9.2.88
$ gcc foo.c -o foo \
    -fopenacc \  # To activate OpenACC directives
    -foffload=nvptx-none="-lm"  # To offload code to NVIDIA GPUs, and
                                # ensure that offloaded code has access
                                # to math libraries
$ ./foo
GPU offloading with OpenACC with GCC (Fortran)
$ module load GCC/11.1.0-cuda-9.2.88
$ gfortran foo.f90 -o foo \
    -fopenacc \  # To activate OpenACC directives
    -foffload=nvptx-none="-lm -lgfortran"  # To offload code to NVIDIA
                                           # GPUs, and ensure that
                                           # offloaded code has access
                                           # to math and Fortran libraries
$ ./foo

OpenMP

Offloading with OpenMP is primarily supported by NVIDIA's compilers in the NVHPC modules. Using the -mp=gpu option will set OpenMP code to use a GPU as its target device.

The compute capabilities of the desired target GPUs can also be passed (via the -gpu option) to compile compatible binaries for those GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run, so that the embedded PTX code can be JIT compiled to an appropriate binary.

GPU offloading with OpenMP with NVHPC (C/C++)
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvc foo.c -o foo \
    -mp=gpu \  # To use OpenMP on GPU
    -gpu=cc35,cc37,cc70,cc80  # Embed GPU code for compute capabilities
$ ./foo
GPU offloading with OpenMP with NVHPC (Fortran)
$ module load NVHPC/21.9-GCCcore-10.3.0-CUDA-11.4
$ nvfortran foo.f90 -o foo \
    -mp=gpu \  # To use OpenMP on GPU
    -gpu=cc35,cc37,cc70,cc80  # Embed GPU code for compute capabilities
$ ./foo
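
If you would rather have the program fail loudly than silently fall back to running on the CPU when no GPU is available, the standard OpenMP environment variable OMP_TARGET_OFFLOAD can be set at run time (a sketch):

$ OMP_TARGET_OFFLOAD=MANDATORY ./foo  # abort if target regions cannot run on the GPU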

A few versions of the GCC modules available on the HPCC have highly experimental support for offloading OpenMP code to GPUs. These versions include a -cuda or -offload suffix in the version name. Use

$ module spider GCC

to search for versions including these suffixes.

Support for these compilers is very limited, and simple tests indicate that they can suffer from reduced performance in comparison to NVIDIA's compilers or running multi-threaded (or, in extreme cases, single-threaded) on the CPU. Experiment with these versions at your own risk.

GPU offloading with OpenMP with GCC (C/C++)
$ module load GCC/11.1.0-cuda-9.2.88
$ gcc foo.c -o foo \
    -fopenmp \  # To activate OpenMP directives
    -foffload=nvptx-none="-lm"  # To offload code to NVIDIA GPUs, and
                                # ensure that offloaded code has access
                                # to math libraries
$ ./foo
GPU offloading with OpenMP with GCC (Fortran)
$ module load GCC/11.1.0-cuda-9.2.88
$ gfortran foo.f90 -o foo \
    -fopenmp \  # To activate OpenMP directives
    -foffload=nvptx-none="-lm -lgfortran"  # To offload code to NVIDIA
                                           # GPUs, and ensure that
                                           # offloaded code has access
                                           # to math and Fortran libraries
$ ./foo

Other GPU-specific compilation options can be passed in the quotes following -foffload=nvptx-none=, e.g., -foffload=nvptx-none="-lm -latomic -O3".