Compiling for GPUs
There are a few different ways for code to access GPUs. This page will focus only on compiling code that uses those techniques rather than explaining the techniques themselves.
For more general information on compilers on the HPCC, see our reference page.
GPU compatibility
Because GPU hardware can change dramatically from device to device, it's important to keep in mind the GPUs your code will run on when compiling. In practical terms on the HPCC, this means selecting the version of CUDA that is best suited for your target GPU.
See the table below for the CUDA compute capability and associated CUDA versions of the various GPUs available on the HPCC (additionally, see this CUDA version table and the compute capabilities for GPUs for more information). Using the maximum version or lower will ensure that code compiled with that version of CUDA will be able to run on the associated GPU by default.
GPU | Compute capability | Minimum CUDA version | Maximum CUDA version* |
---|---|---|---|
k20 | 3.5 | 4.2 | 10.2 |
k80 | 3.7 | 4.2 | 10.2 |
v100 | 7.0 | 9.0 | |
a100 | 8.0 | 11.0 | |
*Support beyond the maximum version
CUDA versions greater than the maximum CUDA version listed above may still work for certain GPUs (for example, CUDA versions up to 11.4 can work with compute capabilities 3.5 and 3.7), but will not compile for them by default. In these situations, you should explicitly specify the compute capability when compiling, as discussed below.
Compiling for maximum compatibility
The minimum version listed in the table above is necessary to support all capabilities of a GPU. However, the drivers for NVIDIA GPUs are generally backwards compatible with earlier versions of PTX code (see below). This means you can use an earlier CUDA version to compile code, and it will be able to run on any newer GPUs. For more information, see Compiling for specific GPUs below.
For a good compromise between features and compatibility with all GPUs at ICER, CUDA 9.x or 10.x will work well.
CUDA code
CUDA Fortran (that is, Fortran code with CUDA-specific extensions) is compiled using the same nvfortran compiler described in the NVHPC section of our compiler reference.

CUDA C/C++ code (that is, C/C++ code with CUDA-specific extensions to run kernels on a GPU device) is compiled using the nvcc compiler. The recommended way to access this compiler is to load the NVHPC module as described in our compiler reference, or a fosscuda or gcccuda toolchain. Alternatively, nvcc is also included in any of the CUDA modules.

Behind the scenes, nvcc uses gcc/g++ to compile the C/C++ code itself. The version used is whatever is on your path. This means that if you load CUDA by itself, you will be using the system version of gcc/g++, which is much older than the versions available in the module system. For this reason, we recommend loading either the NVHPC modules (which will use the version of gcc/g++ given in the module version suffix) or a fosscuda or gcccuda module that will load an appropriate version of CUDA along with the corresponding foss or GCC toolchain.
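For example, with the NVHPC module (a minimal sketch; the module version and file names below are placeholders, so check module spider NVHPC for what is actually installed):

```bash
# Load the NVHPC toolchain (version shown is only an example)
module load NVHPC/21.9

# CUDA C/C++: compile a .cu source file with nvcc
nvcc my_kernel.cu -o my_kernel

# CUDA Fortran: compile a .cuf source file with nvfortran
nvfortran my_kernel.cuf -o my_kernel_f
```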
Compiling for specific GPUs
GPU code is compiled in two stages:
- Compiling into a virtual instruction set like assembly code (called PTX)
- Compiling the virtual instructions into binary code (called a cubin) that actually runs on the GPU
These stages are controlled by the compute capability specified to nvcc (in the previous examples, this is set implicitly to 5.2), and nvcc can embed the results of stage 1, stage 2, or both for various compute capabilities in the final executable. If the stage 2 output is missing for the compute capability of the GPU that the code is executed on, the NVIDIA driver will just-in-time (JIT) compile any stage 1 code it finds at runtime into stage 2 code appropriate for that GPU.
In general, you should use the lowest compute capability your code supports in stage 1 (for the widest compatibility with future JIT compilation) and the compute capability of the target GPU in stage 2 (for the best optimization).
To specify compute capability x.y for stage 1, use the -arch=compute_xy flag, and for stage 2, use the -code=sm_xy flag. You can also specify -code=compute_xy to embed the output of stage 1 into the final binary for JIT compilation. Multiple compute_xy and sm_xy values can be supplied to -code in a comma-separated list.
See NVIDIA's documentation on GPU compilation for more information and examples.
Compiling for k20 and k80 GPUs with CUDA 11.4
As discussed above, CUDA 11.4 will not compile for k20 and k80 GPUs by default. However, we can specify the corresponding compute capabilities explicitly:
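A sketch of such a command (the source file name is a placeholder, and the exact CUDA 11.4 module version should be checked with module spider CUDA):

```bash
# Load CUDA 11.4 (version shown is only an example)
module load CUDA/11.4.2

# Embed compute_35 PTX plus sm_35, sm_37, and sm_70 cubins in the executable
# (nvcc may print a deprecation warning for the older compute capabilities)
nvcc -arch=compute_35 -code=compute_35,sm_35,sm_37,sm_70 my_kernel.cu -o my_kernel
```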
The resulting executable will be able to run on all GPUs at ICER:
- The k20, by compiling compute_35 (and higher)-compatible PTX into an sm_35-compatible cubin.
- The k80, by compiling compute_35 (and higher)-compatible PTX into an sm_37-compatible cubin.
- The v100, by compiling compute_35 (and higher)-compatible PTX into an sm_70-compatible cubin.
- The a100, by JIT compiling the embedded compute_35 (and higher)-compatible PTX at runtime.
CUDA libraries
The NVHPC and CUDA modules offer many CUDA-accelerated math libraries, like cuBLAS, cuSOLVER, and cuFFT.

For C/C++ code, since using many of these libraries does not require writing CUDA code, using the nvcc compiler is optional. We refer to the documentation for the specific libraries for how to link them, but give examples of linking cuBLAS and cuFFT with nvcc and with the GNU compilers directly.
Take care to link to libraries that are all distributed in the same version of CUDA, to use a version of CUDA compatible with the desired GPUs, and (if using shared libraries) to load the same version of CUDA when running the executable.
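For example, with nvcc from the NVHPC module (a sketch; the module version and source file names are placeholders):

```bash
# Load the NVHPC toolchain (version shown is only an example)
module load NVHPC/21.9

# nvcc already knows where the toolkit libraries live, so -lcublas / -lcufft is enough
nvcc my_blas_code.c -o my_blas_code -lcublas
nvcc my_fft_code.c -o my_fft_code -lcufft
```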
Fortran support for linking to the CUDA libraries is limited to NVIDIA's compilers. See NVIDIA's Fortran CUDA interfaces for more information.
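And with the GNU compilers from a fosscuda toolchain (a sketch; the toolchain version and source file names are placeholders, and it assumes the CUDA module sets the usual EasyBuild $EBROOTCUDA variable pointing at the CUDA installation):

```bash
# Load a fosscuda toolchain (version shown is only an example)
module load fosscuda/2020b

# Point gcc at the CUDA headers and libraries explicitly, and link the CUDA runtime as well
gcc my_blas_code.c -o my_blas_code -I$EBROOTCUDA/include -L$EBROOTCUDA/lib64 -lcublas -lcudart
gcc my_fft_code.c -o my_fft_code -I$EBROOTCUDA/include -L$EBROOTCUDA/lib64 -lcufft -lcudart
```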
cuDNN
A popular set of CUDA libraries not included in the CUDA toolkit is cuDNN. On the HPCC, cuDNN is available as a module. Search for the available versions with the Lmod spider command and choose one which uses a version of CUDA compatible with any other GPU-based work you may be doing. See NVIDIA's API reference for which libraries to link against.
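For example, to list the installed cuDNN modules and the CUDA versions they pair with:

```bash
# List the cuDNN modules available on the HPCC
module spider cuDNN
```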
GPU offloading
Parts of code can be offloaded onto the GPUs using directive-based APIs like OpenMP and OpenACC. Currently, the recommended approach is to use OpenACC with the NVIDIA HPC SDK compilers.
OpenACC
Offloading with OpenACC is primarily supported by NVIDIA's compilers in the NVHPC modules. Using the -acc option will activate OpenACC and run kernels on the GPU by default.
The specific compute capabilities of the desired target GPUs can also be passed to compile compatible binaries for the respective GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run (so that the embedded PTX code can be JIT compiled to an appropriate binary).
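For example (a sketch; the module version and file names are placeholders, and -gpu=cc70,cc80 targets the v100 and a100 specifically):

```bash
# Load the NVHPC toolchain (version shown is only an example)
module load NVHPC/21.9

# C with OpenACC directives, offloaded to the GPU; target compute capabilities 7.0 and 8.0
nvc -acc -gpu=cc70,cc80 my_acc_code.c -o my_acc_code

# Fortran with OpenACC directives
nvfortran -acc -gpu=cc70,cc80 my_acc_code.f90 -o my_acc_code_f
```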
The same versions of the GCC modules discussed in the OpenMP section below also support OpenACC, though with the same caveats. Experiment with these versions at your own risk.
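As a sketch with the GNU compilers (the module name below is a placeholder for one of the -cuda or -offload suffixed versions described in the OpenMP section, and the file names are also placeholders):

```bash
# Load an offload-capable GCC module (placeholder name; see the OpenMP section below)
module load GCC/10.3.0-cuda

# C with OpenACC directives, offloaded to NVIDIA GPUs
gcc -fopenacc -foffload=nvptx-none my_acc_code.c -o my_acc_code

# Fortran with OpenACC directives
gfortran -fopenacc -foffload=nvptx-none my_acc_code.f90 -o my_acc_code_f
```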
OpenMP
Offloading with OpenMP is primarily supported by NVIDIA's compilers in the NVHPC modules. Using the -mp=gpu option will set OpenMP code to use a GPU as a target device.
The specific compute capabilities of the desired target GPUs can also be passed to compile compatible binaries for the respective GPUs. GPUs with other compute capabilities will incur a slight one-time cost when the executable is run (so that the embedded PTX code can be JIT compiled to an appropriate binary).
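For example (a sketch; the module version and file names are placeholders, and -gpu=cc70,cc80 targets the v100 and a100 specifically):

```bash
# Load the NVHPC toolchain (version shown is only an example)
module load NVHPC/21.9

# C with OpenMP target directives; -mp=gpu offloads target regions to the GPU
nvc -mp=gpu -gpu=cc70,cc80 my_omp_code.c -o my_omp_code

# Fortran with OpenMP target directives
nvfortran -mp=gpu -gpu=cc70,cc80 my_omp_code.f90 -o my_omp_code_f
```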
A few versions of the GCC modules available on the HPCC have highly experimental support for offloading OpenMP code to GPUs. These versions include a -cuda or -offload suffix in the version name. Use the Lmod spider command to search for versions including these suffixes.
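For example:

```bash
# List the installed GCC modules; offload-capable versions carry a -cuda or -offload suffix
module spider GCC
```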
Support for these compilers is very limited, and simple tests indicate that they can suffer from reduced performance in comparison to NVIDIA's compilers or running multi-threaded (or, in extreme cases, single-threaded) on the CPU. Experiment with these versions at your own risk.
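As a sketch with the GNU compilers (the module name below is a placeholder for one of the -cuda or -offload suffixed versions found above, and the file names are also placeholders):

```bash
# Load an offload-capable GCC module (placeholder name; use a version found with module spider)
module load GCC/10.3.0-cuda

# C with OpenMP target directives, offloaded to NVIDIA GPUs
gcc -fopenmp -foffload=nvptx-none my_omp_code.c -o my_omp_code

# Fortran with OpenMP target directives
gfortran -fopenmp -foffload=nvptx-none my_omp_code.f90 -o my_omp_code_f
```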
Other GPU-specific compilation options can be passed in the quotes following -foffload=nvptx-none=, e.g., -foffload=nvptx-none="-lm -latomic -O3".