The HPCC's GPU Resources
The HPCC offers several generations of GPUs as noted in the general Cluster Resources page. More information about these devices is provided in the table below. The cluster types that each GPU is associated with correspond to the cluster types listed in the Cluster Resources table.
GPU | Cluster Type | Number per Node | GPU Memory | Architecture | Compute Capability | Connection Type | NVLink |
---|---|---|---|---|---|---|---|
a100 | amd21* | 4 | 81920 MB | Ampere | 8.0 | SXM | Yes |
a100 | intel21 | 4 | 40960 MB | Ampere | 8.0 | PCIe | No |
v100 | amd20 | 4 | 32768 MB | Volta | 7.0 | PCIe | Mixed |
v100 | intel18 | 8 | 32768 MB | Volta | 7.0 | PCIe | Yes |
k80 | intel16 | 8 | 12206 MB | Kepler | 3.7 | PCIe | No |
k20 | intel14 | 2 | 4743 MB | Kepler | 3.5 | PCIe | No |
*The amd21 cluster contains some nodes that belong to the Data Machine.
Architecture & Compute Capability
Currently, all of the HPCC's GPUs are manufactured by NVIDIA and span several architectures. Knowing a GPU's architecture aids in researching its technical specifications. A GPU's architecture is abbreviated in its name; for example, the V100 GPUs follow the Volta architecture.
Each architecture and model of GPU supports a certain compute capability (CC): a set of features that applications can leverage when executing on that GPU. Newer GPUs offer more advanced features and therefore adhere to a newer version of NVIDIA's compute capabilities. An explanation of the features available at each compute capability can be found both in the CUDA documentation (CC > 5.0 only) and compiled on Wikipedia.
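As a quick way to see these properties on whichever node a job lands on, the short CUDA sketch below queries each visible GPU and prints its name, memory, and compute capability, which should line up with the table above. The file name is only an illustrative choice.

```cuda
// device_query.cu -- print each visible GPU's name, memory, and compute capability.
// Illustrative sketch; values reported should match the table above for the node
// the job was scheduled on.
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, %.0f MB, compute capability %d.%d\n",
               d, prop.name, prop.totalGlobalMem / (1024.0 * 1024.0),
               prop.major, prop.minor);
    }
    return 0;
}
```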
Developers may use the CUDA programming language to leverage our GPUs in their software applications. See our page on Compiling for GPUs for more information on which versions of CUDA may be used for each of the HPCC's GPUs and their respective compute capabilities.
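For a concrete, minimal illustration of targeting a compute capability, the sketch below pairs a trivial CUDA kernel with the kind of `nvcc` architecture flag that selects a CC from the table above. The file name and flag value are examples rather than HPCC-specific requirements; the Compiling for GPUs page remains the authoritative reference for which CUDA versions to load.

```cuda
// minimal_kernel.cu -- a trivial CUDA kernel used to illustrate compiling for a
// specific compute capability. For example, to target the V100s (CC 7.0):
//     nvcc -arch=sm_70 minimal_kernel.cu -o minimal_kernel
// (sm_80 would correspond to the A100s; see the table above. File name and flags
// are illustrative assumptions, not required conventions.)
#include <cstdio>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;   // each thread scales one element
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the example short
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f (expected 2.0)\n", x[0]);
    cudaFree(x);
    return 0;
}
```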
Connection Type
Most of the HPCC's GPUs communicate with the CPUs of their host node via the PCIe (Peripheral Component Interconnect Express) bus. This bus is the primary channel by which data and instructions are transferred to and from the GPU. As such, the speed of this bus can affect the speed of GPU applications where large amounts of data transfer are a concern. In contrast, the A100 GPUs associated with the amd21 cluster are connected using SXM (Server PCI Express Module) sockets, which offer higher connection speeds. Research the specifications of the particular GPU you plan to use to learn more about its bus bandwidth.
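To get a rough sense of what the host-to-GPU connection delivers in practice, the hedged sketch below times one large host-to-device copy with CUDA events and reports an approximate bandwidth. The transfer size is an arbitrary illustrative choice, and the number is only a ballpark figure, not a formal benchmark.

```cuda
// bandwidth_probe.cu -- rough host-to-device bandwidth estimate using CUDA events.
// Illustrative probe only; results vary with pinned vs. pageable memory, transfer
// size, and other activity on the node.
#include <cstdio>

int main() {
    const size_t bytes = 256UL * 1024 * 1024;   // 256 MiB transfer (arbitrary choice)
    float *host, *device;
    cudaMallocHost(&host, bytes);               // pinned host memory for a cleaner measurement
    cudaMalloc(&device, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host-to-device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(device);
    cudaFreeHost(host);
    return 0;
}
```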
NVLink
While PCIe and SXM refer to the connection between the CPU and GPU, some of the HPCC's V100 and A100 GPUs are also connected to each other using NVIDIA's NVLink technology. NVLink allows GPUs to directly share data with each other. Without NVLink, transferring data from one GPU to another would require that the data first pass through the CPU. Using the CPU as a data transfer "middleman" adds to the overall time the transfer takes and may also delay the CPU from communicating additional data and instructions to the GPUs. If you plan to use multiple GPUs for your job, consider requesting resources that support NVLink as indicated in the table above.
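As a sketch of what this direct GPU-to-GPU sharing looks like in code, the example below uses the standard CUDA peer-to-peer API to check whether GPUs 0 and 1 can reach each other's memory directly and, if they can, performs a device-to-device copy without staging through the host. It assumes the job was allocated at least two GPUs; note that peer access may also be reported over PCIe on some configurations.

```cuda
// p2p_check.cu -- check whether GPU 0 and GPU 1 can share data directly
// (e.g. over NVLink) and, if so, copy a buffer device-to-device without
// staging through the host. Assumes at least two visible GPUs in the job.
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { printf("Need at least 2 GPUs.\n"); return 1; }

    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("Peer access 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);

    if (canAccess01 && canAccess10) {
        const size_t bytes = 64UL * 1024 * 1024;
        float *buf0, *buf1;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // allow device 0 to reach device 1's memory
        cudaMalloc(&buf0, bytes);

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        cudaMalloc(&buf1, bytes);

        // Direct GPU-to-GPU copy; on NVLink-connected pairs this avoids the
        // CPU "middleman" path described above.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();
        printf("Peer-to-peer copy completed.\n");

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
    }
    return 0;
}
```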
Some of the amd20 nodes support NVLink while others do not. You can check whether or not a given node supports NVLink by requesting a job on that node and connecting to it. Specific nodes can be requested with the `-w` or `--nodelist` option; see the list of job specifications for more. Then, once connected, run `nvidia-smi nvlink -s` to check the status of the node's NVLink connection.