AMD CPUs

amd20

In 2020, the HPCC purchased the amd20 cluster, powered by AMD EPYC 7H12 processors (part of the 7002 series). Each amd20 node has a total of 128 cores and at least 0.5 TB of RAM. Each core has a base clock speed of 2.6 GHz and can boost up to 3.3 GHz.

Basic architecture information

Each amd20 node contains two sockets with one EPYC 7H12 processor each. Each processor is divided into an I/O die and eight "Core Complex Dies" (CCDs). These dies are visible in the image of a 7002 series processor below.

Photograph of a 7002 series AMD processor.

Each Core Complex Die is in turn made of two "Core Complexes" (CCXs) with four Zen 2 cores each, so a processor has 8 × 2 × 4 = 64 cores and a node has 128. Each four-core Core Complex shares a 16 MB L3 cache. A logical diagram for a full CCD is shown below, with an orange line dividing the two CCXs.

A logical diagram of the 7002 series' Core Complex.

An example of the entire node layout for a 1 TB amd20 node is shown below:

A diagram showing the layout of an entire amd20 node.
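
To check the cache layout described above on a node you have been allocated, one option is to read the Linux sysfs cache entries directly. The sketch below is illustrative only: the function name `count_l3_domains` is made up for this example, and it assumes the standard `/sys/devices/system/cpu` layout is readable on the node.

```shell
#!/bin/bash
# Count the distinct L3 cache domains visible to the operating system.
# Each CPU reports which CPUs it shares each cache with; CPUs in the
# same Core Complex report the same L3 "shared_cpu_list".
count_l3_domains() {
    local dir
    for dir in /sys/devices/system/cpu/cpu[0-9]*/cache/index[0-9]*; do
        # Skip entries that do not exist or cannot be read.
        [ -r "$dir/level" ] || continue
        # Keep only level-3 caches.
        if [ "$(cat "$dir/level")" = "3" ]; then
            cat "$dir/shared_cpu_list"
        fi
    done | sort -u | wc -l | tr -d ' '
}

echo "L3 cache domains visible: $(count_l3_domains)"
```

Each distinct L3 domain should correspond to one Core Complex. Tools such as `lscpu` or `numactl --hardware`, where installed, give a broader view of sockets and NUMA domains.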

Performance advice

Generally, cores that share an L3 cache have the lowest latency between them, followed by cores in the same CCD, then other cores on the same socket/processor. Communication with cores in the other socket is slowest. The SLURM job scheduler on the HPCC will try to keep a job's cores within the same CCD (also called a "NUMA domain") by default. One option to try during testing is to use OpenMP within each L3 cache and MPI for everything else. With newer versions of Open MPI, you can set OMP_NUM_THREADS=4 and use the following arguments to place one 4-thread rank per L3 cache:

mpirun --map-by ppr:1:l3cache:pe=4 --bind-to core
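
As a rough sketch of how these arguments might fit into a single-node launch, the script below derives the rank count from the 128 cores per amd20 node and prints the resulting command instead of running it. The executable name `./my_hybrid_app` is hypothetical, and the `OMP_PLACES` and `OMP_PROC_BIND` settings are an assumption about reasonable thread pinning, not an HPCC recommendation.

```shell
#!/bin/bash
# Values taken from the text above.
cores_per_node=128     # total cores on one amd20 node
threads_per_rank=4     # one MPI rank per 4-core L3 cache

# One rank per L3 cache gives 128 / 4 ranks on a full node.
ranks_per_node=$(( cores_per_node / threads_per_rank ))

# OpenMP settings: one thread per core, threads kept close together.
export OMP_NUM_THREADS=$threads_per_rank
export OMP_PLACES=cores        # assumption: pin each thread to a core
export OMP_PROC_BIND=close     # assumption: keep a rank's threads adjacent

# Build the launch command. --report-bindings asks Open MPI to print
# where each rank was bound, which helps confirm the layout.
# ./my_hybrid_app is a hypothetical executable name.
launch_cmd="mpirun -np $ranks_per_node --map-by ppr:1:l3cache:pe=$threads_per_rank --bind-to core --report-bindings ./my_hybrid_app"

echo "$launch_cmd"
```

Printing the command first makes it easy to review the rank and thread counts before submitting a job that uses a whole node.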

In testing, the Intel compiler and MKL toolchain work well if you export MKL_DEBUG_CPU_TYPE=5 and compile for AVX2 instead of AVX-512.
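
A minimal sketch of those Intel settings follows. It assumes the classic Intel C compiler (`icc`) and uses `-march=core-avx2` as one way to target AVX2; the source file name `my_solver.c` is hypothetical, and MKL link flags are omitted because they vary by compiler version. The command is printed rather than run.

```shell
#!/bin/bash
# Setting named in the text above. Export it in the job script as well,
# on the assumption that MKL reads it when the program runs.
export MKL_DEBUG_CPU_TYPE=5

# Target AVX2 rather than AVX-512.
# Assumption: icc is the compiler in use; my_solver.c is a made-up file name.
compile_cmd="icc -O2 -march=core-avx2 -c my_solver.c -o my_solver.o"

echo "$compile_cmd"
```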

If you're doing single-node scaling work, be aware of memory bandwidth. On these nodes, HPL scales linearly from 1 to 96 cores, but going from 96 to 128 cores (that is, from 3 cores per L3 cache to 4) only raises performance from 3.5 to 4 TFLOPS.
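
The HPL figures above can be turned into per-core numbers with a little arithmetic. The helper names `gflops_per_core` and `percent_gain` are made up for this sketch, which assumes `awk` is available.

```shell
#!/bin/bash
# Per-core throughput in GFLOPS, given total TFLOPS and a core count.
gflops_per_core() {
    awk -v tf="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", tf * 1000 / cores }'
}

# Percentage increase going from an old value to a new value.
percent_gain() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

echo "96 cores:  $(gflops_per_core 3.5 96) GFLOPS per core"
echo "128 cores: $(gflops_per_core 4 128) GFLOPS per core"
echo "Core count increase:  $(percent_gain 96 128)%"
echo "Performance increase: $(percent_gain 3.5 4)%"
```

With the figures quoted above, a third more cores yields roughly 14% more throughput, so per-core performance drops from about 36 to about 31 GFLOPS. For bandwidth-bound codes, leaving one core per L3 cache idle may cost little.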

Each node has a 100 Gb/s HDR100 connection. There are 52 to 56 nodes per switch:

A network diagram of the amd20 nodes showing 4 clusters of nodes.

Resources:

High Performance Computing: Tuning Guide for AMD EPYC™ 7002 Series Processors

Compiler Options Quick Ref Guide for AMD EPYC 7xx2 Series Processors

amd24

The amd24 cluster uses AMD EPYC 9654 and 9684X processors, which are part of the 9004 series. As with amd20, each node has two sockets, each populated by one processor.

These processors use the same "Core Complex Die" (CCD) system as the processors in amd20; however, the 96XX processors have only one Core Complex (CCX) per CCD. Each processor has twelve CCDs, each with eight Zen 4 cores, for 96 cores per processor. A logical diagram of the CCD is shown below:

A logical diagram of a 96XX processor core complex

A schematic of the whole 9004 series processor is below:

A schematic of a 9004 series AMD processor

As with the amd20 CPUs, latency between cores is lowest within a CCD. When using both MPI and OpenMP, threads can again be grouped by L3 cache, this time with OMP_NUM_THREADS=8:

mpirun --map-by ppr:1:l3cache:pe=8 --bind-to core
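
The rank and thread counts for a full amd24 node follow from the topology described above. The sketch below works them out and prints a candidate launch line rather than running it; the executable name `./my_hybrid_app` is hypothetical.

```shell
#!/bin/bash
# Topology values from the text above.
sockets_per_node=2
ccds_per_socket=12     # one CCX, and so one L3 cache, per CCD
cores_per_ccd=8

# Total cores, and one MPI rank per L3 cache.
cores_per_node=$(( sockets_per_node * ccds_per_socket * cores_per_ccd ))
ranks_per_node=$(( sockets_per_node * ccds_per_socket ))

# One OpenMP thread per core in each CCD.
export OMP_NUM_THREADS=$cores_per_ccd

echo "cores per node: $cores_per_node"
echo "ranks per node: $ranks_per_node"
# ./my_hybrid_app is a made-up executable name.
echo "mpirun -np $ranks_per_node --map-by ppr:1:l3cache:pe=$OMP_NUM_THREADS --bind-to core ./my_hybrid_app"
```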

For more on the architecture of the 9004 series, see this documentation from AMD.