AMD CPUs
amd20
In 2020, the HPCC purchased the amd20 cluster, powered by AMD EPYC 7H12 processors (part of the 7002 series). Each amd20 node has a total of 128 cores and at least 0.5 TB of RAM. Each core has a base clock speed of 2.6 GHz and can boost up to 3.3 GHz.
Basic architecture information
Each amd20 node contains two sockets with one EPYC 7H12 processor each. These processors are divided into an I/O die and eight "Core Complex Dies" (CCDs). These dies are visible in the image of a 7002 series processor below.
Each Core Complex Die is in turn made of two "Core Complexes" (CCXs) with four Zen 2 cores each. Each four-core Core Complex shares a 16 MB L3 cache. A logical diagram for a full CCD is shown below, with an orange line dividing the two CCXs.
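As a quick sanity check on the hierarchy described above, the per-node totals can be derived by multiplying out the levels. The sketch below is illustrative only: the `NodeTopology` class and the `AMD20` constant are hypothetical names, and the count of eight CCDs per processor is an assumption inferred from the stated 128 cores per node rather than a value read from hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTopology:
    """Core hierarchy of one node; each count is per parent unit."""
    sockets: int
    ccds_per_socket: int
    ccxs_per_ccd: int
    cores_per_ccx: int
    l3_mb_per_ccx: int

    @property
    def l3_domains(self) -> int:
        # One L3 cache per CCX, so L3 domains equal the total CCX count.
        return self.sockets * self.ccds_per_socket * self.ccxs_per_ccd

    @property
    def cores(self) -> int:
        return self.l3_domains * self.cores_per_ccx

    @property
    def l3_total_mb(self) -> int:
        return self.l3_domains * self.l3_mb_per_ccx


# Assumption: eight CCDs per EPYC 7H12, inferred from 128 cores per node.
AMD20 = NodeTopology(sockets=2, ccds_per_socket=8, ccxs_per_ccd=2,
                     cores_per_ccx=4, l3_mb_per_ccx=16)

if __name__ == "__main__":
    print(f"cores per node:      {AMD20.cores}")
    print(f"L3 domains per node: {AMD20.l3_domains}")
    print(f"total L3 (MB):       {AMD20.l3_total_mb}")
```

On a real node, tools such as `lscpu` or hwloc's `lstopo` report the actual layout and should be preferred over hand-computed values.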
An example of the entire node layout for a 1 TB amd20 node is shown below:
Performance advice
Generally, cores that share an L3 cache have the lowest latency between them, followed by cores in the same CCD, then other cores on the same socket/processor. Communication with cores in the other socket is slowest. The SLURM job scheduler on the HPCC will try to keep a job's cores within the same CCD (also called "NUMA domain") by default.
One option to try during testing is to use OpenMP within each L3 cache domain and MPI for everything else. With newer versions of OpenMPI, you can pair a process-mapping argument with OMP_NUM_THREADS=4 to distribute one 4-thread rank per L3.
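One way such a launch line might be assembled is sketched below. This is an assumption rather than the exact argument intended here: it uses Open MPI's `--map-by ppr:1:l3cache:pe=N` mapping syntax and the `-x` option for exporting an environment variable, and `./my_app` is a hypothetical executable name. Check `mpirun --help` for the Open MPI version you have loaded before relying on it.

```python
import shlex


def hybrid_mpirun_command(executable: str, threads_per_rank: int = 4) -> str:
    """Build an mpirun command that places one rank on each L3 cache.

    Assumed (not taken from the cluster documentation): Open MPI's
    "--map-by ppr:1:l3cache:pe=N" mapping and "-x VAR=value" export.
    """
    if threads_per_rank < 1:
        raise ValueError("threads_per_rank must be positive")
    args = [
        "mpirun",
        # One process per L3 cache, each bound to N processing elements.
        "--map-by", f"ppr:1:l3cache:pe={threads_per_rank}",
        # Give every rank as many OpenMP threads as it has cores.
        "-x", f"OMP_NUM_THREADS={threads_per_rank}",
        executable,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # "./my_app" is a placeholder for your own MPI+OpenMP binary.
    print(hybrid_mpirun_command("./my_app"))
```

Keeping the thread count and the `pe=` value tied to a single parameter avoids the common mistake of binding a rank to fewer cores than it has threads.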
In testing, the Intel Compiler and MKL toolchain works well if you export MKL_DEBUG_CPU_TYPE=5 and compile for AVX2 instead of AVX512.
If you're doing single-node scaling work, be aware of memory bandwidth. On these nodes, HPL scales linearly from 1 to 96 cores, but performance rises only from 3.5 to 4 TFLOPS between 96 and 128 cores, that is, when going from 3 cores per L3 to 4.
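To put those HPL figures in perspective, the short sketch below compares the measured 128-core result with what perfect linear scaling from 96 cores would give. The function names are hypothetical; the inputs are the numbers quoted above.

```python
def scaling_efficiency(cores_before: int, perf_before: float,
                       cores_after: int, perf_after: float) -> float:
    """Fraction of ideal (linear) scaling achieved between two core counts."""
    ideal_after = perf_before * cores_after / cores_before
    return perf_after / ideal_after


def marginal_per_core_ratio(cores_before: int, perf_before: float,
                            cores_after: int, perf_after: float) -> float:
    """Per-core throughput of the added cores relative to the original ones."""
    added = (perf_after - perf_before) / (cores_after - cores_before)
    baseline = perf_before / cores_before
    return added / baseline


if __name__ == "__main__":
    # HPL on one amd20 node: 3.5 TFLOPS at 96 cores, 4.0 TFLOPS at 128 cores.
    eff = scaling_efficiency(96, 3.5, 128, 4.0)
    ratio = marginal_per_core_ratio(96, 3.5, 128, 4.0)
    print(f"efficiency vs. linear scaling: {eff:.1%}")
    print(f"added cores deliver {ratio:.1%} of baseline per-core throughput")
```

With these inputs the efficiency works out to 6/7 (about 86%), and each of the last 32 cores contributes roughly 43% of what each of the first 96 does, which is consistent with a memory bandwidth limit once all four cores per L3 are busy.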
Each node has a 100 gigabit HDR100 connection. There are 52-56 nodes per switch:
Resources:
High Performance Computing: Tuning Guide for AMD EPYC™ 7002 Series Processors
Compiler Options Quick Ref Guide for AMD EPYC 7xx2 Series Processors
amd24
The amd24 cluster uses AMD EPYC 9654 and 9684X processors, which are part of the 9004 series.
As with amd20, each node has two sockets, each populated by one processor.
These processors use the same "Core Complex Die" (CCD) system as the processors used in amd20; however, the 96XX processors have only one Core Complex (CCX) per CCD. Each processor has twelve CCDs, each with eight Zen 4 cores. A logical diagram of the CCD is shown below:
A schematic of the whole 9004 series processor is below:
As with the amd20 CPUs, latency between cores will be lowest within a CCD. If using both MPI and OpenMP, we can again group threads based on L3 cache by mapping one MPI rank to each L3.
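For planning a hybrid job on these nodes, the rank and thread counts follow directly from the layout described above. The sketch below is a minimal illustration with a hypothetical `hybrid_layout` helper; it assumes one CCX, and therefore one L3 cache, per CCD, and that a job uses every core on each node.

```python
def hybrid_layout(nodes: int, sockets: int, ccds_per_socket: int,
                  cores_per_ccd: int) -> dict:
    """Return rank and thread counts for one MPI rank per L3 cache.

    Assumes one CCX (and so one L3 cache) per CCD, as on the 9004 series.
    """
    ranks_per_node = sockets * ccds_per_socket
    return {
        "ranks_per_node": ranks_per_node,
        "threads_per_rank": cores_per_ccd,
        "total_ranks": nodes * ranks_per_node,
        "cores_per_node": ranks_per_node * cores_per_ccd,
    }


if __name__ == "__main__":
    # amd24 node: 2 sockets x 12 CCDs x 8 Zen 4 cores.
    layout = hybrid_layout(nodes=2, sockets=2, ccds_per_socket=12,
                           cores_per_ccd=8)
    for key, value in layout.items():
        print(f"{key}: {value}")
```

Under these assumptions a full amd24 node runs 24 ranks of 8 threads each, so OMP_NUM_THREADS would be 8 rather than the 4 used on amd20.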
For more on the architecture of the 9004 series, see this documentation from AMD.