
Using different cluster architectures

The HPCC is composed of many clusters (e.g., intel18 or amd20), each made up of nodes with similar CPU types. However, CPU types may differ between clusters and have different capabilities. These different types of CPUs are generally referred to as different (micro)architectures.

Mismatching code and CPU architectures will generally lead to "Illegal instruction" errors. This page explains the cause, how the module system is set up to manage this, and what steps you should take to avoid problems on the HPCC.

CPU features and incompatibilities

Before code can run on a CPU, it must be compiled. The software that does this (the compiler) will generally optimize the code for the CPU it is compiling for. Depending on the CPU architecture, the compiler may use special features available on that CPU to do so.
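
As a hypothetical illustration (the file and program names here are placeholders), compare a generic GCC build with one targeted at the CPU of the machine doing the compiling:

gcc -O2 -o myprog myprog.c                  # generic build: runs on any node
gcc -O2 -march=native -o myprog myprog.c    # native build: may use CPU-specific instructions (e.g., AVX-512) available only on this node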

The best example of this is the AVX-512 instruction set, which extends some CPUs with the ability to operate on many elements of large vectors simultaneously. On the HPCC, only the intel18 and amd24 clusters have this capability.
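
You can check whether the node you are currently on supports AVX-512 by inspecting its CPU flags:

grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u    # prints nothing if AVX-512 is unsupported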

When code compiled with AVX-512 instructions is run on a CPU without AVX-512 support, it will fail with an "Illegal instruction" error, and you will likely find a "core dump" in your working directory (a large file containing the entirety of the program's state in memory, which can be used for debugging).
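
For example, running a program built with AVX-512 instructions on a node without them typically fails with a message like the following (myprog is a placeholder):

$ ./myprog
Illegal instruction (core dumped)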

Cluster type and the module system

The HPCC provides many different types of software through the module system. Most of this software is compiled "generically," meaning it can run on any cluster type (though it is not optimized). Some software is compiled specifically for particular cluster types for compatibility or optimization reasons.

You can see which software is compiled for which cluster architectures in our available software table. If a version optimized for the cluster architecture you are using is available, the module system will load it automatically.
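
For example, assuming the standard module commands (MySoftware is a placeholder name), you can check what is visible from the node you are on:

module avail MySoftware     # versions visible on the current node's architecture
module spider MySoftware    # all versions across architectures, if your module system supports it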

Installing software yourself

If you install or compile software yourself (note that this often includes installing R or Python packages), you need to pay attention to the node you install the software on versus the node you run the code on. A mismatch can cause the illegal instruction errors explained above.
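
For example (the package and options shown are illustrative), you can confirm which node you are building on before installing a Python package from source:

hostname                                      # e.g., dev-intel18: the compiled code will be built on this node's CPU
pip install --user --no-binary :all: numpy    # force a source build rather than a prebuilt binary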

Example

You install your software on dev-intel18. You submit your job, and it runs on an amd20 node. You could receive an illegal instruction error, especially if the code uses AVX-512 (which may not be obvious to you as the end user).

For this reason, we recommend that you either install your software generically or optimize it for a single cluster architecture, with certain caveats.

Installing software generically

To ensure compatibility across the cluster, install your code on a node whose CPU has the most broadly compatible feature set (i.e., the oldest architecture). As of this writing, use intel16 nodes (e.g., the dev-intel16 development node) to ensure the most compatibility.
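
For instance, to build on the most broadly compatible node (the package name is a placeholder):

ssh dev-intel16                 # log in to the most broadly compatible development node
pip install --user mypackage    # build and install here so the result runs on every cluster type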

Installing software optimized to a single cluster architecture

If you install your software on any other cluster type, it is not guaranteed to work on nodes of other types. For this reason, you should constrain any jobs using that software to that cluster type.

Example

You install your software on dev-intel18. You should add the line

#SBATCH --constraint=intel18

to any SLURM scripts using that code. Alternatively, when starting an app in OnDemand, restrict the code to run on the intel18 node type under the "Advanced Options" section.
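
A minimal sketch of such a script (the job name, time limit, and program are placeholders):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=00:10:00
#SBATCH --constraint=intel18    # only run on intel18 nodes, matching where the code was installed
srun ./myprog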

You may also consider installing multiple versions of your code for different cluster types, as ICER does with the module system. This can increase the number of nodes available to your jobs and reduce your wait time in the queue. However, it requires extra logic in your scripts that depends on how you installed the software, and this may be difficult to maintain.
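
One possible approach (a minimal sketch; the directory layout and program name are hypothetical) is to select a build at runtime based on the CPU features of the node the job landed on:

# Pick the build that matches this node's CPU capabilities
if grep -q avx512 /proc/cpuinfo; then
    BIN="$HOME/software/avx512/bin/myprog"
else
    BIN="$HOME/software/generic/bin/myprog"
fi
"$BIN"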