Getting started with Grace Hopper and Grace Grace
ICER has received four NVIDIA Grace Hopper systems purchased by MSU researchers, as well as one NVIDIA Grace Grace CPU system.
All nodes are available in SLURM to the research community under the same buy-in rules as the rest of our buy-in hardware. In particular, users who are not part of a node's buy-in are restricted to submitting jobs shorter than four hours.
Node listing
Cluster Type | Node Count | Processors | Cores | Memory | Disk Size | GPUs (Number) | Node Name |
---|---|---|---|---|---|---|---|
Grace Hopper | 3 | Grace CPU (Arm Neoverse v2) | 72 | 480 GB | 1.5 TB | GH200 96 GB (1) | nch-[000-002] |
Grace Hopper | 1 | Grace CPU (Arm Neoverse v2) | 72 | 480 GB | 3.2 TB | GH200 96 GB (1) | nch-003 |
Grace Grace | 1 | Grace CPU (Arm Neoverse v2) | 144 | 480 GB | 3.5 TB | None | ncc-000 |
Differences from other nodes
The Grace systems are currently considered "beta" with very minimal installations. Users should be aware:
- These are ARM-based (`aarch64`) systems. Existing code compiled for our `x86_64` nodes (including conda environments) will not work on this system. To see which architecture an executable is compiled for, use `file executablename`. If the output mentions `x86_64`, it is not compatible with the Grace Hopper systems (see the example after this list).
- These nodes have slower-than-normal home directory and research space access. Researchers may wish to stage data or code to the node in `/tmp`.
- Software pre-installed by ICER is different than what is available on other nodes. To access the `module` command and all software currently built for the Grace Hopper systems, run

    ```bash
    source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module use /opt/modules/all
    ```

    Please note that you may see references to modules on the main cluster being inactive, as SLURM copies these references over before the line above deactivates them. These modules are non-functional on the Grace nodes.

- Before running code that has been compiled on the Grace nodes, you need to load the `Compiled` module after all other dependent modules. For example, assuming that a code is compiled using the `GCC/12.2.0`, `Cabana/0.6.1-foss-2022b-CUDA-12.1.1`, and `CUDA/12.1.1` modules, you should use the following lines to run your code:

    ```bash
    module purge
    module load GCC/12.2.0 Cabana/0.6.1-foss-2022b-CUDA-12.1.1 CUDA/12.1.1
    module load Compiled
    ./my-compiled-code
    ```
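For example, checking an existing binary could look like the following sketch (the path is a placeholder):

```bash
# Report the architecture a binary was built for
file ~/software/bin/my_executable
# Output mentioning "x86-64" indicates the binary will not run on the Grace nodes;
# a binary built on these nodes will typically mention "ARM aarch64".
```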
We will update this page as we address these issues.
Submitting jobs
The primary mechanism to schedule jobs on Grace nodes is to use the SLURM option `--constraint NOAUTO:grace`.
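For example, a batch script can be directed to the Grace nodes from the command line (the script name is a placeholder):

```bash
sbatch --constraint NOAUTO:grace my_job.sb
```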
Interactive jobs
To start an interactive job, use a template like the one below.
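One possible sketch uses `salloc`; the choice of `salloc` and the specific resource and walltime requests here are assumptions, so adjust them to your needs:

```bash
salloc --constraint NOAUTO:grace --gpus=1 --cpus-per-task=4 --mem=16G --time=01:00:00
```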
If you would like to avoid using your default shell environment (due to any of the potential incompatibilities, see above), you should use `srun` with the option `--pty` and the command `/bin/bash`, e.g.,
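(The walltime and resource requests in this sketch are assumptions; adjust them to your needs.)

```bash
srun --constraint NOAUTO:grace --gpus=1 --time=01:00:00 --pty /bin/bash
```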
If you have buy-in access to a Grace node, you should additionally add the option `--account=<buy-in-name>`. Using the `--gpus=1` option restricts jobs to running only on Grace Hopper nodes; removing this option allows jobs to run on any Grace node.
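For example, a buy-in user could request an interactive session on any Grace node by including their account and omitting the GPU request (the account name and walltime are placeholders):

```bash
srun --constraint NOAUTO:grace --account=<buy-in-name> --time=02:00:00 --pty /bin/bash
```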
Job script template
`grace_hopper_template.sb`
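A minimal sketch, assuming the options and modules described above (the resource requests, module versions, and executable name are placeholders; adjust them to your job):

```bash
#!/bin/bash
#SBATCH --job-name=grace_hopper_test
#SBATCH --constraint=NOAUTO:grace
#SBATCH --gpus=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00

# Set up the Grace software stack (EESSI plus ICER's Grace-specific modules)
source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module use /opt/modules/all

# Load the modules the code was built with, then the Compiled module last
module purge
module load GCC/12.2.0 Cabana/0.6.1-foss-2022b-CUDA-12.1.1 CUDA/12.1.1
module load Compiled

# Run the compiled executable
./my-compiled-code
```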
See also
Users may wish to refer to NVIDIA's documentation: