
Getting started with Grace Hopper and Grace Grace

ICER has received four NVIDIA Grace Hopper systems purchased by MSU researchers, as well as one NVIDIA Grace Grace CPU system.

All nodes are available in SLURM to the research community under the same buy-in rules as the rest of our buy-in hardware. In particular, users who are not part of a node's buy-in are restricted to submitting jobs shorter than four hours.

Node listing

| Cluster Type | Node Count | Processors | Cores | Memory | Disk Size | GPUs (Number) | Node Name |
| ------------ | ---------- | ---------- | ----- | ------ | --------- | ------------- | --------- |
| Grace Hopper | 3 | Grace CPU (Arm Neoverse v2) | 72 | 480 GB | 1.5 TB | GH200 96 GB (1) | nch-[000-002] |
| Grace Hopper | 1 | Grace CPU (Arm Neoverse v2) | 72 | 480 GB | 3.2 TB | GH200 96 GB (1) | nch-003 |
| Grace Grace | 1 | Grace CPU (Arm Neoverse v2) | 144 | 480 GB | 3.5 TB | None | ncc-000 |

Differences from other nodes

The Grace systems are currently considered "beta" with very minimal installations. Users should be aware:

  1. These are ARM-based (aarch64) systems. Existing code compiled for our x86_64 nodes (including conda environments) will not work on these systems. To check which architecture an executable was compiled for, run file executablename (see the example after this list). If the output mentions x86_64, the executable is not compatible with the Grace systems.
  2. These nodes have slower-than-normal access to home directories and research spaces. Researchers may wish to stage data or code to node-local storage in /tmp (see the sketch after this list).
  3. Software pre-installed by ICER is different from what is available on other nodes. To access the module command and all software currently built for the Grace Hopper systems, run

    source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module use /opt/modules/all
    

    Please note that you may see modules from the main cluster listed as inactive; SLURM copies these references over before the line above deactivates them. These modules are non-functional on the Grace nodes.

  4. Before running code that has been compiled on the Grace nodes, you need to load the Compiled module after all other dependent modules. For example, assuming that your code was compiled using the GCC/12.2.0, Cabana/0.6.1-foss-2022b-CUDA-12.1.1, and CUDA/12.1.1 modules, you would use the following lines to run it:

    module purge
    module load GCC/12.2.0 Cabana/0.6.1-foss-2022b-CUDA-12.1.1 CUDA/12.1.1
    module load Compiled
    
    ./my-compiled-code
    
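As an example of the architecture check in item 1, the file command reports the architecture a binary was built for; my-compiled-code below is only a placeholder name:

# Check which architecture a binary targets (placeholder file name)
file ./my-compiled-code

If the output mentions x86_64 or x86-64, the binary was built for the standard cluster nodes; a binary built on the Grace nodes will report aarch64 instead.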
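
For the staging suggested in item 2, a minimal sketch at the start of a job might look like the following; my-input-data is a placeholder for your own directory:

# Stage input data from the slower home directory to node-local /tmp (placeholder paths)
mkdir -p /tmp/$USER
cp -r $HOME/my-input-data /tmp/$USER/
cd /tmp/$USER

Remember to copy any results back to your home or research space before the job ends, since /tmp is local to the node.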

We will update this page as we address these issues.

Submitting jobs

The primary mechanism to schedule jobs on Grace nodes is to use the SLURM option --constraint=NOAUTO:grace.

Interactive jobs

To start an interactive job, use the following template:

salloc --constraint=NOAUTO:grace --time=3:00:00 --gpus=1 --cpus-per-task=72 --mem=10G

If you would like to avoid using your default shell environment (due to any of the potential incompatibilities noted above), use srun with the --pty option and the command /bin/bash, e.g.,

srun --constraint=NOAUTO:grace --time=3:00:00 --gpus=1 --cpus-per-task=72 --mem=10G --pty /bin/bash

If you have buy-in access to a Grace node, you should additionally add the option --account=<buy-in-name>, as in the sketch below. Using the --gpus=1 option restricts jobs to only running on Grace Hopper nodes; removing this option allows jobs to run on any Grace node.
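
For example, a buy-in interactive job could combine these options as follows; <buy-in-name> is a placeholder for your buy-in account name:

salloc --constraint=NOAUTO:grace --account=<buy-in-name> --time=3:00:00 --gpus=1 --cpus-per-task=72 --mem=10G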

Job script template

grace_hopper_template.sb
#!/bin/bash

#SBATCH --constraint=NOAUTO:grace  # Only run on Grace nodes
#SBATCH --time=3:00:00             # Run for three hours
#SBATCH --gpus=1                   # Request one GPU (restricts to Grace Hopper)
#SBATCH --cpus-per-task=72         # Request all CPUs on a Grace Hopper node 
#SBATCH --mem=10GB                 # Request 10GB of (non-GPU) memory

echo "This script is from ICER's Grace Hopper & Grace Grace how-to"

# Gain access to the module system
source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module use /opt/modules/all

# Load modules
module load CUDA-Samples

# Run code (GPU examples from CUDA-Samples)
matrixMul
matrixMulCUBLAS

# Output debugging information
scontrol show job $SLURM_JOB_ID
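
Assuming the script above is saved as grace_hopper_template.sb, it can be submitted like any other batch job:

sbatch grace_hopper_template.sb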

See also

Users may wish to refer to NVIDIA's documentation: