Note
This page does not apply to nodes with Hopper (H200) GPUs, such as dev-amd24-h200 and the nfh and neh compute nodes.
Getting started with Grace Hopper and Grace Grace
ICER has received four NVIDIA Grace Hopper systems purchased by MSU researchers, as well as one NVIDIA Grace Grace CPU system.
All nodes are available through SLURM to the research community under the same buy-in rules as the rest of our buy-in hardware. In particular, users who are not part of a node's buy-in are restricted to submitting jobs shorter than four hours.
Node listing
Cluster Type | Node Count | Processors | Cores | Memory | Disk Size | GPUs (Number) | Node Name |
---|---|---|---|---|---|---|---|
Grace Hopper | 3 | Grace CPU (Arm Neoverse v2) | 72 | 480 GB | 1.5 TB | GH200 96 GB (1) | nch-[000-002] |
Grace Hopper | 1 | Grace CPU (Arm Neoverse v2) | 72 | 480 GB | 3.2 TB | GH200 96 GB (1) | nch-003 |
Grace Grace | 1 | Grace CPU (Arm Neoverse v2) | 144 | 480 GB | 3.5 TB | None | ncc-000 |
Differences from other nodes
The Grace systems are currently considered "beta" with very minimal software installed. Users should be aware of the following:
- These are ARM-based (`aarch64`) systems. Existing code compiled for our `x86_64` nodes (including conda environments) will not work on these systems. To check which architecture an executable was compiled for, use `file executablename`; if the output mentions `x86_64`, the executable is not compatible with the Grace Hopper systems (see the sketch after this list).
- These nodes have slower-than-normal access to home directories and research spaces. Researchers may wish to stage data or code onto the node in `/tmp` (see the staging sketch after the job script template below).
- Software pre-installed by ICER is different from what is available on other nodes. To access the `module` command and all software currently built for the Grace Hopper systems, run `source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module use /opt/modules/all`. Please note that you may see references to modules on the main cluster being inactive, as SLURM copies these references over before the line above deactivates them. These modules are non-functional on the Grace nodes.
- Before running code that has been compiled on the Grace nodes, you need to load the `Compiled` module after all other dependent modules. For example, assuming that a code is compiled using the `GCC/12.2.0`, `Cabana/0.6.1-foss-2022b-CUDA-12.1.1`, and `CUDA/12.1.1` modules, you should use the following lines to run your code:

```bash
module purge
module load GCC/12.2.0 Cabana/0.6.1-foss-2022b-CUDA-12.1.1 CUDA/12.1.1
module load Compiled
./my-compiled-code
```
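As a quick check for the first point in the list above, the following sketch shows how `file` output distinguishes the two architectures; the binary name `my-compiled-code` is only a placeholder.

```bash
# Check which architecture an executable was built for (binary name is a placeholder)
file my-compiled-code
# Output mentioning "x86-64"      -> built for the main cluster; will not run on the Grace nodes
# Output mentioning "ARM aarch64" -> built on/for the Grace nodes; compatible
```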
We will update this page as we address these issues.
Submitting jobs
The primary mechanism to schedule jobs on Grace nodes is to use the SLURM option `--constraint=NOAUTO:grace`.
Interactive jobs
To start an interactive job, use this template:

```bash
salloc --constraint=NOAUTO:grace --time=3:00:00 --gpus=1 --cpus-per-task=72 --mem=10G
```
If you would like to avoid using your default shell environment (due to the potential incompatibilities described above), use `srun` with the option `--pty` and the command `/bin/bash`, e.g.,

```bash
srun --constraint=NOAUTO:grace --time=3:00:00 --gpus=1 --cpus-per-task=72 --mem=10G --pty /bin/bash
```
If you have buy-in access to a Grace node, you should additionally add the option `--account=<buy-in-name>`. Using the `--gpus=1` option restricts jobs to running only on the Grace Hopper nodes; removing this option allows jobs to run on any Grace node.
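For example, a buy-in user who wants a CPU-only session that is free to land on the Grace Grace node could combine these options as follows; the account name `grace-buyin` below is a placeholder for your actual buy-in account.

```bash
# CPU-only interactive shell on any Grace node; --gpus is intentionally omitted
salloc --constraint=NOAUTO:grace --account=grace-buyin --time=3:00:00 --cpus-per-task=72 --mem=10G
```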
Job script template
```bash
#!/bin/bash
#SBATCH --constraint=NOAUTO:grace # Only run on Grace nodes
#SBATCH --time=3:00:00 # Run for three hours
#SBATCH --gpus=1 # Request one GPU (restricts to Grace Hopper)
#SBATCH --cpus-per-task=72 # Request all CPUs on a Grace Hopper node
#SBATCH --mem=10GB # Request 10GB of (non-GPU) memory
echo "This script is from ICER's Grace Hopper & Grace Grace how-to"
# Gain access to the module system
source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module use /opt/modules/all
# Load modules
module load CUDA-Samples
# Run code (GPU examples from CUDA-Samples)
matrixMul
matrixMulCUBLAS
# Output debugging information
scontrol show job $SLURM_JOB_ID
```
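Assuming the template above is saved as `grace_job.sb` (the filename is arbitrary), it is submitted like any other batch job:

```bash
sbatch grace_job.sb   # submit the job script
squeue -u $USER       # check its status in the queue
```

If, as noted in the list above, you want to stage data to node-local `/tmp` before running your code, a minimal sketch (with placeholder paths and binary name) added to the job script might look like:

```bash
# Copy inputs from (slower) research space to node-local /tmp; paths are placeholders
cp -r /mnt/research/mygroup/inputs /tmp/my-inputs
./my-compiled-code /tmp/my-inputs
```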
See also
Users may wish to refer to NVIDIA's documentation: