SLURM resource request guide
This guide will help you identify ways to improve your SLURM job submission scripts so that they request the appropriate resources for your jobs. Requesting appropriate resources helps your jobs queue efficiently.
Note
Walltime is the elapsed time measured by a "clock on the wall", i.e. it is the time taken for your code to run, and the time you request in a SLURM script. CPU time is the total time used across all CPUs, so if 5 CPUs are busy for the entire walltime, the CPU time will be 5 times the walltime.
Measure your resource requirements
The first step toward requesting the correct resources is to understand your code's resource usage. There are many ways to do this, so this guide will not be exhaustive.
Code documentation
If your code has documentation, this should be the first place to look. The documentation or an accompanying journal article may provide guidelines for resource requirements such as memory usage, walltime, and number of CPU cores. If the code is actively maintained or you have a support agreement with the software producer, you can also contact the developers to ask about their experience with HPCC resource requirements.
Local testing
If your code can be run on a local computer such as your laptop, you can easily estimate the required walltime by timing a run of the code. You may also be able to estimate the required CPU cores and memory using your computer's resource monitor (Task Manager on Windows, Activity Monitor on Mac). For timing on Linux systems, you can use the time command. It is run as time <your process name> and returns three time measurements: real, user, and sys. real is the equivalent of walltime, while user and sys are CPU time measurements. See this Wikipedia article for more information: https://en.wikipedia.org/wiki/Time_%28Unix%29.
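As a rough sketch, timing a hypothetical script (the script name and the durations shown here are placeholders) looks like this:

    $ time ./my_analysis.sh
    real    0m12.345s
    user    0m44.210s
    sys     0m0.312s

Here real is roughly the walltime you would request, while user plus sys gives the total CPU time; a user value much larger than real indicates that the code is already using multiple CPU cores.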
Development node testing
Note
Development nodes have a maximum CPU time of 2 hours per process. They are also shared with other users, which can affect your CPU and GPU usage estimates.
Our development nodes are a potentially useful place to investigate your code's resource requirements for short jobs (< 2 CPU hours). Note that each additional CPU you use reduces your total allowed process time; for example, a process using 4 CPUs can run for at most about 30 minutes before reaching the 2 CPU-hour limit. Testing is best done when the dev node reports low usage.
You can use the Linux tool top to measure memory and CPU usage. Some development nodes (dev-intel16-k80 and dev-amd20-v100) have access to GPUs to help you determine GPU resource requirements.
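For example, the following commands can be used to check your usage (press q to exit top; nvidia-smi is assumed to be available on the GPU-equipped development nodes):

    # Show only your own processes; the %CPU and RES columns give CPU and memory usage
    top -u $USER

    # On the GPU development nodes, report current GPU utilization and GPU memory usage
    nvidia-smi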
As mentioned in Local testing, you can use the time command to measure walltime and CPU time, though the results may be inaccurate if other users are making heavy use of the node.
Basic SLURM run
Ideally, you will have estimated your resource requirements using documentation or a local computer before this step, and you can use those estimates for your SLURM run. If not, you will need to use a permissive resource request with a large amount of memory and walltime so that you can measure your code's needs. Expect the queue time for such a test job to be long; for faster queuing of test jobs, request a walltime of less than 4 hours.
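A minimal sketch of a permissive test script is shown below; the resource values, job name, and command are placeholders to adapt to your code, and some clusters may also require a partition or account option:

    #!/bin/bash
    #SBATCH --job-name=resource_test   # placeholder job name
    #SBATCH --time=03:59:00            # under 4 hours for faster queuing
    #SBATCH --cpus-per-task=4          # generous CPU estimate
    #SBATCH --mem=16G                  # generous memory estimate
    #SBATCH --output=%x-%j.out         # log file named after the job name and job ID

    # Time the code itself so the walltime appears in the log file
    time ./my_analysis.sh              # placeholder command

Submit the script with sbatch, e.g. sbatch resource_test.sh.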
After the run completes (or all the walltime is used), you can determine the approximate resource requirements of your code by inspecting the amount of time taken. For more in-depth statistics, see seff and reportseff below.
For jobs that you expect to take longer than 4 hours, you will need to understand your code's scaling: how its run time changes as more CPUs or GPUs are used to run it. To measure scaling, run your code a few times, increasing the resource request each time, and measure how long each run takes. You can then fit a simple linear or exponential function to these points to approximate intermediate requests or extrapolate to larger resource requests.
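One way to collect these scaling measurements is to submit the same script several times while overriding the CPU request on the command line (options passed to sbatch take precedence over the #SBATCH directives in the script); the script name here is a placeholder:

    # Submit the same job with 1, 2, 4, and 8 CPUs to measure scaling
    for n in 1 2 4 8; do
        sbatch --cpus-per-task=$n --job-name=scaling_$n resource_test.sh
    done

The elapsed time of each run can then be compared (for example with seff, described below) to fit your scaling function.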
seff and reportseff
seff and reportseff are useful tools for investigating your resource request efficiency. They will provide statistics for individual jobs (seff <job id>) or a report of multiple jobs (reportseff -u <user name>). seff statistics list the used and requested resources as well as a percentage efficiency. reportseff statistics include:
- the time efficiency of the job (TimeEff), which is the percentage use of the requested walltime;
- the CPU efficiency of the job (CPUEff), which is the percentage use of the requested CPU cores;
- the memory efficiency of the job (MemEff), which is the percentage use of the requested memory.
You can use these tools to get a quick measurement of your resource request usage.
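For example (the job ID is a placeholder; the exact fields printed depend on the versions installed on the cluster):

    # Efficiency summary for a single completed job
    seff 12345678

    # Efficiency report covering your recent jobs
    reportseff -u $USER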
Note
reportseff can sometimes generate malformed output if the number of lines overflows the terminal window; in such cases, pipe the output of reportseff through more: reportseff <options> | more.
The reportseff developers can be reached at https://github.com/troycomi/reportseff.