SLURM resource request guide
This guide will help you identify ways to improve your SLURM job submission scripts so that they request the appropriate resources for your jobs. Requesting appropriate resources helps your jobs queue efficiently.
Note
Walltime is the elapsed time measured by a "clock on the wall", i.e. it is the time taken for your code to run, and the time you request in a SLURM script. CPU time is the total time used across all CPUs, so if 5 CPUs are busy for the entire walltime, the CPU time will be 5 times the walltime.
Measure your resource requirements
The first step toward requesting the correct resources is to understand your code's resource usage. There are many ways to do this, so this guide will not be exhaustive.
Code documentation
If your code has documentation, this should be the first place to look. The documentation or an accompanying journal article may provide guidelines for resource requirements such as memory usage, walltime, and number of CPU cores. If the code is actively maintained or you have a support agreement with the software producer, you can also contact the developers to ask about their experience with HPCC resource requirements.
Local testing
If your code can be run on a local computer such as your laptop, you can easily estimate the required walltime by timing a run of the code. You may also be able to estimate the required CPU cores and memory using your computer's resource monitor (Task Manager on Windows, Activity Monitor on Mac). For timing on Linux systems, you can use the time command. It is run as time <your process name> and returns three time measurements: real, user, and sys. real is the equivalent of walltime, while user and sys are CPU time measurements. See this Wikipedia article for more information: https://en.wikipedia.org/wiki/Time_%28Unix%29.
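As a rough sketch, timing a hypothetical script (the script name and the durations shown here are placeholders) looks like this:

    $ time ./my_analysis.sh
    real    0m12.345s
    user    0m44.210s
    sys     0m0.312s

Here real is roughly the walltime you would request, while user plus sys gives the total CPU time; a user value much larger than real indicates that the code is already using multiple CPU cores.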
Development node testing
Note
Development nodes have a maximum CPU time of 2 hours per process. They are also shared with other users, which can affect your CPU and GPU usage estimates.
Our development nodes are a potentially useful place to investigate your code's resource requirements for short jobs (< 2 CPU hours). Note that each additional CPU you use reduces your total allowed process time; for example, a process using 4 CPUs can run for at most about 30 minutes before reaching the 2 CPU-hour limit. Testing is best done when the dev node reports low usage.
You can use the Linux tool top to measure memory and CPU usage. Some development nodes (dev-intel16-k80 and dev-amd20-v100) have access to GPUs to help you determine GPU resource requirements.
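For example, the following commands can be used to check your usage (press q to exit top; nvidia-smi is assumed to be available on the GPU-equipped development nodes):

    # Show only your own processes; the %CPU and RES columns give CPU and memory usage
    top -u $USER

    # On the GPU development nodes, report current GPU utilization and GPU memory usage
    nvidia-smi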
As mentioned in Local testing, you can use the time command to measure walltime and CPU time, though the results may be inaccurate if other users are making heavy use of the node.
Basic SLURM run
Ideally, you will have estimated your resource requirements using documentation or a local computer before this step, and you can use those estimates for your SLURM run. If not, you will need to use a permissive resource request with a large amount of memory and walltime so that you can measure your code's needs. Expect the queue time for such a test job to be long; for faster queuing of test jobs, request a walltime of less than 4 hours.
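A minimal sketch of a permissive test script is shown below; the resource values, job name, and command are placeholders to adapt to your code, and some clusters may also require a partition or account option:

    #!/bin/bash
    #SBATCH --job-name=resource_test   # placeholder job name
    #SBATCH --time=03:59:00            # under 4 hours for faster queuing
    #SBATCH --cpus-per-task=4          # generous CPU estimate
    #SBATCH --mem=16G                  # generous memory estimate
    #SBATCH --output=%x-%j.out         # log file named after the job name and job ID

    # Time the code itself so the walltime appears in the log file
    time ./my_analysis.sh              # placeholder command

Submit the script with sbatch, e.g. sbatch resource_test.sh.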
After the run completes (or all the walltime is used), you can determine the approximate resource requirements of your code by inspecting the amount of time taken. For more in-depth statistics, see seff and reportseff below.
For jobs that you expect to take longer than 4 hours, you will need to understand your code's scaling: how its run time changes as more CPUs or GPUs are used to run it. To measure scaling, run your code a few times, increasing the resource request each time, and measure how long each run takes. You can then fit a simple linear or exponential function to these points to approximate intermediate requests or extrapolate to larger resource requests.
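One way to collect these scaling measurements is to submit the same script several times while overriding the CPU request on the command line (options passed to sbatch take precedence over the #SBATCH directives in the script); the script name here is a placeholder:

    # Submit the same job with 1, 2, 4, and 8 CPUs to measure scaling
    for n in 1 2 4 8; do
        sbatch --cpus-per-task=$n --job-name=scaling_$n resource_test.sh
    done

The elapsed time of each run can then be compared (for example with seff, described below) to fit your scaling function.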
seff and reportseff
seff and reportseff are useful tools for investigating your resource request efficiency. They will provide statistics for individual jobs (seff <job id>) or a report of multiple jobs (reportseff -u <user name>). seff statistics list the used and requested resources as well as a percentage efficiency. reportseff statistics include:
- the time efficiency of the job (TimeEff), which is the percentage use of the requested walltime;
- the CPU efficiency of the job (CPUEff), which is the percentage use of the requested CPU cores;
- the memory efficiency of the job (MemEff), which is the percentage use of the requested memory.
You can use these tools to get a quick measurement of your resource request usage.
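For example (the job ID is a placeholder; the exact fields printed depend on the versions installed on the cluster):

    # Efficiency summary for a single completed job
    seff 12345678

    # Efficiency report covering your recent jobs
    reportseff -u $USER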
Note
reportseff can sometimes generate malformed output if the number of lines overflows the terminal window; in such cases, pipe the output of reportseff through more: reportseff <options> | more.
The reportseff developers can be reached at https://github.com/troycomi/reportseff.