Showing job steps
SLURM provides commands that show execution information for each command line in a job script, which can be helpful for debugging and testing. In order to record this information, commands need to be launched with the srun wrapper. Let's take a look at the following job script:
#!/bin/bash
#SBATCH -N 4 -n 4 -c 2
#SBATCH --time=00:05:00
#SBATCH --mem=1G
echo "This script is from ICER's SLURM job steps tutorial"
module purge; module load GCC/6.4.0-2.28 OpenMPI/2.1.2
module list
mpicc mpi-hello.c -o hello.exe
echo; echo "====== mpirun hello.exe ======"
mpirun hello.exe # Step 0
echo; echo "====== srun hello.exe ======"
srun hello.exe # Step 1
echo; echo "====== srun -n 8 -c 1 hello.exe ======"
srun -n 8 -c 1 hello.exe # Step 2
echo; echo "====== srun ======"
srun NoSuchCommand # Step 3
echo; echo "====== mpirun ======"
mpirun NoSuchCommand # Step 4
echo; echo "====== scontrol show job $SLURM_JOB_ID ======"
srun -N 1 -n 1 -c 1 scontrol show job $SLURM_JOB_ID # Step 5
Although the script contains many command lines, only 6 of them are executed with the mpirun or srun wrapper; these are marked with comments (Step 0 through Step 5) at the end of each line. SLURM records each of these 6 executions as a job step. Once the job is submitted with the sbatch command and starts running, you can use the sacct command to check the steps:
$ sacct -j 10732
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10732 test general-l+ classres 8 COMPLETED 0:0
10732.batch batch classres 2 COMPLETED 0:0
10732.extern extern classres 8 COMPLETED 0:0
10732.0 orted classres 6 COMPLETED 0:0
10732.1 hello.exe classres 8 COMPLETED 0:0
10732.2 hello.exe classres 8 COMPLETED 0:0
10732.3 NoSuchCom+ classres 8 FAILED 2:0
10732.4 orted classres 6 COMPLETED 0:0
10732.5 scontrol classres 1 COMPLETED 0:0
where the job has ID 10732 and the 6 steps are listed as JobID 10732.0 through 10732.5.
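For reference, sacct itself can also report some step-level details if additional fields are requested with its --format option. The command below is a sketch using standard sacct field names (run sacct --helpformat to list all available fields):
$ sacct -j 10732 --format=JobID,JobName,NTasks,NCPUS,MaxRSS,Elapsed,State,ExitCode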
We can also use the powertools command js to see more detailed information about the steps (such as memory usage and the list of nodes used):
$ js -j 10732 -C5
SLURM Job ID: 10732
===============================================================================================================================
JobID | 10732 | 10732.batch | 10732.extern | 10732.0 | 10732.1 |
JobName | test | batch | extern | orted | hello.exe |
User | changc81 | | | | |
NodeList | lac-[380-383] | lac-380 | lac-[380-383] | lac-[381-383] | lac-[380-383] |
NNodes | 4 | 1 | 4 | 3 | 4 |
NTasks | | 1 | 4 | 3 | 4 |
NCPUS | 8 | 2 | 8 | 6 | 8 |
ReqMem | 1Gn | 1Gn | 1Gn | 1Gn | 1Gn |
Timelimit | 00:05:00 | | | | |
Elapsed | 00:00:16 | 00:00:16 | 00:00:16 | 00:00:02 | 00:00:01 |
SystemCPU | 00:03.283 | 00:00.562 | 00:00.001 | 00:00.646 | 00:00.572 |
UserCPU | 00:02.119 | 00:00.753 | 00:00.003 | 00:00.396 | 00:00.281 |
TotalCPU | 00:05.403 | 00:01.316 | 00:00.005 | 00:01.042 | 00:00.853 |
AveCPULoad | 0.337687 | 0.08225 | 0.0003125 | 0.521 | 0.853 |
MaxRSS | | 10409K | 120K | 861K | 863K |
MaxVMSize | | 652100K | 173968K | 324440K | 324436K |
Start | 2018-08-06T13:22:44 | 2018-08-06T13:22:44 | 2018-08-06T13:22:44 | 2018-08-06T13:22:54 | 2018-08-06T13:22:57 |
End | 2018-08-06T13:23:00 | 2018-08-06T13:23:00 | 2018-08-06T13:23:00 | 2018-08-06T13:22:56 | 2018-08-06T13:22:58 |
ExitCode | 0:0 | 0:0 | 0:0 | 0:0 | 0:0 |
State | COMPLETED | COMPLETED | COMPLETED | COMPLETED | COMPLETED |
===============================================================================================================================
JobID | 10732.2 | 10732.3 | 10732.4 | 10732.5 |
JobName | hello.exe | NoSuchCommand | orted | scontrol |
User | | | | |
NodeList | lac-[380-383] | lac-[380-383] | lac-[381-383] | lac-380 |
NNodes | 4 | 4 | 3 | 1 |
NTasks | 8 | 4 | 3 | 1 |
NCPUS | 8 | 8 | 6 | 1 |
ReqMem | 1Gn | 1Gn | 1Gn | 1Gn |
Timelimit | | | | |
Elapsed | 00:00:01 | 00:00:00 | 00:00:01 | 00:00:00 |
SystemCPU | 00:01.141 | 00:00.051 | 00:00.289 | 00:00.017 |
UserCPU | 00:00.521 | 00:00.031 | 00:00.096 | 00:00.035 |
TotalCPU | 00:01.663 | 00:00.083 | 00:00.385 | 00:00.053 |
AveCPULoad | 1.663 | | 0.385 | |
MaxRSS | 34812K | 865K | 865K | 840K |
MaxVMSize | 324436K | 324436K | 324440K | 324436K |
Start | 2018-08-06T13:22:58 | 2018-08-06T13:22:59 | 2018-08-06T13:22:59 | 2018-08-06T13:23:00 |
End | 2018-08-06T13:22:59 | 2018-08-06T13:22:59 | 2018-08-06T13:23:00 | 2018-08-06T13:23:00 |
ExitCode | 0:0 | 2:0 | 0:0 | 0:0 |
State | COMPLETED | FAILED | COMPLETED | COMPLETED |
=========================================================================================================
From the results above, we can see that executions launched by mpirun differ from those launched by srun. First, for mpirun, the JobName only shows "orted" regardless of which command was run in steps 10732.0 and 10732.4, while srun shows the actual command in all of its steps (10732.1, 10732.2, 10732.3 and 10732.5). Second, the mpirun steps report only 3 tasks using 6 CPUs, whereas the srun steps correctly report 4 tasks with 8 CPUs in step 10732.1, 8 tasks with 8 CPUs in step 10732.2, and 1 task with 1 CPU in step 10732.5. Finally, steps 10732.3 and 10732.4 both ran the same command, NoSuchCommand, which does not exist and should therefore fail. However, the mpirun step is still recorded as COMPLETED without error; only the srun step is recorded in the FAILED state with an exit code of 2.
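Because srun passes this failure back to the job script as a non-zero exit code, the script itself can react to it. The lines below are a minimal sketch (not part of the tutorial script) showing how a job script could stop as soon as an srun step fails; hello.exe is the binary compiled above:
srun hello.exe                        # launch the parallel step with srun
if [ $? -ne 0 ]; then                 # srun exits non-zero when a task fails
    echo "srun step failed; stopping the job script here" >&2
    exit 1
fi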
From the job output shown below, we see no difference between the output of step 10732.0 (mpirun hello.exe) and step 10732.1 (srun hello.exe). SLURM appears to record accurate sacct information for the steps launched with srun, but not for those launched with mpirun. If you wish to use the step information, do not forget to launch your commands with srun (see the sketch after the job output below).
Currently Loaded Modules:
1) GCCcore/6.4.0 2) binutils/2.28 3) GCC/6.4.0-2.28 4) OpenMPI/2.1.1
====== mpirun hello.exe ======
Hello From: lac-380 I am the receiving processor 1 of 4
Hello From: lac-381 I am processor 2 of 4
Hello From: lac-382 I am processor 3 of 4
Hello From: lac-383 I am processor 4 of 4
====== srun hello.exe ======
Hello From: lac-380 I am the receiving processor 1 of 4
Hello From: lac-381 I am processor 2 of 4
Hello From: lac-382 I am processor 3 of 4
Hello From: lac-383 I am processor 4 of 4
====== srun -n 8 -c 1 hello.exe ======
Hello From: lac-380 I am the receiving processor 1 of 8
Hello From: lac-380 I am processor 2 of 8
Hello From: lac-381 I am processor 3 of 8
Hello From: lac-381 I am processor 4 of 8
Hello From: lac-382 I am processor 5 of 8
Hello From: lac-382 I am processor 6 of 8
Hello From: lac-383 I am processor 7 of 8
Hello From: lac-383 I am processor 8 of 8
====== srun ======
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
srun: error: lac-381: task 1: Exited with exit code 2
srun: error: lac-383: task 3: Exited with exit code 2
srun: error: lac-382: task 2: Exited with exit code 2
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
srun: error: lac-380: task 0: Exited with exit code 2
====== mpirun ======
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: lac-380
Executable: NoSuchCommand
--------------------------------------------------------------------------
====== scontrol show job 10732 ======
JobId=10732 JobName=test
UserId=changc81(804793) GroupId=helpdesk(2103) MCS_label=N/A
Priority=103 Nice=0 Account=classres QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:18 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2018-08-06T13:22:43 EligibleTime=2018-08-06T13:22:43
StartTime=2018-08-06T13:22:44 EndTime=2018-08-06T13:27:44 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-08-06T13:22:44
Partition=general-long-16 AllocNode:Sid=lac-249:5133
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lac-[380-383]
BatchHost=lac-380
NumNodes=4 NumCPUs=8 NumTasks=4 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=4G,node=4,billing=8
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/mnt/home/changc81/GetExample/helloMPI/test
WorkDir=/mnt/home/changc81/GetExample/helloMPI
Comment=stdout=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
StdErr=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
StdIn=/dev/null
StdOut=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
Power=
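Following that advice, the compute portion of the tutorial script could be rewritten so that every parallel command is recorded as a job step. The following is a minimal sketch based on the script above (same module versions and the same mpi-hello.c source file):
#!/bin/bash
#SBATCH -N 4 -n 4 -c 2
#SBATCH --time=00:05:00
#SBATCH --mem=1G
module purge; module load GCC/6.4.0-2.28 OpenMPI/2.1.2
mpicc mpi-hello.c -o hello.exe
# Launch every parallel command with srun so SLURM records it as a job step
srun hello.exe
srun -n 8 -c 1 hello.exe
srun -N 1 -n 1 -c 1 scontrol show job $SLURM_JOB_ID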
For complete documentation of the sacct command, please refer to the SLURM web site.