Showing job steps

SLURM provides commands that show execution information for each command line in a job script, which can be helpful for debugging and testing. To record this information, the commands must be launched with the srun wrapper. Let's take a look at the following job script:

#!/bin/bash

#SBATCH -N 4 -n 4 -c 2
#SBATCH --time=00:05:00
#SBATCH --mem=1G

module purge; module load GCC/6.4.0-2.28 OpenMPI/2.1.2
module list

mpicc mpi-hello.c -o hello.exe

echo; echo "====== mpirun hello.exe ======"
mpirun hello.exe                                            #0 Step

echo; echo "====== srun hello.exe ======"
srun hello.exe                                              #1 Step

echo; echo "====== srun -n 8 -c 1 hello.exe ======"
srun -n 8 -c 1 hello.exe                                    #2 Step

echo; echo "====== srun  ======"
srun NoSuchCommand                                          #3 Step

echo; echo "====== mpirun  ======"
mpirun NoSuchCommand                                        #4 Step

echo; echo "====== scontrol show job $SLURM_JOB_ID ======"
srun -N 1 -n 1 -c 1 scontrol show job $SLURM_JOB_ID         #5 Step

Although the script contains many command lines, only 6 of them are executed through the mpirun or srun wrapper; these are marked with comments (step 0 through step 5) at the end of the lines. SLURM records each of these 6 executions as a job step. Note that the #SBATCH line requests 4 nodes, 4 tasks, and 2 CPUs per task. Once the job is submitted with the sbatch command and starts running, you can use the sacct command to check the steps:

$ sacct -j 10732
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10732              test general-l+   classres          8  COMPLETED      0:0
10732.batch       batch              classres          2  COMPLETED      0:0
10732.extern     extern              classres          8  COMPLETED      0:0
10732.0           orted              classres          6  COMPLETED      0:0
10732.1       hello.exe              classres          8  COMPLETED      0:0
10732.2       hello.exe              classres          8  COMPLETED      0:0
10732.3      NoSuchCom+              classres          8     FAILED      2:0
10732.4           orted              classres          6  COMPLETED      0:0
10732.5        scontrol              classres          1  COMPLETED      0:0

where the job has ID 10732 and the 6 steps are shown as JobID 10732.0 through 10732.5.
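If you only need particular fields, plain sacct can report per-step details directly with its --format option (the field names and the %12 width modifier below are standard sacct format options; the column selection is just an example):

```shell
# List selected per-step fields for job 10732. JobID, JobName,
# NNodes, NTasks, NCPUS, MaxRSS, Elapsed, State, and ExitCode
# are all standard sacct format fields; %12 widens the JobName column.
sacct -j 10732 --format=JobID,JobName%12,NNodes,NTasks,NCPUS,MaxRSS,Elapsed,State,ExitCode
```

This avoids parsing the full default table when you only care about, say, task counts and memory usage per step.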

We can also use the powertools command js to see more detailed information about the steps, such as memory usage and the list of nodes used:

$ js -j 10732 -C5

SLURM Job ID: 10732
===============================================================================================================================
          JobID |               10732 |         10732.batch |        10732.extern |             10732.0 |             10732.1 |
        JobName |                test |               batch |              extern |               orted |           hello.exe |
           User |            changc81 |                     |                     |                     |                     |
       NodeList |       lac-[380-383] |             lac-380 |       lac-[380-383] |       lac-[381-383] |       lac-[380-383] |
         NNodes |                   4 |                   1 |                   4 |                   3 |                   4 |
         NTasks |                     |                   1 |                   4 |                   3 |                   4 |
          NCPUS |                   8 |                   2 |                   8 |                   6 |                   8 |
         ReqMem |                 1Gn |                 1Gn |                 1Gn |                 1Gn |                 1Gn |
      Timelimit |            00:05:00 |                     |                     |                     |                     |
        Elapsed |            00:00:16 |            00:00:16 |            00:00:16 |            00:00:02 |            00:00:01 |
      SystemCPU |           00:03.283 |           00:00.562 |           00:00.001 |           00:00.646 |           00:00.572 |
        UserCPU |           00:02.119 |           00:00.753 |           00:00.003 |           00:00.396 |           00:00.281 |
       TotalCPU |           00:05.403 |           00:01.316 |           00:00.005 |           00:01.042 |           00:00.853 |
     AveCPULoad |            0.337687 |             0.08225 |           0.0003125 |               0.521 |               0.853 |
         MaxRSS |                     |              10409K |                120K |                861K |                863K |
      MaxVMSize |                     |             652100K |             173968K |             324440K |             324436K |
          Start | 2018-08-06T13:22:44 | 2018-08-06T13:22:44 | 2018-08-06T13:22:44 | 2018-08-06T13:22:54 | 2018-08-06T13:22:57 |
            End | 2018-08-06T13:23:00 | 2018-08-06T13:23:00 | 2018-08-06T13:23:00 | 2018-08-06T13:22:56 | 2018-08-06T13:22:58 |
       ExitCode |                 0:0 |                 0:0 |                 0:0 |                 0:0 |                 0:0 |
          State |           COMPLETED |           COMPLETED |           COMPLETED |           COMPLETED |           COMPLETED |
===============================================================================================================================
          JobID |             10732.2 |             10732.3 |             10732.4 |             10732.5 |
        JobName |           hello.exe |       NoSuchCommand |               orted |            scontrol |
           User |                     |                     |                     |                     |
       NodeList |       lac-[380-383] |       lac-[380-383] |       lac-[381-383] |             lac-380 |
         NNodes |                   4 |                   4 |                   3 |                   1 |
         NTasks |                   8 |                   4 |                   3 |                   1 |
          NCPUS |                   8 |                   8 |                   6 |                   1 |
         ReqMem |                 1Gn |                 1Gn |                 1Gn |                 1Gn |
      Timelimit |                     |                     |                     |                     |
        Elapsed |            00:00:01 |            00:00:00 |            00:00:01 |            00:00:00 |
      SystemCPU |           00:01.141 |           00:00.051 |           00:00.289 |           00:00.017 |
        UserCPU |           00:00.521 |           00:00.031 |           00:00.096 |           00:00.035 |
       TotalCPU |           00:01.663 |           00:00.083 |           00:00.385 |           00:00.053 |
     AveCPULoad |               1.663 |                     |               0.385 |                     |
         MaxRSS |              34812K |                865K |                865K |                840K |
      MaxVMSize |             324436K |             324436K |             324440K |             324436K |
          Start | 2018-08-06T13:22:58 | 2018-08-06T13:22:59 | 2018-08-06T13:22:59 | 2018-08-06T13:23:00 |
            End | 2018-08-06T13:22:59 | 2018-08-06T13:22:59 | 2018-08-06T13:23:00 | 2018-08-06T13:23:00 |
       ExitCode |                 0:0 |                 2:0 |                 0:0 |                 0:0 |
          State |           COMPLETED |              FAILED |           COMPLETED |           COMPLETED |
=========================================================================================================

From the results above, we can see that executions launched by mpirun differ from those launched by srun. First, for mpirun, the JobName only shows "orted" (the Open MPI runtime daemon) no matter what command is run, as in steps 10732.0 and 10732.4, whereas srun shows the actual command name in all of its steps (10732.1, 10732.2, 10732.3, and 10732.5). Second, the mpirun steps report only 3 tasks with 6 CPUs, while the srun steps correctly report 4 tasks with 8 CPUs in step 10732.1, 8 tasks with 8 CPUs in step 10732.2, and 1 task with 1 CPU in step 10732.5. Finally, steps 10732.3 and 10732.4 both ran the same nonexistent command, NoSuchCommand, which should fail with a "No such file or directory" error. However, the mpirun step is still recorded as COMPLETED without error; only the srun step is recorded with the FAILED state and exit code 2.

From the job output in the following results, we see no difference between the output of step 10732.0 (mpirun hello.exe) and step 10732.1 (srun hello.exe). However, SLURM records accurate sacct information only for srun, not for mpirun. If you wish to use the step information, do not forget to launch your commands with srun.
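Since sacct labels each step with the command name, you can also give a step an explicit name using srun's -J (--job-name) option, which makes the sacct listing easier to read when several steps run the same executable (a sketch; the step name "mpi-test" is just an example):

```shell
# Inside a job script: name the step explicitly so sacct shows
# "mpi-test" as the JobName instead of the executable name.
# -J / --job-name is a standard srun option.
srun -J mpi-test -n 8 -c 1 hello.exe
```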

Currently Loaded Modules:
  1) GCCcore/6.4.0   2) binutils/2.28   3) GCC/6.4.0-2.28   4) OpenMPI/2.1.1


====== mpirun hello.exe ======
Hello From: lac-380 I am the recieving processor 1 of 4
Hello From: lac-381              I am processor 2 of 4
Hello From: lac-382              I am processor 3 of 4
Hello From: lac-383              I am processor 4 of 4

====== srun hello.exe ======
Hello From: lac-380 I am the recieving processor 1 of 4
Hello From: lac-381              I am processor 2 of 4
Hello From: lac-382              I am processor 3 of 4
Hello From: lac-383              I am processor 4 of 4

====== srun -n 8 -c 1 hello.exe ======
Hello From: lac-380 I am the recieving processor 1 of 8
Hello From: lac-380              I am processor 2 of 8
Hello From: lac-381              I am processor 3 of 8
Hello From: lac-381              I am processor 4 of 8
Hello From: lac-382              I am processor 5 of 8
Hello From: lac-382              I am processor 6 of 8
Hello From: lac-383              I am processor 7 of 8
Hello From: lac-383              I am processor 8 of 8

====== srun  ======
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
srun: error: lac-381: task 1: Exited with exit code 2
srun: error: lac-383: task 3: Exited with exit code 2
srun: error: lac-382: task 2: Exited with exit code 2
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
srun: error: lac-380: task 0: Exited with exit code 2

====== mpirun  ======
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       lac-380
Executable: NoSuchCommand
--------------------------------------------------------------------------

====== scontrol show job 10732 ======
JobId=10732 JobName=test
   UserId=changc81(804793) GroupId=helpdesk(2103) MCS_label=N/A
   Priority=103 Nice=0 Account=classres QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:18 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2018-08-06T13:22:43 EligibleTime=2018-08-06T13:22:43
   StartTime=2018-08-06T13:22:44 EndTime=2018-08-06T13:27:44 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-08-06T13:22:44
   Partition=general-long-16 AllocNode:Sid=lac-249:5133
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=lac-[380-383]
   BatchHost=lac-380
   NumNodes=4 NumCPUs=8 NumTasks=4 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=4G,node=4,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/changc81/GetExample/helloMPI/test
   WorkDir=/mnt/home/changc81/GetExample/helloMPI
   Comment=stdout=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
   StdErr=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
   StdIn=/dev/null
   StdOut=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
   Power=

For complete instructions on the sacct command, please refer to the SLURM web site.