SLURM can report execution information for the individual commands run inside a job script, which is helpful for debugging and testing. To have this information recorded, the commands must be launched through the srun wrapper. Let's take a look at the following job script:
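(The original script is not reproduced here. The sketch below is an assumed reconstruction based on the output and step descriptions that follow; the resource directives and the hello.exe name come from that output, and everything else is illustrative.)

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=1G
#SBATCH --time=00:05:00

module load GCC/6.4.0-2.28 OpenMPI/2.1.1      # not a job step: no srun or mpirun wrapper
module list                                   # not a job step

echo "====== mpirun hello.exe ======"
mpirun hello.exe                              # step 0
echo "====== srun hello.exe ======"
srun hello.exe                                # step 1
echo "====== srun -n 8 -c 1 hello.exe ======"
srun -n 8 -c 1 hello.exe                      # step 2
echo "====== srun ======"
srun NoSuchCommand                            # step 3
echo "====== mpirun ======"
mpirun NoSuchCommand                          # step 4
echo "====== scontrol show job $SLURM_JOB_ID ======"
srun -n 1 scontrol show job $SLURM_JOB_ID     # step 5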
Although the script contains many command lines, only six of them are launched through the mpirun or srun wrapper; these are marked with comments (step 0 to step 5) at the end of each line. SLURM records each of these six executions as a job step. Once the job has been submitted with sbatch and starts running, you can check the steps with the sacct command:
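For example (the format fields chosen here are just one reasonable set; JobID, JobName, NTasks, AllocCPUS, State, and ExitCode are the columns discussed below):

sacct -j 10732 --format=JobID,JobName,NTasks,AllocCPUS,State,ExitCode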
From the sacct results, we can see that the executions launched with mpirun differ from those launched with srun. First, for mpirun, the JobName shows only "orted" regardless of the actual command, as in steps 10732.0 and 10732.4, whereas srun records the correct command name in all of its steps (10732.1, 10732.2, 10732.3 and 10732.5). Second, the mpirun steps report only 3 tasks with 6 CPUs, while the srun steps correctly report 4 tasks with 8 CPUs in step 10732.1, 8 tasks with 8 CPUs in step 10732.2, and 1 task with 1 CPU in step 10732.5. Finally, steps 10732.3 and 10732.4 both ran the same nonexistent command, NoSuchCommand, which should fail with a "No such file or directory" error. The mpirun step is nevertheless reported as completed without error; only the srun step is recorded in the FAILED state with exit code 2.
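If a job contains many steps, the failed ones can be picked out directly with sacct's state filter (a hedged example; option behavior may vary slightly between SLURM versions):

sacct -j 10732 --state=FAILED --format=JobID,JobName,State,ExitCode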
From the job output in the following results, we see no difference between the output of step 10732.0 (mpirun hello.exe) and that of step 10732.1 (srun hello.exe). In other words, the program runs the same either way, but SLURM only records useful sacct step information with srun, not with mpirun. If you want this step information, do not forget to launch your commands with srun.
Currently Loaded Modules:
1) GCCcore/6.4.0 2) binutils/2.28 3) GCC/6.4.0-2.28 4) OpenMPI/2.1.1
====== mpirun hello.exe ======
Hello From: lac-380 I am the recieving processor 1 of 4
Hello From: lac-381 I am processor 2 of 4
Hello From: lac-382 I am processor 3 of 4
Hello From: lac-383 I am processor 4 of 4
====== srun hello.exe ======
Hello From: lac-380 I am the recieving processor 1 of 4
Hello From: lac-381 I am processor 2 of 4
Hello From: lac-382 I am processor 3 of 4
Hello From: lac-383 I am processor 4 of 4
====== srun -n 8 -c 1 hello.exe ======
Hello From: lac-380 I am the recieving processor 1 of 8
Hello From: lac-380 I am processor 2 of 8
Hello From: lac-381 I am processor 3 of 8
Hello From: lac-381 I am processor 4 of 8
Hello From: lac-382 I am processor 5 of 8
Hello From: lac-382 I am processor 6 of 8
Hello From: lac-383 I am processor 7 of 8
Hello From: lac-383 I am processor 8 of 8
====== srun ======
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
srun: error: lac-381: task 1: Exited with exit code 2
srun: error: lac-383: task 3: Exited with exit code 2
srun: error: lac-382: task 2: Exited with exit code 2
slurmstepd: error: execve(): NoSuchCommand: No such file or directory
srun: error: lac-380: task 0: Exited with exit code 2
====== mpirun ======
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: lac-380
Executable: NoSuchCommand
--------------------------------------------------------------------------
====== scontrol show job 10732 ======
JobId=10732 JobName=test
UserId=changc81(804793) GroupId=helpdesk(2103) MCS_label=N/A
Priority=103 Nice=0 Account=classres QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:18 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2018-08-06T13:22:43 EligibleTime=2018-08-06T13:22:43
StartTime=2018-08-06T13:22:44 EndTime=2018-08-06T13:27:44 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-08-06T13:22:44
Partition=general-long-16 AllocNode:Sid=lac-249:5133
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lac-[380-383]
BatchHost=lac-380
NumNodes=4 NumCPUs=8 NumTasks=4 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=8,mem=4G,node=4,billing=8
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/mnt/home/changc81/GetExample/helloMPI/test
WorkDir=/mnt/home/changc81/GetExample/helloMPI
Comment=stdout=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
StdErr=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
StdIn=/dev/null
StdOut=/mnt/home/changc81/GetExample/helloMPI/slurm-10732.out
Power=
For complete documentation of the sacct command, please refer to the SLURM web site.