
SLURM: Check, Modify, and Cancel a Job Using the scontrol and scancel Commands

scontrol command

Besides the brief listing of jobs shown by the squeue command, a user can also view detailed information about each job. Run the SLURM command scontrol show job with a job ID:

$ scontrol show job 8929
JobId=8929 JobName=test
   UserId=nobody(804293) GroupId=helpdesk(2103) MCS_label=N/A
   Priority=404 Nice=0 Account=classres QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2018-08-01T14:33:04 EligibleTime=2018-08-01T14:33:04
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-08-03T12:38:48
   Partition=general-short-14,general-short-16,general-short-18,general-long-14,general-long-16,general-long-18,classres-14,classres-16 AllocNode:Sid=dev-intel18:4996
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=80-80 NumCPUs=160 NumTasks=80 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,mem=80G,node=40,gres/gpu=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=2G MinTmpDiskNode=0
   Features=intel14 DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/changc81/GetExample/helloMPI/test
   WorkDir=/mnt/home/changc81/GetExample/helloMPI
   Comment=stdout=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
   StdErr=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
   StdIn=/dev/null
   StdOut=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
   Power=
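Because every field in this output has the form Key=Value, a single field can be extracted with standard text tools. The shell function below is a minimal sketch of that idea. The name job_field is hypothetical (it is not part of SLURM), and the sketch assumes that field values contain no spaces. That holds for the fields shown above but may not hold for, say, a Command path that contains spaces.

```shell
# job_field KEY: read `scontrol show job` output on stdin and print the value of KEY.
# Hypothetical helper (not a SLURM command). Assumes values contain no spaces:
# every space is treated as a separator between Key=Value tokens.
job_field() {
    # Put each Key=Value token on its own line, then print everything after the
    # first token that starts with "KEY=". Values that themselves contain "="
    # (such as Comment=stdout=...) are printed intact.
    tr ' ' '\n' | awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}
```

For example, for the pending job shown above, the following command would print PENDING:

$ scontrol show job 8929 | job_field JobState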

You can check whether the information is correct for the job. If the job has not started running and you would like to change any specification, first hold the job using the scontrol hold command:

$ scontrol hold 8929
$ squeue -l -u $USER
Fri Aug  3 12:26:57 2018
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              8929 general-s     test   nobody  PENDING       0:00      1:00     80 (JobHeldUser)

As the squeue output shows, the job is pending because of the user's hold (reason JobHeldUser). Choose the fields you want to change from the scontrol show output, put them in the scontrol update command, and modify the values after the = sign. For example, the command line

$ scontrol update JobId=8929 NumNodes=2-2 NumTasks=2 Features=intel16

will change the resource request of job 8929 from 80 nodes and 80 tasks on intel14 nodes to 2 nodes and 2 tasks on intel16 nodes. After the update, you can run the scontrol show command again to verify the job settings. Once you have finished the update, release the hold with the scontrol release command:

$ scontrol release 8929
$ squeue -l -u $USER
Fri Aug  3 13:18:10 2018
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              8929 general-s     test   nobody  RUNNING       0:07      1:00      2 lac-[386-387]
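After a release, a job may still wait in the queue before it starts. Instead of re-running squeue by hand, a small loop can poll the job state. The sketch below relies on squeue's -h (no header), -j (select a job ID) and -o %T (job state in long form) options. The function name wait_until_started and the SQUEUE override, which lets you substitute a stand-in command for a dry run or a test, are hypothetical.

```shell
# wait_until_started JOBID [INTERVAL]: hypothetical helper that polls squeue every
# INTERVAL seconds (default 10) until the job is no longer PENDING, then prints
# its state. Prints GONE if squeue no longer reports the job.
# Set SQUEUE to another command to replace squeue (e.g. for a test).
wait_until_started() {
    local jobid=$1 interval=${2:-10} state
    while :; do
        # -h: no header, -j: only this job, -o %T: state in long form (PENDING, RUNNING, ...)
        state=$(${SQUEUE:-squeue} -h -j "$jobid" -o %T 2>/dev/null)
        [ "$state" != "PENDING" ] && break
        sleep "$interval"
    done
    echo "${state:-GONE}"
}
```

A held job also reports PENDING, so a call such as wait_until_started 8929 5 only makes sense after scontrol release; otherwise the loop never ends.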

The job is now running because its resource request was reduced with scontrol update. Again, we can check the running job with the scontrol show command:

$ scontrol show job 8929
JobId=8929 JobName=test
   UserId=changc81(804793) GroupId=helpdesk(2103) MCS_label=N/A
   Priority=379 Nice=0 Account=classres QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2018-08-01T14:33:04 EligibleTime=2018-08-01T14:33:04
   StartTime=2018-08-03T13:18:03 EndTime=2018-08-03T13:18:11 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-08-03T13:18:03
   Partition=general-long-16 AllocNode:Sid=dev-intel18:4996
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=lac-[386-387]
   BatchHost=lac-386
   NumNodes=2 NumCPUs=4 NumTasks=2 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=4G,node=2,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=2G MinTmpDiskNode=0
   Features=intel16 DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/home/changc81/GetExample/helloMPI/test
   WorkDir=/mnt/home/changc81/GetExample/helloMPI
   Comment=stdout=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
   StdErr=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
   StdIn=/dev/null
   StdOut=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
   Power=
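The hold, update and release steps above can be combined into one shell function. This is a minimal sketch, not a SLURM feature. The names update_held_job and SCONTROL (an override that lets you print the commands instead of running them, e.g. SCONTROL=echo) are hypothetical, and the update fields are passed unchanged to scontrol update in its JobId= form. If the update fails, the job is deliberately left on hold so you can inspect it with scontrol show before releasing it yourself.

```shell
# update_held_job JOBID FIELD=VALUE...: hypothetical helper that holds a pending job,
# applies the given scontrol update fields, and releases the job again.
# Set SCONTROL=echo to print the scontrol commands instead of running them.
update_held_job() {
    local jobid=$1
    shift
    # Keep the job from starting while its specification is being changed.
    ${SCONTROL:-scontrol} hold "$jobid" || return 1
    if ! ${SCONTROL:-scontrol} update JobId="$jobid" "$@"; then
        # Leave the job held so the failed change can be inspected and retried.
        echo "update of job $jobid failed; the job is still held" >&2
        return 1
    fi
    ${SCONTROL:-scontrol} release "$jobid"
}
```

With the change used in this walkthrough, update_held_job 8929 NumNodes=2-2 NumTasks=2 Features=intel16 would issue the hold, update and release commands in sequence.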

For complete usage information about the scontrol command, please refer to https://slurm.schedmd.com/scontrol.html on the SLURM web site.

scancel command

If at any point before the job completes you would like to remove it, use the scancel command to cancel the job. For example, the command

$ scancel 8929

will cancel job 8929. For complete usage information about the scancel command, please refer to https://slurm.schedmd.com/scancel.html on the SLURM web site.
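A cancelled job cannot be resumed (it has to be resubmitted), so it can be worth confirming the job ID before cancelling. The function below is a hypothetical sketch, not an option of scancel. It shows the job name and state using squeue -o '%j %T', asks for confirmation, and only then calls scancel. The SQUEUE and SCANCEL variables are assumed overrides for dry runs and testing.

```shell
# cancel_with_confirm JOBID: hypothetical helper that shows a job's name and state,
# asks for confirmation, and cancels the job only if the answer starts with y or Y.
# SQUEUE and SCANCEL can be set to other commands for a dry run (e.g. SCANCEL=echo).
cancel_with_confirm() {
    local jobid=$1 info answer
    # %j: job name, %T: job state in long form; -h suppresses the header line.
    info=$(${SQUEUE:-squeue} -h -j "$jobid" -o '%j %T') || return 1
    if [ -z "$info" ]; then
        echo "job $jobid not found in the queue" >&2
        return 1
    fi
    printf 'Cancel job %s (%s)? [y/N] ' "$jobid" "$info"
    read -r answer
    case $answer in
        [Yy]*) ${SCANCEL:-scancel} "$jobid" ;;
        *) echo "not cancelled"; return 1 ;;
    esac
}
```

Any answer other than one starting with y or Y, including an empty line, leaves the job untouched.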