SLURM Check, Modify and Cancel a Job using the scontrol & scancel commands
scontrol command
Besides the brief listing of jobs that the squeue command provides, you
can also see the detailed information for a single job. Run the SLURM
command scontrol show with a job ID:
$ scontrol show job 8929
JobId=8929 JobName=test
UserId=nobody(804293) GroupId=helpdesk(2103) MCS_label=N/A
Priority=404 Nice=0 Account=classres QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2018-08-01T14:33:04 EligibleTime=2018-08-01T14:33:04
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-08-03T12:38:48
Partition=general-short-14,general-short-16,general-short-18,general-long-14,general-long-16,general-long-18,classres-14,classres-16 AllocNode:Sid=dev-intel18:4996
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=80-80 NumCPUs=160 NumTasks=80 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=40,mem=80G,node=40,gres/gpu=40
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=2G MinTmpDiskNode=0
Features=intel14 DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/mnt/home/changc81/GetExample/helloMPI/test
WorkDir=/mnt/home/changc81/GetExample/helloMPI
Comment=stdout=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
StdErr=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
StdIn=/dev/null
StdOut=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
Power=
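When only one or two fields of this listing matter, they can be pulled out with standard text tools. The following is a minimal sketch, not part of SLURM: job_field is a hypothetical helper name, and the sample text is copied from the listing above so that the snippet runs without a scheduler. It assumes that the values contain no spaces.

```shell
# job_field KEY: read `scontrol show job`-style text on stdin and print
# the value of the first KEY=value token. (Hypothetical helper name.)
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++)
            if (index($i, key "=") == 1) {   # token starts with "KEY="
                print substr($i, length(key) + 2)
                exit
            }
    }'
}

# Sample lines copied from the listing above.
sample='JobId=8929 JobName=test
   JobState=PENDING Reason=Resources Dependency=(null)
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState    # prints PENDING
printf '%s\n' "$sample" | job_field TimeLimit   # prints 00:01:00
```

On a cluster, the same function could be fed from the real command, for example scontrol show job 8929 | job_field JobState.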
You can check whether the information is right for the job. If the job
has not started to run and you would like to change any specification,
you can hold the job first using the scontrol hold command:
$ scontrol hold 8929
$ squeue -l -u $USER
Fri Aug 3 12:26:57 2018
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
8929 general-s test nobody PENDING 0:00 1:00 80 (JobHeldUser)
The squeue output shows that the job is pending because of the user's
hold (JobHeldUser). To change a setting, choose the field you want from
the scontrol show results, put it in the scontrol update command and
give the new value after the = symbol. For example, the command line
$ scontrol update job 8929 NumNodes=2-2 NumTasks=2 Features=intel16
will change the resource request of job 8929 from 80 nodes and 80
tasks on intel14 nodes to 2 nodes and 2 tasks on intel16 nodes.
After the update, you can use the scontrol show command again to verify
the job settings. Once you are done with the update, you can release
the job hold with the scontrol release command:
$ scontrol release 8929
$ squeue -l -u $USER
Fri Aug 3 13:18:10 2018
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
8929 general-s test nobody RUNNING 0:07 1:00 2 lac-[386-387]
The job is now running because of the change to the resource request
made by the scontrol update command. Again, we can check the running job
using the scontrol show command:
$ scontrol show job 8929
JobId=8929 JobName=test
UserId=changc81(804793) GroupId=helpdesk(2103) MCS_label=N/A
Priority=379 Nice=0 Account=classres QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:08 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2018-08-01T14:33:04 EligibleTime=2018-08-01T14:33:04
StartTime=2018-08-03T13:18:03 EndTime=2018-08-03T13:18:11 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2018-08-03T13:18:03
Partition=general-long-16 AllocNode:Sid=dev-intel18:4996
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lac-[386-387]
BatchHost=lac-386
NumNodes=2 NumCPUs=4 NumTasks=2 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=4G,node=2,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=2G MinTmpDiskNode=0
Features=intel16 DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/mnt/home/changc81/GetExample/helloMPI/test
WorkDir=/mnt/home/changc81/GetExample/helloMPI
Comment=stdout=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
StdErr=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
StdIn=/dev/null
StdOut=/mnt/home/changc81/GetExample/helloMPI/slurm-8929.out
Power=
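Checking a listing like this by eye against the values passed to scontrol update can also be scripted. The sketch below is only an illustration: verify_fields is a hypothetical helper name, and the sample text is copied from the listing above so that it runs without a scheduler. Note that the listing reports NumNodes=2 even though the update requested NumNodes=2-2, so the check uses the reported form.

```shell
# verify_fields TEXT KEY=VALUE...: return 0 only if every KEY=VALUE
# appears as a whole token in TEXT. (Hypothetical helper name.)
verify_fields() {
    # Flatten newlines and pad with spaces so only whole tokens match.
    text=" $(printf '%s' "$1" | tr '\n' ' ') "
    shift
    status=0
    for want in "$@"; do
        case $text in
            *" $want "*) ;;                              # setting found
            *) echo "not found: $want" >&2; status=1 ;;  # setting missing
        esac
    done
    return "$status"
}

# Sample lines copied from the listing above.
sample='   NumNodes=2 NumCPUs=4 NumTasks=2 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   Features=intel16 DelayBoot=00:00:00'

if verify_fields "$sample" NumNodes=2 NumTasks=2 Features=intel16; then
    echo "job settings match the request"
fi
```

On a cluster, the first argument could be the captured output of scontrol show job 8929.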
For complete usage information about the scontrol command, please refer
to https://slurm.schedmd.com/scontrol.html on the SLURM web site.
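The hold, update, verify and release steps walked through in this section can be collected into one small script. This is a hedged sketch rather than a recommended tool: modify_job, run and DRY_RUN are hypothetical names introduced here, and by default the script only prints the scontrol commands it would run, so it can be tried without a scheduler.

```shell
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# modify_job JOBID KEY=VALUE...  (hypothetical helper name)
modify_job() {
    jobid=$1
    shift
    run scontrol hold "$jobid" || return 1
    # If the update fails, stop here; the job stays held for another attempt.
    run scontrol update job "$jobid" "$@" || return 1
    run scontrol show job "$jobid" || return 1
    run scontrol release "$jobid"
}

modify_job 8929 NumNodes=2-2 NumTasks=2 Features=intel16
```

With DRY_RUN=0 the same four commands would be executed in order. Stopping at the first failure leaves the job held, which is an assumption about what is most convenient, not a SLURM requirement.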
scancel command
If at any moment before the job completes you would like to remove
the job, you can use the scancel command to cancel it. For example,
the command
$ scancel 8929
will cancel job 8929. For complete usage information about the scancel
command, please refer to https://slurm.schedmd.com/scancel.html on the
SLURM web site.
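Because a cancelled job is removed, it can be worth guarding against typing mistakes before calling scancel. The sketch below is an illustration only: cancel_jobs and DRY_RUN are hypothetical names, the second argument is a deliberately invalid ID, and the check accepts plain numeric job IDs only, which is a simplification. By default it prints the scancel commands instead of running them.

```shell
# cancel_jobs JOBID...: cancel each plain numeric job ID, skipping
# anything else. (Hypothetical helper name.)
cancel_jobs() {
    for jobid in "$@"; do
        case $jobid in
            ''|*[!0-9]*)
                echo "skipping invalid job ID: $jobid" >&2
                continue
                ;;
        esac
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "scancel $jobid"     # dry run: only show the command
        else
            scancel "$jobid"
        fi
    done
}

cancel_jobs 8929 89x9
```

In the default dry-run mode this prints scancel 8929 on standard output and reports the invalid ID on standard error.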