Display Compute Nodes and Job Partitions with the sinfo Command
Compute Node Information
Before submitting a job that needs a lot of resources, it is a good idea
to check what is available: which nodes are free, and how many cores and
how much memory those nodes have. This helps keep the job from waiting in
the queue longer than necessary. The SLURM command sinfo lists the nodes
controlled by the job scheduler. For example, run the command
sinfo -N -r -l, where -N shows node-oriented output, -r shows only nodes
that are responsive to SLURM, and -l gives the long description.
However, sinfo lists each node once for every partition it belongs to,
which makes the output repetitive. The powertools command node_status
gives a more compact view:
$ node_status # powertools command
Wed Apr 22 11:14:40 EDT 2020
NodeName Account State CPU(Load:Aloc Idl:Tot) Mem(Aval:Tot)Mb GPU(I:T) Reason
----------------------------------------------------------------------------------------------------------
csm-001 general ALLOCATED 13.61: 20 0: 20 45186: 246640 N/A
csm-002 albrecht MIXED 10.14: 15 5: 20 1072: 246640 N/A
csm-003 colej ALLOCATED 7.45: 20 0: 20 50032: 246640 N/A
......
csn-005 general MIXED 9.92: 12 8: 20 16160: 118012 k20(0:2)
......
cs* => 33.3%(buyin) 91.4%(162) 43.6%: 59.5%( 3240) 69.9%(17.0Tb) 97%( 78) Usage%(Total)
......
......
lac-078 general MIXED 11.38: 8 20: 28 69884: 118012 N/A
lac-079 ptg ALLOCATED 22.37: 28 0: 28 15612: 118012 N/A
lac-080 merzjrke MIXED 2.48: 16 12: 28 50032: 246640 k80(0:8)
......
......
vim-002 ccg MIXED 66.14: 63 81:144 5427008:6145856 N/A
intel16 => 69.0%(buyin) 98.8%(429) 55.2%: 65.1%(12200) 76.6%(79.9Tb) 70%(384) Usage%(Total)
intel18 => 63.6%(buyin) 99.4%(176) 45.8%: 55.8%( 7040) 77.1%(31.3Tb) 55%( 64) Usage%(Total)
Summary => 60.3%(buyin) 97.4%(773) 51.2%: 61.9%(22816) 73.1%( 142Tb) 72%(526) Usage%(Total)
The output of node_status is a good reference for finding out how many
nodes are available for your jobs. It shows node names, buy-in accounts,
node states, CPU cores, memory, GPUs, and the reason a node is unavailable.
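On a system where node_status is not installed, a similar one-row-per-node view can be approximated with sinfo alone. The following is a minimal sketch, not part of powertools: unique_nodes is a hypothetical helper name, and the format string in the usage comment (%N node name, %T state, %C CPUs as allocated/idle/other/total, %e free memory in MB) is only one reasonable choice.

```shell
# unique_nodes: keep only the first row seen for each node name (column 1).
# Reads whitespace-separated rows on stdin and preserves input order.
# Hypothetical helper for illustration; not a SLURM or powertools command.
unique_nodes() {
    awk '!seen[$1]++'
}

# Example use on a cluster (commented out so the snippet can be sourced anywhere):
#   sinfo -N -r -h -o "%N %T %C %e" | unique_nodes
```

Keying on the node name alone means that any partition column in the input is taken from the first row for that node, so this is best used with a format that leaves the partition out.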
If you need more complete details about a particular node, use the
scontrol show node -a <node_name> command:
$ scontrol show node -a skl-166
NodeName=skl-166 Arch=x86_64 CoresPerSocket=20
CPUAlloc=0 CPUTot=40 CPULoad=0.01
AvailableFeatures=skl,gbe,intel18,ib,edr18
ActiveFeatures=skl,gbe,intel18,ib,edr18
Gres=(null)
NodeAddr=skl-166 NodeHostName=skl-166 Version=18.08
OS=Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018
RealMemory=376162 AllocMem=0 FreeMem=382562 Sockets=2 Boards=1
State=DOWN ThreadsPerCore=1 TmpDisk=174080 Weight=103 Owner=N/A MCS_label=N/A
Partitions=general-short,general-short-18,general-long,general-long-18,qian-18,nvl-benchmark-18,piermaro-18,vmante-18,liulab-18,devolab-18,tsangm-18,plzbuyin-18,chenlab-18,shadeash-colej-18,allenmc-18,cmse-18,seiswei-18,niederhu-18,daylab-18,junlin-18,mitchmcg-18,pollyhsu-18,davidroy-18,yueqibuyin-18,eisenlohr-18
BootTime=2019-02-11T15:07:38 SlurmdStartTime=2019-02-11T15:08:44
CfgTRES=cpu=40,mem=376162M,billing=57176
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Currently being imaged [fordste5@2019-02-11T09:49:30]
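For scripting, individual fields can be pulled out of this key=value output. The following is a minimal sketch that assumes the layout shown above. node_field is a hypothetical helper name, and it only handles values without spaces, so it suits fields such as State, CPUTot, or FreeMem but not OS or Reason.

```shell
# node_field KEY: print the value of KEY from `scontrol show node` output on stdin.
# Hypothetical helper for illustration. Values containing spaces (OS=, Reason=)
# are cut at the first space, because fields are split on whitespace.
node_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '$1 == key && !found { print substr($0, length(key) + 2); found = 1 }'
}

# Example use on a cluster (commented out so the snippet can be sourced anywhere):
#   scontrol show node -a skl-166 | node_field State
```

Splitting on the first "=" only is deliberate: a value such as CfgTRES=cpu=40,mem=376162M,billing=57176 contains further "=" characters that belong to the value.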
SLURM Partitions for Jobs
One of the important details about a node is what kind of jobs can run
on it. For example, on a buy-in node, jobs from non-buy-in users can run
only if their walltime is 4 hours or less. To see a summary of all
partitions, use sinfo with the -s option:
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
general-short up 4:00:00 729/26/16/771 csm-[001-005,007-010,017-022],csn-[001-039],csp-[006-007,016-020,025-026],css-[001-003,007-012,014,016-020,023,032-036,038-045,047-050,052-067,071-072,074-076,079-085,087-095,097-103,106-109,111-127],lac-[000-225,228-247,250-261,276-369,372,374-445],nvl-[000-007],qml-[000-005],skl-[000-167],vim-[000-002]
general-long up 7-00:00:00 269/0/8/277 csm-001,csn-020,csp-[006-007,016-018,020,025],css-[008-012,014,016-019,023,032,034-036,038-045,047-050,052-066,071,075-076,079-080,083,087-089,092-095,097-099,107,118,121,124,126],lac-[038-044,078,123,209,217,225,228,230-235,246-247,276-284,300-301,336-339,353-360,363-364,372,374-399,401-420,422-445],skl-[023,026-112]
general-long-bigmem up 7-00:00:00 17/0/0/17 lac-[252-253,306],qml-[000,005],skl-[143-147,162-167],vim-001
general-long-gpu up 7-00:00:00 46/12/0/58 csn-[001-019,021-036],lac-[030,087,137,143,192-199,287-290,292-293,342,348],nvl-[005-007]
The output lists the job partitions along with their walltime limits and
nodes. In the NODES(A/I/O/T) column, the four counts are allocated, idle,
other, and total nodes. More detailed information about a particular job
partition is available with the -p option:
$ sinfo -p general-long -r -l
Mon Jul 13 12:22:16 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
general-long up 7-00:00:00 1-infinite no NO all 2 draining lac-[231,247]
general-long up 7-00:00:00 1-infinite no NO all 1 drained css-053
general-long up 7-00:00:00 1-infinite no NO all 217 mixed csm-001,csp-[006,017-018,020,025],css-[010,018-019,023,032,034-035,038,044,047-049,052,055-056,061-066,075,088-089,098-099,107,118,126],lac-[038-044,078,123,209,217,225,228,230,232,234-235,276-280,282-284,300-301,336-337,339,353-360,363,372,374-382,384-399,401-420,423,427-445],skl-[023,026,028-029,031,033-034,036-042,044-046,048,050-067,069-079,081-094,096-106,108-112]
general-long up 7-00:00:00 1-infinite no NO all 50 allocated csn-020,csp-016,css-[008-009,011,016-017,036,039-043,045,050,054,057-060,083,087,092-095,097,121,124],lac-[233,246,281,338,364,383,422,424-426],skl-[027,030,032,035,043,047,049,068,080,095,107]
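The NODES(A/I/O/T) column of the sinfo -s summary packs four counts (allocated/idle/other/total) into a single field. As a rough sketch, the idle count for each partition can be split out with awk. idle_by_partition is a hypothetical helper name, and the column positions assume the default sinfo -s layout shown earlier.

```shell
# idle_by_partition: print "<partition> <idle node count>" for each row of
# `sinfo -s -h` output read on stdin. Hypothetical helper for illustration.
# Assumes column 1 is PARTITION and column 4 is NODES(A/I/O/T).
idle_by_partition() {
    awk 'NF >= 4 {
        name = $1
        sub(/\*$/, "", name)   # the default partition is marked with a trailing "*"
        split($4, n, "/")      # n[1]=allocated n[2]=idle n[3]=other n[4]=total
        print name, n[2]
    }'
}

# Example use on a cluster (commented out so the snippet can be sourced anywhere):
#   sinfo -s -h | idle_by_partition
```

A partition with an idle count of 0, such as general-long in the summary above, may still have free cores on nodes in the mixed state, so this count is a lower bound on what is usable.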
To list only the nodes that belong to specific job partitions, combine
-N and -p:
$ sinfo -N -l -r -p general-short,general-long
Mon Jul 13 12:25:58 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
csm-001 1 general-short mixed 20 2:10:1 246640 174080 101 gbe,ib,i none
csm-001 1 general-long mixed 20 2:10:1 246640 174080 101 gbe,ib,i none
csm-002 1 general-short mixed 20 2:10:1 246640 174080 101 gbe,ib,i none
csm-003 1 general-short mixed 20 2:10:1 246640 174080 101 gbe,ib,i none
csm-004 1 general-short mixed 20 2:10:1 246640 174080 101 gbe,ib,i none
csm-005 1 general-short mixed 20 2:10:1 246640 174080 101 gbe,ib,i none
...
...
skl-166 1 general-short mixed 40 2:20:1 376162 174080 103 skl,gbe, none
skl-167 1 general-short mixed 40 2:20:1 376162 174080 103 skl,gbe, none
vim-000 1 general-short mixed 64 4:16:1 306780 174080 102 gbe,inte none
vim-001 1 general-short mixed 64 4:16:1 306780 174080 102 gbe,inte none
vim-002 1 general-short allocated 144 8:18:1 614585 174080 102 gbe,inte none
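To narrow such a listing down to nodes that might fit a job, the rows can be filtered by state and size. This is a sketch under assumptions: candidate_nodes is a hypothetical helper name, and it expects rows in a custom format of node name, state, CPU count, and memory in MB (sinfo format fields %N %T %c %m) rather than the -l layout above. A node in the idle or mixed state is only a candidate, since it may still lack enough free cores or memory for the job.

```shell
# candidate_nodes MIN_CPUS MIN_MEM_MB: print names of nodes whose state starts
# with "idle" or "mixed" and whose total CPUs and memory meet the minimums.
# Reads rows of "<node> <state> <cpus> <memory_mb>" on stdin; a node that
# appears once per partition is printed only once.
# Hypothetical helper for illustration; not a SLURM command.
candidate_nodes() {
    awk -v min_cpus="$1" -v min_mem="$2" '
        $2 ~ /^(idle|mixed)/ && $3 + 0 >= min_cpus + 0 && $4 + 0 >= min_mem + 0 && !seen[$1]++ { print $1 }'
}

# Example use on a cluster (commented out so the snippet can be sourced anywhere):
#   sinfo -N -r -h -p general-short,general-long -o "%N %T %c %m" | candidate_nodes 40 300000
```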
For complete documentation of sinfo, please refer to the SLURM web page.