How Jobs are Scheduled
Schedulers
SLURM schedules jobs in two ways: the main scheduler and the backfill scheduler. The main scheduler constantly tries to start high priority jobs. The backfill scheduler considers all jobs, and starts any jobs that won't defer the start time of a higher priority job.
| Scheduler | Function | When it Runs | Run Time |
|---|---|---|---|
| Main | Launches high priority jobs that can start immediately. Stops evaluating jobs once it encounters a job that cannot be started. | About every 2 seconds | 0.08-2 seconds |
| Backfill | Evaluates the entire queue. Launches jobs that won't interfere with the start time of a higher priority job. Sets jobs' StartTime and SchedNodeList. | 20 seconds after the last backfill cycle completes | 2-15+ minutes |
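The difference between the two passes can be illustrated with a toy model. This is only a sketch under strong assumptions, not SLURM's implementation: a single pool of CPUs, one blocked job holding a reservation, and a known `reserved_start` (minutes until that blocked job can begin). The `Job` class and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int
    cpus: int
    runtime: int  # requested wall time, in minutes

def main_pass(queue, free_cpus):
    """Start jobs in priority order and stop at the first job that does not fit."""
    started = []
    for job in sorted(queue, key=lambda j: -j.priority):
        if job.cpus > free_cpus:
            break  # the main scheduler stops evaluating here
        free_cpus -= job.cpus
        started.append(job.name)
    return started

def backfill_pass(queue, free_cpus, reserved_start):
    """Walk the whole queue. The first job that does not fit is treated as
    holding a reservation that begins `reserved_start` minutes from now; jobs
    after it start only if they fit now and end before that reservation."""
    started, blocked = [], None
    for job in sorted(queue, key=lambda j: -j.priority):
        fits = job.cpus <= free_cpus
        if fits and (blocked is None or job.runtime <= reserved_start):
            free_cpus -= job.cpus
            started.append(job.name)
        elif blocked is None:
            blocked = job.name
    return started, blocked

queue = [Job("a", 9000, 4, 60), Job("b", 8000, 16, 120),
         Job("c", 5000, 2, 30), Job("d", 4000, 2, 600)]
print(main_pass(queue, free_cpus=8))
print(backfill_pass(queue, free_cpus=8, reserved_start=90))
```

In this toy queue the main pass starts only `a`, because `b` does not fit and evaluation stops there. The backfill pass also starts `c`, which ends before `b` could begin, and skips `d`, which would run past that point.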
StartTime and SchedNodeList
The backfill scheduler sets the StartTime and SchedNodeList parameters
on jobs that can start within the next 7 days. These parameters can be
viewed in the output of scontrol show job <jobid>. StartTime
estimates when a job will start and SchedNodeList shows the nodes this
job might start on. StartTime is only an estimate. These values are
updated every time the backfill scheduler runs and may change as running
jobs complete and new jobs are submitted.
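As a minimal sketch, the two fields can be pulled out of the `scontrol show job <jobid>` text with a small parser. The helper name `scheduling_estimate` is hypothetical, and the sample text below is invented for illustration; real output contains many more fields.

```python
import re

def scheduling_estimate(scontrol_text):
    """Return (StartTime, SchedNodeList); a field that is absent comes back as None."""
    def field(name):
        # Match "Name=value" only when Name is not the tail of a longer key.
        match = re.search(rf"(?<!\w){name}=(\S+)", scontrol_text)
        return match.group(1) if match else None
    return field("StartTime"), field("SchedNodeList")

# Invented excerpt, used only to exercise the parser.
sample = """JobId=1234567 JobName=example
   StartTime=2030-01-01T12:00:00 EndTime=Unknown
   SchedNodeList=node-001
"""
start_time, sched_nodes = scheduling_estimate(sample)
print(start_time, sched_nodes)
```

On a cluster, the input text could come from `subprocess.run(["scontrol", "show", "job", jobid], capture_output=True, text=True).stdout`. Because the values are refreshed on every backfill cycle, a result read once should be treated as a snapshot.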
Minimum Job Requirements to Avoid Deferment
The backfill scheduler only protects a job from being deferred by lower priority jobs if the job meets certain criteria. Jobs that do not meet these criteria may have their start time pushed back when lower priority jobs are started ahead of them. These thresholds allow the backfill scheduler to cycle faster and maintain high system utilization.
| Criteria | Minimum | Description |
|---|---|---|
| Priority | 3000 | A job requires a minimum priority of 3000 to avoid potential deferment in scheduling. Buy-in account jobs are never below this threshold. |
Job Priority Factors
A job's priority is determined by a combination of several priority factors. Age, size, fairshare, and whether it was submitted to a buy-in account all contribute to the job’s priority.
| Priority Factor | Description | Maximum Contribution to Priority |
|---|---|---|
| Age | Starts at zero at job submission, then increases linearly to a maximum of 60000 after 30 days | 60000 after 30 days |
| Fairshare | Starts at 60000 and decreases as a user's recent usage goes up. Usage for this calculation is decayed 50% each day | 60000 for no recent cluster usage |
| Size | Scales linearly with the amount of CPU and memory requested by a job. 100 per CPU, 20 per GB. | 52000+ depending on memory requested |
| QOS | Adds 3000 to buy-in jobs to ensure they are always above the backfill scheduler's minimum priority for reserving resources | 3000 |
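The table can be turned into a rough estimate of a job's priority. This sketch assumes the factors simply add together, which the table implies but does not state outright, and it takes the FairShare factor as an input between 0 and 1. All function names are hypothetical.

```python
MAX_AGE_POINTS = 60000        # reached after 30 days in the queue
MAX_FAIRSHARE_POINTS = 60000  # awarded for no recent cluster usage
BUY_IN_QOS_POINTS = 3000
BACKFILL_MIN_PRIORITY = 3000  # threshold for avoiding deferment

def age_points(days_waiting):
    """Linear growth from 0 at submission to the maximum at 30 days."""
    return MAX_AGE_POINTS * min(days_waiting, 30) / 30

def size_points(cpus, mem_gb):
    """100 points per CPU plus 20 points per GB of memory requested."""
    return 100 * cpus + 20 * mem_gb

def estimate_priority(days_waiting, cpus, mem_gb, fairshare_factor, buy_in=False):
    """Sum the four contributions (assumption: the factors are additive)."""
    total = (age_points(days_waiting)
             + MAX_FAIRSHARE_POINTS * fairshare_factor
             + size_points(cpus, mem_gb)
             + (BUY_IN_QOS_POINTS if buy_in else 0))
    return round(total)

def is_protected(priority):
    """True when the priority meets the backfill scheduler's minimum."""
    return priority >= BACKFILL_MIN_PRIORITY

# A 4 CPU, 16 GB job that has waited 3 days with a FairShare factor of 0.5.
print(estimate_priority(3, 4, 16, 0.5))
```

For that example the contributions are 6000 (age), 30000 (fairshare), and 720 (size), for an estimate of 36720. The real value reported by SLURM may differ, since this sketch ignores any rounding or extra factors the cluster applies.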
FairShare
The FairShare priority factor of a job is calculated based on recent usage compared to overall cluster usage. Each user/account pair is assigned a "share" of the cluster based on the overall number of users in the accounting database. Usage is tracked based on the cluster's configured TRES (Trackable Resource) billing weights. A weight is set for CPUs, memory, and GPUs. Each job's allocated resources are multiplied by these weights and by the job's run time to get a combined measure of TRES seconds. TRES seconds are then tracked for each user/account pair. When a pair's consumed TRES seconds equal its share of the TRES seconds consumed across the entire cluster, the FairShare factor is 0.5 (30000 weighted). When consumed TRES seconds exceed double that share, the FairShare factor is zero.
Usage accrual for this calculation decays with a half life of one day and the effect of this decay is calculated every five minutes.
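The half-life decay can be written out directly. This is a minimal sketch that assumes no new usage is added during the elapsed time; the function name is hypothetical.

```python
HALF_LIFE_DAYS = 1.0
UPDATE_INTERVAL_DAYS = 5 / (24 * 60)  # the decay is applied every five minutes

def decayed_usage(tres_seconds, elapsed_days):
    """Usage still counted after `elapsed_days` with no new jobs."""
    return tres_seconds * 0.5 ** (elapsed_days / HALF_LIFE_DAYS)

# Multiplier applied at each five-minute update; 288 updates make one day.
step_factor = 0.5 ** (UPDATE_INTERVAL_DAYS / HALF_LIFE_DAYS)

print(decayed_usage(1046312, 3))  # three half-lives leave one eighth
print(step_factor ** 288)         # approximately 0.5
```

Applying the small per-update multiplier 288 times gives the same result as one 50% reduction per day, so the five-minute update interval changes how smoothly usage decays, not how fast.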
The exact weights and share values change with the size of the cluster and accounting database, but can be viewed using the fairshare_info powertool.
```
$ module load powertools
$ fairshare_info
TresBillingWeights:
CPU = 1.0
Memory = 0.152
GPU = 250
Current Total Cluster Usage: 3102285712231 TRES seconds
FairShare Ratio Per User: 0.000051
User Portion of Cluster: 158216571 TRES seconds
FairShare is calculated every 00:05:00 and your usage decays 50% every 1-00:00:00
The following usage will reduce FairShare priority by half, twice will zero it:
43949 CPU Hours
176 GPU Hours
282 GB Hours
281 1 Hour / 1 CPU / 1 GB Jobs
108 1 Hour / 1 CPU / 1 GB / 1 GPU Jobs
Current FairShare Priority Status:
Account      Priority   Usage (TRES Seconds)
--------------------------------------------------
general      59727      1046312
scavenger    60000      0
```
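The figures in this sample output can be reproduced with a few lines of arithmetic. One assumption is needed and is not confirmed by the text: the GB-based figures only match if the memory weight of 0.152 applies per MB, giving 0.152 × 1024 per GB. The variable and function names are hypothetical.

```python
CPU_WEIGHT = 1.0
MEM_WEIGHT_PER_GB = 0.152 * 1024  # assumption: 0.152 is a per-MB weight
GPU_WEIGHT = 250

total_cluster_usage = 3102285712231  # TRES seconds, from the sample output
fairshare_ratio = 0.000051

# A user's share of the cluster, in TRES seconds and in billed hours.
user_share_seconds = round(total_cluster_usage * fairshare_ratio)
user_share_hours = user_share_seconds / 3600

def hours_to_reach_share(cpus=0, mem_gb=0, gpus=0):
    """Hours of the given allocation that consume one full share."""
    rate = cpus * CPU_WEIGHT + mem_gb * MEM_WEIGHT_PER_GB + gpus * GPU_WEIGHT
    return round(user_share_hours / rate)

print(user_share_seconds)                             # User Portion of Cluster
print(hours_to_reach_share(cpus=1))                   # CPU Hours
print(hours_to_reach_share(gpus=1))                   # GPU Hours
print(hours_to_reach_share(mem_gb=1))                 # GB Hours
print(hours_to_reach_share(cpus=1, mem_gb=1))         # 1 CPU / 1 GB jobs
print(hours_to_reach_share(cpus=1, mem_gb=1, gpus=1)) # 1 CPU / 1 GB / 1 GPU jobs
```

Under that assumption the script yields 158216571, 43949, 176, 282, 281, and 108, matching the sample output line for line. It also shows why memory and GPUs dominate the bill: one GB-hour costs roughly as much as 156 CPU-hours, and one GPU-hour as much as 250.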