Job Policies

The following limits apply generally to all MSU users of the HPCC. Those at affiliate institutions may be working under slightly different policies. The limits are in place to help our large user community share the HPCC. However, if these policies are an impediment to completing your research, please contact us.

CPU and GPU usage limits

  • HPCC users who do not have a buy-in account are given a 'general' SLURM account. Starting in 2021, the general account is limited to 500,000 CPU hours (30,000,000 minutes) and 10,000 GPU hours (600,000 minutes) per calendar year (January 1 to December 31).

    • A CPU hour is the walltime of your job multiplied by the number of CPUs it uses; GPU hours are computed the same way with the number of GPUs. For example, a job that runs for 10 hours on 32 CPUs consumes 320 CPU hours.
  • There is no yearly usage limit on CPU or GPU time with a buy-in account. If you have a buy-in account, your jobs will run under that account by default, unless the manager of the buy-in account has chosen opt-in (jobs must then be submitted with the -A flag; see the example after this list) instead of opt-out.

  • Users with general accounts can use the powertools command SLURMUsage to check their used CPU and GPU time (in minutes) and remaining CPU and GPU time (in hours):

    $ ml powertools # load the powertools module if it is not already loaded

    $ SLURMUsage

  • Users without a buy-in account who reach these limits can request additional CPU/GPU hours by filling out the CPU/GPU Increase Request online form.
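
If your buy-in group uses opt-in mode, you select the buy-in account explicitly at submission time. A minimal sketch (the account and script names are hypothetical; substitute your own):

    $ sbatch -A my_buyin_account job.sb        # -A/--account selects the SLURM account (hypothetical names)
    $ squeue -u $USER -O jobid,account,state   # confirm which account the job runs under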

Limits on job resource requests

  • Time: A job can run for at most 7 days (168 hours), i.e. --time=168:00:00.
  • CPU: Users can use up to a total of 1040 cores and have at most 520 jobs running at any one time. The core limit is reflected in the SLURM parameter QOSMaxCpuPerUserLimit (buy-in groups that have purchased more than 1040 cores can exceed this limit).
  • Queue: The maximum number of jobs that can be queued or running per user is 1000 jobs.
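
A job script that stays within these limits might look like this minimal sketch (the job name, resource sizes, and executable are hypothetical; adjust them for your work):

    #!/bin/bash
    #SBATCH --job-name=example_job      # hypothetical name
    #SBATCH --time=168:00:00            # maximum allowed walltime (7 days)
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8           # well under the 1040-core per-user cap
    #SBATCH --mem=16G
    srun ./my_program                   # hypothetical executable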

Buy-in program

Faculty can purchase nodes via our buy-in program. The program guarantees that jobs submitted under a buy-in account will start running on the group's buy-in nodes within 4 hours. However, this guarantee may not be met if the requested resources are occupied or reserved by other jobs from the same buy-in group.

Policy summary

  • Jobs that run under 4 hours can run on the largest set of nodes (community, specialized-hardware, and buy-in nodes; see below for details).
  • Jobs that request more resources (processors or RAM) have priority over smaller jobs because they are more difficult to schedule.
  • Jobs accrue priority based on how long they have been queued.
  • The scheduler will attempt to balance usage among all users. (See Fairshare Policy below.)
  • It is against our fair use policy to artificially increase the priority of a job in the queue (e.g. by requesting resources that will not be used). Jobs found to be manipulating the scheduler will be canceled, and users who continue to attempt this will be suspended.

More about queue time

This section gives a brief overview of the factors that affect how long your job sits in the SLURM queue. For more information, see the page on how jobs are scheduled by SLURM as well as other pages under "Understanding the Scheduler."

Fairshare

As jobs wait in the queue, they accrue priority to run. Another factor that contributes to a job's priority value is Fairshare. The scheduler attempts to ensure fair resource utilization across all HPCC users by adjusting the priorities of users who have recently used HPCC resources. Under this policy, if your jobs have recently consumed many resources, your pending jobs may wait longer than they otherwise would. Users can find the Fairshare contribution to a job's priority by running "sprio -u $USER":

[UserID@dev-intel18 UserID]$ sprio -u $USER
          JOBID PARTITION     USER   PRIORITY       SITE        AGE  FAIRSHARE        QOS                 TRES
       53381467 general-l   UserID      49432          0          0      49318          0       cpu=100,mem=15
       53381467 general-s   UserID      49432          0          0      49318          0       cpu=100,mem=15

The Fairshare contribution appears in the FAIRSHARE column; its value ranges from 60,000 (highest priority contribution) down to 0 (lowest priority contribution). The more resources your jobs have used recently, the lower your Fairshare value becomes, resulting in lower overall priority for your jobs. For the other contributions in the sprio output, please check Job Priority Factors.
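
To inspect the underlying fairshare accounting directly, you can also query SLURM's standard sshare utility (the exact columns shown depend on your site's configuration):

    $ sshare -l -u $USER   # long format; the FairShare column shows your current fairshare factor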

Shorter jobs can run on more nodes

Jobs that request a total running (wall-clock) time of four hours or less can run on any available buy-in and specialized nodes. Because they can access more nodes, they are likely to start sooner than jobs that must wait for general-long partition nodes.
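
When your work allows it, requesting four hours or less takes advantage of this. A minimal sketch (the script name is hypothetical):

    $ sbatch --time=04:00:00 job.sb   # at or under 4 hours, eligible for buy-in and specialized nodes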

Bigger jobs are prioritized & small jobs are backfilled

The scheduler attempts to gather resources for large jobs and then backfill smaller jobs around them. The size of the job is determined by the number of CPUs and amount of memory requested.

The scheduler packs small jobs together so that more resources can be gathered for multi-core jobs. Resource requests are monitored; abusive requests may violate MSU policy.
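
Requesting only what your job actually needs, with an accurate walltime, makes it easier for the scheduler to backfill the job into gaps around larger jobs. A minimal sketch (the sizes and script name are hypothetical):

    $ sbatch --time=02:00:00 --ntasks=1 --mem=4G small_job.sb   # modest, accurate requests backfill sooner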