Writing and submitting job scripts
The HPCC uses the SLURM system to manage computing resources. Users access these resources by submitting batch jobs.
This tutorial will walk you through the process of writing and submitting a job submission script for a parallel job that uses multiple cores across several nodes.
Clone and compile the example we're going to use:
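The clone-and-compile commands weren't preserved in this copy; the general pattern looks like the following sketch, where the repository URL, module name, and file names are placeholders rather than the actual values:

```shell
# Clone the examples repository (URL is a placeholder; use the one from the course materials)
git clone https://github.com/example/parallel-examples.git
cd parallel-examples

# Load a toolchain with MPI and OpenMP support (module names vary by system)
module load OpenMPI

# Compile the hybrid MPI+OpenMP example
mpicxx -fopenmp -o hybrid hybrid.cpp
```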
This directory contains example C++ codes using several forms of parallelism. These examples may be useful if you develop your own software; interested users should read the accompanying README.
For now, we'll just use the hybrid example, which combines MPI and OpenMP. MPI allows multiple processes to communicate with each other, while OpenMP allows multiple CPUs to "collaborate" on the same process.
We would like to run 4 processes, each on their own node, with 2 CPUs per process. That means we'll need a total of 8 CPUs.
Writing a job script
A job script is a plain text file. It's composed of two main parts:
- The resource request
- The commands for running the job
Using nano or your preferred text editor, create and open a new file for the job script. The hybrid example likely uses a more complex set of resource requests than you will need for your own jobs, but it is useful for illustrative purposes.
Recall from the previous section that we'd like to run hybrid over 4 processes with 2 CPUs per process. Each process will also run on its own node. This outlines the resources we want to request.
Let's type up the first part of the job script, the resource request.
The very first line specifies the interpreter we want to use for our commands; in this case, it's the bash shell.
Then, each resource request line begins with #SBATCH. All resources must be requested at the top of the file, before any commands, or they will be ignored.
The request lines are as follows:
- Wall clock limit - how long will the job run? This job will run for 10 minutes.
- The number of nodes; here, 4
- The number of tasks, also known as processes, running on each node. Here we want 1.
- The number of CPUs per task. The default is one, but we've requested 2.
- The amount of memory to use per CPU. We are requesting 1 GB each.
- The name of the job, so we can easily identify it later.
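Putting the requests above together, the resource-request section of the script looks roughly like this (the options are the standard SLURM spellings; the job name is a placeholder):

```shell
#!/bin/bash
#SBATCH --time=00:10:00           # wall clock limit: 10 minutes
#SBATCH --nodes=4                 # 4 nodes
#SBATCH --ntasks-per-node=1       # 1 task (process) per node
#SBATCH --cpus-per-task=2         # 2 CPUs per task
#SBATCH --mem-per-cpu=1G          # 1 GB of memory per CPU
#SBATCH --job-name=hybrid_example # name used to identify the job later
```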
The resource request is just one part of writing a job script. The second part is running the job itself.
To run our job we need to:
- Load required modules
- Change to the appropriate directory
- Launch the executable with srun
We'll add the following lines to the job script:
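The command section might look like the following sketch; the module name and directory are placeholders for whatever your build actually used:

```shell
# Load the same modules the code was compiled with (names vary by system)
module purge
module load OpenMPI

# Change to the directory containing the executable (placeholder path)
cd ~/parallel-examples

# Tell OpenMP to use the 2 CPUs allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch the 4 MPI processes across the 4 allocated nodes
srun ./hybrid
```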
Notice that we use the srun command to run our hybrid executable. This command prepares the parallel runtime environment, setting up the requested 4 processes across 4 nodes and their associated CPUs.
You may already be familiar with commands like mpirun or mpiexec. While srun is similar to these commands, it is preferred for use on the HPCC because of its integration with the SLURM scheduler.
We can also add a couple of optional commands that will save data about our job:
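The exact commands aren't preserved in this copy, but two commonly used options are scontrol and sacct, run at the end of the script; a sketch:

```shell
# Record the job's full resource allocation in the output log
scontrol show job $SLURM_JOB_ID

# Record accounting statistics (elapsed time, memory use) for the job
sacct -j $SLURM_JOB_ID --format=JobID,JobName,Elapsed,MaxRSS
```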
You have now completed your job script.
If you used nano to write it, hit Ctrl+X followed by Y to save, then press Enter to accept the filename.
As noted previously, your job will most likely use much simpler resource specifications than those shown above. You can see our example job scripts for more ideas.
By default, SLURM will fall back to its own default settings for any options that aren't specified in the job script.
Batch job submission
Now that we have our job script, we need to submit it to the SLURM scheduler. For this, we use the sbatch command.
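Assuming the script was saved as hybrid.sb (a placeholder filename), submission looks like:

```shell
sbatch hybrid.sb
# SLURM responds with a line like: Submitted batch job <jobid>
```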
If the job has been submitted successfully, the job controller will print a job ID to the screen. This ID can be used with, for example, scancel to cancel the job or sacct to look up stats about the job after it ends.
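For example, with a hypothetical job ID of 8929703:

```shell
scancel 8929703   # cancel the job
sacct -j 8929703  # look up accounting stats after the job ends
```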
The sbatch command only runs on development and compute nodes; it will not work on any gateway node.
Checking our job status
Once the job has been submitted, we can see it in the queue with the squeue command.
This will show us the following information:
- The job's ID number
- The job's name, which we specified in the script
- The job's submitting user (should be your username)
- The job's state (pending, running, or completed)
- The job's current walltime
- The job's allowed walltime
- The number of nodes requested and/or allocated to the job
- The reason why the job has the status it has
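For example, squeue with the -u flag limits the listing to your own jobs:

```shell
# Show only your own queued and running jobs
squeue -u $USER

# Or check a single job by its ID (placeholder ID)
squeue -j 8929703
```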
Viewing job outputs
Every SLURM job creates a file that contains the standard output and standard error from the job.
The default name is slurm-<jobid>.out, where <jobid> is the job ID assigned when the job was submitted.
Find the output log from your job and view it with less <filename>. You should see several lines printing the thread and process information for each CPU involved.
The SLURM log files are essential for verifying whether your job ran successfully and, if it did not, for finding out why it failed.