Using DMTCP to checkpoint your batch jobs
This tutorial will focus on using the tool DMTCP to checkpoint your code in batch jobs.
For an overview of why you might want to checkpoint and alternative approaches, please see our Checkpointing Overview.
If you haven't already, please review the tutorial on Checkpointing with DMTCP, as this tutorial builds directly on it. The checkpointing workflow employed in that tutorial illustrates the basic concepts well, but is not incredibly useful on its own. Most of the time, the long-running work you do on the HPCC happens on compute nodes in Slurm batch scripts. This tutorial will convert that workflow into a script that we'll test on the development node, and eventually turn into a Slurm batch script.
If you want to skip to the final product, see our Template for general purpose checkpointing scripts
Learning objectives
After this tutorial, you should:
- Be able to checkpoint a sample Python script submitted to Slurm
- Know where checkpoint files are saved, and be able to change the location
- Have the tools to checkpoint your own, more complex code
Sample script
We will use the sample script from the previous tutorial. If you haven't already, please work through that section before continuing.
Running code in a script
We'll start by creating a basic script that repeats the steps we've already used. Copy the following code into a file called checkpoint_script.sh:
#!/bin/bash --login
# Load modules
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load DMTCP/4.0.0
# Launch the code with DMTCP
dmtcp_launch ./count.py 60 > run.out &
wait
Note that in the previous tutorial, we were using whatever version of Python was loaded by default on the development nodes. To better reproduce our work later, we are purging all modules and explicitly loading one version of Python. We are also loading DMTCP as we saw before.
Additionally, we add the wait line so that the script doesn't finish (and end the job) while our code is running in the background. We could also achieve this by removing the &, but running the code in the background will be important when we add signal trapping later.
Check that your code runs by making it executable, and running it:
chmod u+x ./checkpoint_script.sh
./checkpoint_script.sh
It will run for 60 seconds before printing all output to run.out.
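Once it's done, you can confirm that the output landed in run.out rather than your terminal:
cat run.out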
Automatically checkpointing
As is, this code can still be checkpointed using dmtcp_command --checkpoint, but that requires us to manually intervene whenever we want a checkpoint. This can be automated by adding an interval to dmtcp_launch. Before this, clear out any old checkpoints, restart scripts, and output by running:
rm ckpt* dmtcp* *.out
Update your checkpoint script by adding the argument --interval 15 to dmtcp_launch:
#!/bin/bash --login
# Load modules
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load DMTCP/4.0.0
# Launch the code with DMTCP
dmtcp_launch --interval 15 ./count.py 60 > run.out &
wait
Run the code with
./checkpoint_script.sh
After it finishes, you will see one checkpoint that DMTCP has updated every 15 seconds. Restart with
dmtcp_restart ckpt*.dmtcp > run.out
and it will complete after 15 seconds.
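If you're curious what DMTCP left behind, you can list the checkpoint image and the restart script it generates (the exact file names vary with the program and host):
ls -lh ckpt_*.dmtcp dmtcp_restart_script*.sh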
Making additional checkpoints after restarting
In this small example, the code will finish after it's restarted. But if you want to continue making checkpoints after restarting (see below for an example), you can add the --interval option to dmtcp_restart:
dmtcp_restart --interval 15 ckpt*.dmtcp > run.out
Simulating a kill/restart cycle
We will now handle the situation where your code gets killed (e.g., you reached your requested time limit in a Slurm job, or your job was preempted in the Scavenger Queue), and you use a checkpoint to restart it.
Remove all DMTCP files and your output file:
rm ckpt* dmtcp* *.out
We'll modify the script so that if there's a restart file, we use that. Otherwise, we start the code from scratch:
#!/bin/bash --login
# Load modules
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load DMTCP/4.0.0
if [ -f dmtcp_restart_script.sh ]; then
# Restart a previous checkpoint if it exists and continue checkpointing
dmtcp_restart --interval 15 ckpt*.dmtcp > run.out &
else
# Launch the code with DMTCP
dmtcp_launch --interval 15 ./count.py 60 > run.out &
fi
wait
Then start the script in the background:
./checkpoint_script.sh &
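While it runs, you can check whether a checkpoint has been written yet by looking for the checkpoint files (they only appear after the first 15-second interval has passed):
ls -l ckpt_*.dmtcp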
After at least one checkpoint has been created (i.e., after at least 15 seconds), kill the running code with
dmtcp_command --kill
You can then restart from that checkpoint by running the same script again:
./checkpoint_script.sh
You can continue to kill and restart as many times as you like until the code finishes.
Creating a re-submittable Slurm script
Now, we will add a resource request that only gives us enough time to complete one checkpoint. Since the shortest time limit Slurm accepts is one minute, we increase the counting time to 90 seconds so that the code cannot finish within a single job.
We will also add some echo statements, so we can see what's happening in the Slurm output file.
#!/bin/bash --login
#SBATCH --time=00:01:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB
#SBATCH --constraint=amd20
#SBATCH --job-name=checkpoint-job
# Load modules
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load DMTCP/4.0.0
echo "Script has started"
if [ -f dmtcp_restart_script.sh ]; then
# Restart a previous checkpoint if it exists and continue checkpointing
echo "Restarting from a checkpoint"
dmtcp_restart --interval 15 ckpt*.dmtcp > run.out &
else
# Launch the code with DMTCP
echo "Starting from scratch"
dmtcp_launch --interval 15 ./count.py 90 > run.out &
fi
wait
echo "Script has finished"
Remove all checkpoint and output files, then submit the script to Slurm:
rm ckpt* dmtcp* *.out
sbatch checkpoint_script.sh
Your job will be cancelled before the code can complete. You can check that run.out is empty, and that your Slurm output file contains something like:
Script has started
Starting from scratch
slurmstepd: error: *** JOB ######## ON hostname CANCELLED AT 2025-12-31T23:59:59 DUE TO TIME LIMIT ***
Resubmit your script, and it should complete after starting from a checkpoint:
sbatch checkpoint_script.sh
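Once the second job finishes, you can verify the whole sequence from the output files. The echo lines come from our script, so the second job's Slurm output should end with "Restarting from a checkpoint" followed by "Script has finished", and run.out should contain the full count:
cat slurm-*.out
cat run.out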
Getting the script to resubmit itself
Now we have a script that will eventually complete, as long as we resubmit it. The final step is getting it to resubmit itself when it reaches its time limit. The mechanism to do this is called "signal trapping."
Edit your script to appear as follows:
#!/bin/bash --login
#SBATCH --time=00:04:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB
#SBATCH --constraint=amd20
#SBATCH --job-name=checkpoint-job
#SBATCH --dependency=singleton
#SBATCH --signal=B:USR1@30
# Load modules
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load DMTCP/4.0.0
# Set up a signal handler to checkpoint, finish the script, and restart with 30
# seconds left
trap "echo stopping job; dmtcp_command --kcheckpoint; sbatch checkpoint_script.sh; exit 0" USR1
echo "Script has started"
if [ -f dmtcp_restart_script.sh ]; then
# Restart a previous checkpoint if it exists and continue checkpointing
echo "Restarting from a checkpoint"
dmtcp_restart --interval 60 ckpt_*.dmtcp > run.out &
else
# Launch the code with DMTCP
echo "Starting from scratch"
dmtcp_launch --interval 60 ./count.py 300 > run.out &
fi
wait
echo "Script has finished"
The key changes here are:
- Using #SBATCH --signal=B:USR1@30 to send a signal (USR1) to the batch script (B) 30 seconds from the end of the job (@30)
- "Trapping" that signal (trap ... USR1), so that when the batch script receives it, it runs a piece of code (print a message, checkpoint and kill the running process, resubmit the batch script, and exit)
- Increasing the number to count to (up to 300, so the script will take five minutes) and the checkpointing interval; this lets us see the full effects of signal trapping better
Note that since the requested time is only four minutes, it will be impossible for the script to count up to 300. With 30 seconds remaining (give or take, since Slurm can send the signal up to one minute early), the code will get checkpointed, stopped, and the job will resubmit itself. The code should finish running in two jobs, but may take more depending on when exactly the signals get sent.
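If signal trapping is new to you, the short sketch below (a hypothetical trap_demo.sh, with sleep standing in for your DMTCP-launched code) shows the same pattern outside of Slurm. Because the work runs in the background and the script sits at wait, the trap can fire as soon as the signal arrives; this is also why we kept the & and wait in the batch script.
#!/bin/bash
# When USR1 arrives, run this handler: report and exit cleanly instead of being killed.
trap "echo 'USR1 received: this is where we would checkpoint and resubmit'; exit 0" USR1
# The "real work" runs in the background so the script stays free to handle signals.
sleep 300 &
wait
echo "Work finished without interruption"
Start it with ./trap_demo.sh &, then send the signal by hand with kill -USR1 %1; the handler's message should appear right away instead of after five minutes.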
Why keep checkpointing if it happens at the end of a job?
Our code still checkpoints every minute using the --interval 60 flag. This seems unnecessary, since the job writes a checkpoint when it reaches its time limit anyway.
It's still useful to keep this for two reasons:
- It makes our script more robust. If an unexpected problem shuts down the code, we won't have lost our progress, even if the end-of-job checkpoint never happens.
- It allows for the script to be killed before hitting the time limit without losing any progress. This is especially important for jobs submitted to the Scavenger Queue.
Clear your previous checkpoints and output, and try submitting the script:
rm ckpt* dmtcp* *.out
sbatch checkpoint_script.sh
While your code is running, it is useful to check the state of your output files:
tail -f *.out
This will eventually show you when your various Slurm jobs complete and when the full output is written to run.out.
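You can also follow the chain of jobs from the scheduler's side, for example with sacct (the name matches the --job-name we set in the script):
sacct --name=checkpoint-job --format=JobID,JobName,State,Elapsed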
Conclusion
Let's review. The script we've created can:
- Take a very long job, and dynamically separate it into many smaller jobs.
- Allow you to start a job without knowing how long it will take. For example, you can set the time limit to less than four hours to give you access to the largest number of nodes and decrease your wait time. Your job will continue running in four-hour increments until it finishes.
- Take advantage of the Scavenger Queue with one small change.
See the bonus sections below for more tips and tricks, and reference our Template for a general purpose checkpointing script when you want to implement checkpointing in the future.
Bonus: Submitting the job to the scavenger queue
We now have a batch script for code that will regularly checkpoint and resubmit itself. With one small change, we can take advantage of the HPCC's Scavenger Queue. The Scavenger Queue has a few significant benefits:
- You are not limited in the number of jobs or cores you request on the Scavenger Queue
- Your jobs do not count against your CPU and GPU limits
- Your jobs use HPCC resources that are otherwise left idle
However, this comes at the cost of your jobs possibly being preempted, or interrupted, whenever the resources your job is using are needed by a job sent to the main scheduler. By default, jobs in the scavenger queue will resubmit themselves after they are interrupted.
To use the scavenger queue, just add the line #SBATCH --qos=scavenger to your batch script:
#!/bin/bash --login
#SBATCH --time=00:04:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB
#SBATCH --constraint=amd20
#SBATCH --job-name=checkpoint-job
#SBATCH --dependency=singleton
#SBATCH --signal=B:USR1@30
#SBATCH --qos=scavenger
# Load modules
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load DMTCP/4.0.0
# Set up a signal handler to checkpoint, finish the script, and restart with 30
# seconds left
trap "echo stopping job; dmtcp_command --kcheckpoint; sbatch checkpoint_script.sh; exit 0" USR1
echo "Script has started"
if [ -f dmtcp_restart_script.sh ]; then
# Restart a previous checkpoint if it exists
echo "Restarting from a checkpoint"
dmtcp_restart ckpt_*.dmtcp > run.out &
else
# Launch the code with DMTCP
echo "Starting from scratch"
dmtcp_launch --interval 60 ./count.py 300 > run.out &
fi
wait
echo "Script has finished"
Bonus: Saving checkpoint files to your scratch space
The checkpoints that DMTCP creates hold the entire state of your program at the moment they are taken. This means they can be extremely large, especially if your program uses a large amount of memory. An ideal location for checkpoint files is your scratch space, since scratch is large (50TB) and temporary (files will be purged if they haven't been changed in 45 days).
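You can get a feel for how large your checkpoint images are with, for example:
du -csh ckpt_*.dmtcp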
To save your checkpoint files to a different location, you can set the environment variable DMTCP_CHECKPOINT_DIR before running any DMTCP commands. In the example below, we create a new directory in scratch corresponding to your Slurm job name and save the files there.
#!/bin/bash --login
#SBATCH --time=00:04:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1GB
#SBATCH --constraint=amd20
#SBATCH --job-name=checkpoint-job
#SBATCH --dependency=singleton
#SBATCH --signal=B:USR1@30
#SBATCH --qos=scavenger
# Load modules
module purge
module load Python/3.11.3-GCCcore-12.3.0
module load DMTCP/4.0.0
# Set location for checkpoint files
export DMTCP_CHECKPOINT_DIR="${SCRATCH}/${SLURM_JOB_NAME}_checkpoint"
mkdir -p "${DMTCP_CHECKPOINT_DIR}"
# Set up a signal handler to checkpoint, finish the script, and restart with 30
# seconds left
trap "echo stopping job; dmtcp_command --kcheckpoint; sbatch checkpoint_script.sh; exit 0" USR1
echo "Script has started"
if [ -f dmtcp_restart_script.sh ]; then
# Restart a previous checkpoint if it exists
echo "Restarting from a checkpoint"
dmtcp_restart "${DMTCP_CHECKPOINT_DIR}/ckpt_*.dmtcp" > run.out &
else
# Launch the code with DMTCP
echo "Starting from scratch"
dmtcp_launch --interval 60 ./count.py 300 > run.out &
fi
wait
echo "Script has finished"
Bonus: Taking care when writing to files
DMTCP is smart about how it handles open files when the program is being checkpointed. In particular, it will remember where in the file it was and resume writing at the same location next time. Sometimes, it needs to save the entire state of the file in its checkpoint to do so (so make sure to be careful about where you save your checkpoint files).
However, for files that are opened and closed repeatedly, restarting from a checkpoint can result in duplicated output. Let's look at an example.
First, revise the count.py script as follows:
#!/usr/bin/env python3
import argparse
import time
def main(args):
    filename = "run.out"
    with open(filename, "w") as f:
        print("=== Starting counting ===", file=f)
        for i in range(args.num):
            time.sleep(1)
            print(f"Count is {i + 1}", file=f)
        print("=== Finished counting ===", file=f)

if __name__ == "__main__":
    USAGE = "Counts up to a given number"
    parser = argparse.ArgumentParser(description=USAGE)
    parser.add_argument("num", type=int, help="Number to count to")
    args = parser.parse_args()
    main(args)
This will open the file run.out at the start of counting, write every line, and only close the file once counting finishes.
Run, wait, checkpoint, and kill the code as before:
dmtcp_launch ./count.py 15 &
# Wait a few seconds
dmtcp_command --checkpoint
# Wait a few seconds
dmtcp_command --kill
The file run.out will exist but will be empty. When DMTCP restarts, it will restore what had already been written and continue from there:
dmtcp_restart ckpt*.dmtcp
After a few seconds, run.out will appear with:
=== Starting counting ===
Count is 1
Count is 2
Count is 3
Count is 4
Count is 5
Count is 6
Count is 7
Count is 8
Count is 9
Count is 10
Count is 11
Count is 12
Count is 13
Count is 14
Count is 15
=== Finished counting ===
Now, consider the case where the file is opened and closed multiple times:
#!/usr/bin/env python3
import argparse
import time
def main(args):
    filename = "run.out"
    with open(filename, "w") as f:
        print("=== Starting counting ===", file=f)
    for i in range(args.num):
        time.sleep(1)
        with open(filename, "a") as f:
            print(f"Count is {i + 1}", file=f)
    with open(filename, "a") as f:
        print("=== Finished counting ===", file=f)

if __name__ == "__main__":
    USAGE = "Counts up to a given number"
    parser = argparse.ArgumentParser(description=USAGE)
    parser.add_argument("num", type=int, help="Number to count to")
    args = parser.parse_args()
    main(args)
Follow the same steps as before:
dmtcp_launch ./count.py 15 &
# Wait a few seconds
dmtcp_command --checkpoint
# Wait a few seconds
dmtcp_command --kill
Now, we see everything in run.out up to the point where the code was killed:
=== Starting counting ===
Count is 1
Count is 2
Count is 3
Count is 4
Count is 5
Count is 6
Count is 7
Count is 8
Count is 9
Count is 10
Count is 11
Count is 12
What happens when we restart?
dmtcp_restart ckpt*.dmtcp
Checking the output of run.out, we see:
=== Starting counting ===
Count is 1
Count is 2
Count is 3
Count is 4
Count is 5
Count is 6
Count is 7
Count is 8
Count is 9
Count is 10
Count is 11
Count is 12
Count is 9
Count is 10
Count is 11
Count is 12
Count is 13
Count is 14
Count is 15
=== Finished counting ===
The restarted code resumed from the checkpoint, even though the original run had written more output after that checkpoint before it was killed. Thus, we see some counts duplicated.
Whether you will see this in your own output depends on how your code opens and closes files. The best way to tell is to run a short test and check what happens.
One alternative is to always kill your code immediately after checkpointing. This can be accomplished with:
dmtcp_command --kcheckpoint
This ensures that the code never writes any output past its last checkpoint, so the next restart continues exactly where the output left off.
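For example, in the append-mode test above, the separate checkpoint and kill steps can be replaced by the single combined command, after which a restart should produce no duplicated counts:
dmtcp_launch ./count.py 15 &
# Wait a few seconds
dmtcp_command --kcheckpoint
# Nothing ran past the checkpoint, so the restart picks up exactly where run.out left off
dmtcp_restart ckpt_*.dmtcp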