Checkpoint your programs with DMTCP
This tutorial focuses on using DMTCP to checkpoint your code. While not perfect, DMTCP is one of the most capable checkpointing tools available, and it handles many of the types of workflows run on the HPCC.
For an overview of why you might want to checkpoint and alternative approaches, please see our Checkpointing Overview.
After you complete this tutorial, implement these techniques in your batch jobs using the Checkpointing with DMTCP in batch jobs tutorial.
If you want to skip to the final product, see our Template for general purpose checkpointing scripts.
Learning objectives
After this tutorial, you should:
- Understand what checkpointing is
- Be able to checkpoint a sample Python script running interactively
- Have the tools to checkpoint your own, more complex code
Sample script
Let's start with a sample Python script. This script will be short and should be mostly readable even if you do not use Python. The goal of this script is to mimic a long-running program that you would actually be interested in checkpointing.
First, log into a development node on the HPCC, and create a new directory to save your code:
cd ~
mkdir checkpoint
cd checkpoint
Now, save the following code to a file called count.py:
#!/usr/bin/env python3
import argparse
import time


def main(args):
    print("=== Starting counting ===")
    # Count from 1 to args.num, pausing one second before each count
    for i in range(args.num):
        time.sleep(1)
        print(f"Count is {i + 1}")
    print("=== Finished counting ===")


if __name__ == "__main__":
    USAGE = "Counts up to a given number"
    parser = argparse.ArgumentParser(description=USAGE)
    parser.add_argument("num", type=int, help="Number to count to")
    args = parser.parse_args()
    main(args)
Make this script executable by running
chmod u+x count.py
When you run this code, you give it a positive number (called num in the argument parser). The code will count up to this number, waiting one second before each count. Give it a try:
./count.py 5
=== Starting counting ===
Count is 1
Count is 2
Count is 3
Count is 4
Count is 5
=== Finished counting ===
This should take five seconds.
To redirect the output of this script to a file, run
./count.py 5 > run.out
Running cat run.out, you should see the same output as above.
Though it is simple, this sample workflow of "run a program, create some output, save it to a file" mimics most of the work that happens on the HPCC.
Load the DMTCP module
We'll start by making this code run for a long time on a development node, and then checkpoint it. To run the commands in this section, you will need to have the DMTCP module loaded:
module load DMTCP/4.0.0
Launch the code with DMTCP
We will use DMTCP to "launch" the code. This will start a coordinator in the background that can keep track of your code while it's running, and eventually make checkpoints and kill the code.
To give us enough time to see the effects, we will launch the code with DMTCP (dmtcp_launch), running for one minute (./count.py 60 > run.out) in the background (&):
dmtcp_launch ./count.py 60 > run.out &
Where's my output?
If you take a look at run.out while your code is running (e.g., by using tail -f run.out to "follow the tail" of the file, or see new lines at the bottom as it updates), you will not see anything while the code runs. When it finishes, you will see everything display at once.
Python "buffers" output, which means it collects what should be written and writes it out in larger chunks. When standard output is redirected to a file, Python holds the output until its buffer fills or the program exits, so a short program like this one writes everything at once at the end. This is different from the behavior when standard output is a terminal, where each line appears as soon as it is printed.
You can make Python write output with "unbuffered output" by calling python -u count.py ..., but this can have some unintended consequences. We will talk more about how DMTCP handles checkpointing with files in the next tutorial.
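As an alternative to python -u, the script itself can flush after each line. The following is a minimal sketch of that idea, not a change this tutorial requires: flush=True is a standard argument of Python's built-in print, while the function name count and its delay and stream parameters are illustrative and do not appear in count.py. The same caution about unintended consequences applies, since flushed output interacts with checkpoints differently than buffered output.

```python
import io
import sys
import time


def count(num, delay=1.0, stream=None):
    """Count to num, flushing the stream after every line.

    `delay` and `stream` are illustrative parameters (not in count.py).
    """
    if stream is None:
        stream = sys.stdout
    print("=== Starting counting ===", file=stream, flush=True)
    for i in range(num):
        time.sleep(delay)
        # flush=True pushes this line out immediately, even when the
        # stream is a file rather than a terminal
        print(f"Count is {i + 1}", file=stream, flush=True)
    print("=== Finished counting ===", file=stream, flush=True)


if __name__ == "__main__":
    count(3, delay=0.1)
```

With this version, tail -f on a redirected output file should show each count as it happens rather than all at once at the end.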
Making a checkpoint
While the previous code is running, we can make a checkpoint using dmtcp_command. Try running this command after approximately 30 seconds:
dmtcp_command --checkpoint
Now wait until the original code is finished. To check, hit the enter key at the command line after about one minute, and eventually, you will see a notification that your code is done:
[1]+ Done dmtcp_launch ./count.py 60 > run.out
Let's see what's in our directory:
ls
ckpt_python3.11_7ccc3c471e99a49a-40000-9fe9e4c0fa783.dmtcp
count.py
dmtcp_restart_script_7ccc3c471e99a49a-40000-9fe9e4a62f3ac.sh
dmtcp_restart_script.sh
run.out
We see our output file, run.out (run cat run.out to verify that it finished all 60 counting steps), and three files created by DMTCP. One is a checkpoint file, ckpt_python3.11_..., that saves the state of the program at the moment we ran dmtcp_command --checkpoint. We also have a script, dmtcp_restart_script_7ccc3c..., that will restart the code from where we checkpointed. Finally, dmtcp_restart_script.sh is a shortcut to that restart script, and it will update to point to the latest checkpoint if you create more than one.
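The listing above shows a single checkpoint file, but a directory can accumulate several over repeated runs. The sketch below picks the most recently modified one. It assumes checkpoint files follow the ckpt_*.dmtcp naming seen in the listing and that modification time is a reasonable stand-in for "latest"; latest_checkpoint is a hypothetical helper, not part of DMTCP, and the demonstration uses empty placeholder files rather than real checkpoints.

```python
import os
import tempfile
from pathlib import Path


def latest_checkpoint(directory="."):
    """Return the most recently modified ckpt_*.dmtcp file, or None.

    Hypothetical helper: assumes the ckpt_*.dmtcp naming shown in the
    directory listing and uses file modification time to pick the newest.
    """
    candidates = list(Path(directory).glob("ckpt_*.dmtcp"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    # Demonstrate on empty placeholder files, not real checkpoints.
    with tempfile.TemporaryDirectory() as tmp:
        for age, name in enumerate(["ckpt_old.dmtcp", "ckpt_new.dmtcp"]):
            path = Path(tmp) / name
            path.touch()
            # Set explicit modification times so the ordering is unambiguous
            os.utime(path, (1000 + age, 1000 + age))
        print(latest_checkpoint(tmp).name)
```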
Restarting checkpointed code
Let's restart the code from the checkpoint we took. For reasons that will become clear later, it is better to restart directly from the checkpoint file than to use the restart script:
dmtcp_restart ckpt*.dmtcp > run.out
Importantly, we need to redirect the output of this command to run.out as well. DMTCP doesn't keep a record of where the output was being redirected; it's more helpful to think of our original command as:
(dmtcp_launch ./count.py 60) > run.out
rather than
dmtcp_launch (./count.py 60 > run.out)
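The reason for this grouping is that the redirection is set up by whoever starts the command (normally the shell), and the launched program never sees it among its arguments. The sketch below illustrates that general behavior using Python's standard subprocess module in place of the shell; it does not involve DMTCP, and run_redirected is a hypothetical helper written for this illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_redirected(argv, out_path):
    """Run argv with stdout sent to out_path, as a shell does for `cmd > file`.

    Hypothetical helper for illustration; it is not part of DMTCP.
    """
    with open(out_path, "w") as out:
        # The parent opens the file and hands it to the child as stdout.
        # Nothing about the redirection is added to the child's arguments.
        subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "run.out"
        # The child prints the arguments it actually received.
        child = [sys.executable, "-c", "import sys; print(sys.argv[1:])", "60"]
        run_redirected(child, out_path)
        print(out_path.read_text().strip())
```

The child reports only its own argument, 60, with no trace of the output file. A restarted program is in the same position, which is why the redirection has to be supplied again on the dmtcp_restart command line.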
I can't type anymore!
DMTCP likes to take control of your terminal and not always give it back. After dmtcp_restart finishes, you may not be able to see what you're typing. However, it is still getting input! Try typing
reset
to reset your terminal. If you get totally stuck, just close the terminal and log back in.
After letting dmtcp_restart run for about 30 seconds (or however long was left when you checkpointed), you should see it finish. Run
cat run.out
and you will see all 60 counting steps, not just the latter half that ran after the restart.
Notice that your checkpoint and restart scripts stay behind. You could continue to rerun this code from its checkpoint to your heart's content.
Now that you know the basic DMTCP workflow, move on to implementing checkpointing in your batch scripts.