Template for a general-purpose checkpointing script
Below is a template for checkpointing code using DMTCP. For details on the specific steps and a walkthrough for creating your own script, please see our tutorials Checkpointing with DMTCP and Checkpointing with DMTCP in batch jobs.
Each value that you need to decide for your code is wrapped in angle brackets and has a comment labeled # REPLACE that gives an example of what should replace that value.
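Once the template below is filled in, it can help to confirm that no placeholder was missed before submitting. The following is a minimal sketch, not part of the template: `find_placeholders` is a hypothetical helper name, and the pattern assumes placeholders consist only of lowercase letters and hyphens inside angle brackets, as they do in this template.

```bash
#!/bin/bash
# Hypothetical helper: print every line (with its line number) that still
# contains an <angle-bracket> placeholder.
find_placeholders() {
    grep -nE '<[a-z-]+>' "$1"
}

# Demonstration on a throwaway file with one filled-in and one unfilled value
demo=$(mktemp)
printf '%s\n' '#SBATCH --time=01:10:30' '#SBATCH --mem=<mem>' > "$demo"
leftover=$(find_placeholders "$demo")
echo "$leftover"
rm -f "$demo"
```

Running `find_placeholders checkpoint_script.sh` on the finished script should print nothing if every value has been replaced.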
checkpoint_script.sh
#!/bin/bash --login
#SBATCH --time=<time> # REPLACE: e.g., 01:10:30
#SBATCH --cpus-per-task=<cpus> # REPLACE: e.g., 16
#SBATCH --mem=<mem> # REPLACE: e.g., 16GB
#SBATCH --constraint=<nodetype> # REPLACE: e.g., amd20
#SBATCH --job-name=<name> # REPLACE: e.g., checkpoint-job
#SBATCH --signal=B:USR1@<time> # REPLACE: seconds of warning, e.g., 60 to send USR1 60 seconds before the time limit (not the same value as --time)
#SBATCH --dependency=singleton
#SBATCH --qos=scavenger
# Load modules
module purge
module load <module> # REPLACE: e.g., Python/3.11.3-GCCcore-12.3.0 (add additional modules as necessary)
module load DMTCP/4.0.0
# Set location for checkpoint files
export DMTCP_CHECKPOINT_DIR="${SCRATCH}/${SLURM_JOB_NAME}"
mkdir -p "${DMTCP_CHECKPOINT_DIR}"
# Set up a signal handler that checkpoints, resubmits this script, and exits
# when Slurm sends USR1 shortly before the time limit (see --signal above)
# REPLACE: filename should be the name of this file, so the job resubmits itself
trap "echo stopping job; dmtcp_command --kcheckpoint; sbatch <filename>; exit 0" USR1
echo "Script has started"
if [ -f dmtcp_restart_script.sh ]; then
    # Restart a previous checkpoint if it exists
    echo "Restarting from a checkpoint"
    # REPLACE: If your code redirects standard output to a file, redirect the
    # output of the command below to the same file
    # The glob stays outside the quotes so that the shell expands it
    dmtcp_restart "${DMTCP_CHECKPOINT_DIR}"/ckpt_*.dmtcp &
else
    # Launch the code with DMTCP
    echo "Starting from scratch"
    # REPLACE: interval should be number of seconds between checkpoints and
    # your-code-here should be the command you use to run your code
    dmtcp_launch --interval <interval> <your-code-here> &
fi
wait
echo "Script has finished"
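The template runs the workload in the background and then calls `wait` because bash postpones a trap handler until the current foreground command finishes, whereas the `wait` builtin returns as soon as a trapped signal arrives. The sketch below illustrates this behavior without Slurm or DMTCP. It is only an illustration under stated assumptions: `sleep 30` stands in for the real workload, the script sends USR1 to itself in place of Slurm, and `on_usr1` is a hypothetical handler name.

```bash
#!/bin/bash
handled=no

# Hypothetical handler; the template's handler checkpoints, resubmits, and exits
on_usr1() {
    handled=yes
    echo "caught USR1"
}
trap on_usr1 USR1

# Stand-in for the real workload, started in the background
sleep 30 &
worker=$!

# Stand-in for Slurm: send USR1 to this script after one second
( sleep 1; kill -USR1 $$ ) &

# wait returns early with a status above 128 when the trapped signal arrives
status=0
wait "$worker" || status=$?

# Clean up the stand-in workload
kill "$worker" 2>/dev/null
wait "$worker" 2>/dev/null || true

echo "handled=${handled} wait_status=${status}"
```

If the workload ran in the foreground instead, the handler would not run until the workload had finished, which would be too late to checkpoint before the time limit.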
Some considerations when filling in the values above:
- Checkpointing can take a long time, especially if your code uses a lot of memory. Consider allowing five minutes or more for each checkpoint: for example, signal five minutes before the time limit, and make the interval much longer than five minutes so your code can get work done between checkpoints.
- Watch to make sure your code is making progress between jobs. Otherwise, it could get stuck in a loop, restarting from the same checkpoint and repeating the same steps. If it doesn't make any progress, increase the job's time limit and the checkpoint interval.
- Clear out the checkpoint files in your scratch space after your job has completed.
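The timing advice above can be sanity-checked with a little arithmetic before submitting. The sketch below is illustrative only: `to_seconds` is a hypothetical helper that assumes the time limit is written as HH:MM:SS (Slurm accepts other formats that it does not handle), and the three values are examples rather than recommendations.

```bash
#!/bin/bash
# Convert an HH:MM:SS string to seconds (assumes exactly this format)
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so values such as "08" are not read as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

walltime="01:10:30"   # example value for --time
signal_lead=300       # example value for --signal=B:USR1@...
interval=1200         # example value for dmtcp_launch --interval

total=$(to_seconds "$walltime")
usable=$(( total - signal_lead ))     # time before USR1 is sent
periodic=$(( usable / interval ))     # upper bound on interval checkpoints

echo "total=${total}s usable=${usable}s periodic_checkpoints=${periodic}"
# prints: total=4230s usable=3930s periodic_checkpoints=3
```

The count is an upper bound because the time spent writing each checkpoint also comes out of the usable time. If the count is zero, the job only checkpoints when the signal arrives, so the signal lead time must be long enough for a full checkpoint.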