Checkpointing overview
Checkpointing is an incredibly valuable technique in HPC. It allows you to save the state of a running program, stop it, and restart later from where you left off. This gives you the flexibility to split long-running calculations into multiple jobs, including preemptible scavenger jobs that do not count against your CPU quota. You can also use checkpointing to work around the seven-day walltime limit on a single job.
However, this flexibility comes at a price. Saving and restarting the state of a program can be very complex, especially if it involves multiple cores or nodes. The "checkpoint" files that store your program's state can also take up a lot of storage space.
There are a few different methods for checkpointing, each with its own strengths and weaknesses. If it's available, application-level checkpointing is usually the most robust, while using an external checkpointing tool is usually the most flexible.
Read on to learn if checkpointing can help you and how you can implement it.
Example use-cases for checkpointing
- Problem: Jobs take longer than 7 days.
  Solution: Checkpointing allows you to save your progress near the end of one job and continue in a new job.
- Problem: Long wait times.
  Solution: You can use a time limit of less than 4 hours, which gives you access to more nodes on the HPCC. Checkpoint your job towards the end of those 4 hours, then submit a new job that continues from the last checkpoint, repeating as needed. You can also take advantage of idle resources using the Scavenger Queue.
- Problem: You've run out of CPU hours for the year.
  Solution: You can use checkpointing to take advantage of the Scavenger Queue, which does not count against your yearly CPU limits.
- Problem: Your code has random components and can sometimes fail.
  Solution: Creating checkpoints allows you to restart from shortly before the program failed rather than from the beginning.
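The pattern shared by the use-cases above (work until shortly before the time limit, save, then continue in a new job) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file name state.json, the functions load_state, save_state, and run, and the summing loop that stands in for real work are all hypothetical, and submitting the follow-up job to the scheduler is not shown.

```python
import json
import os
import time

CHECKPOINT = "state.json"  # hypothetical checkpoint file name


def load_state(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(state, path):
    """Write to a temporary file, then rename it over the old checkpoint.

    If the job is killed mid-write, the previous checkpoint stays intact.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(total_steps, budget_seconds, path=CHECKPOINT):
    """Work until done or until the time budget is used up.

    Returns True when all steps are finished, False when a checkpoint
    was saved and another job is needed to continue.
    """
    deadline = time.monotonic() + budget_seconds
    state = load_state(path)  # resumes automatically if a checkpoint exists
    while state["next_step"] < total_steps:
        if time.monotonic() >= deadline:
            save_state(state, path)
            return False
        state["total"] += state["next_step"]  # stand-in for one unit of real work
        state["next_step"] += 1
    save_state(state, path)
    return True
```

In a job with a 4 hour time limit, one might call run with a budget somewhat below that limit (for example 3.5 hours, an arbitrary choice) so that there is time left to write the checkpoint, and submit a new job whenever it returns False.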
How do I checkpoint jobs on the HPCC?
Built-in application-level checkpointing
In general, you should prefer "application-level checkpointing", that is, checkpointing done within the code of the application you are running. Since the program knows what is happening at any given time, it can save only the relevant data at the most appropriate time.
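As a rough illustration of "saving only the relevant data", the Python sketch below stores just three things: the step counter, the running result, and the state of the random number generator. Restoring the generator state means a restarted run draws the same random numbers an uninterrupted run would have drawn. The function names and the toy simulate loop are hypothetical, and pickle is only one of several possible storage formats.

```python
import os
import pickle
import random


def save_checkpoint(path, step, total, rng):
    """Save only what is needed to continue: progress, result, and RNG state."""
    state = {"step": step, "total": total, "rng_state": rng.getstate()}
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # rename so a partial write never replaces a good file


def load_checkpoint(path, rng):
    """Restore the RNG in place and return the saved (step, total)."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    rng.setstate(state["rng_state"])
    return state["step"], state["total"]


def simulate(rng, start, stop, total=0.0):
    """Toy workload: add one random number per step from start up to stop."""
    for _ in range(start, stop):
        total += rng.random()
    return total
```

With these pieces, running steps 0 to 5, saving, loading into a fresh random.Random instance, and running steps 5 to 10 gives the same total as running all 10 steps in one go.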
Some software packages have this built in. In this context, checkpointing is often called "restarting". The means for checkpointing and restarting vary significantly between programs, but here is an incomplete list of external resources that may be useful in getting started:
- ANSYS
- CONVERGE
- DFTB+
- GAMESS
- Gaussian
- GROMACS
- LAMMPS
- Molpro
- NAMD
- NextFlow
- OpenFOAM
- PyTorch
- Snakemake
- TensorFlow
External checkpointing
Many applications do not have checkpointing capabilities built in. For these, you can use external checkpointing software that can pause your program for you. ICER recommends using DMTCP. While not perfect, it is one of the most capable tools available, especially for many of the types of workflows run on the HPCC.
ICER has two methods for interfacing with DMTCP:
- Directly: See our DMTCP tutorial
- Using the longjob powertool
Implement application-level checkpointing in your own code
If you are writing your own code, you can choose which parts of your application to save and how to restart from them later. While ICER currently does not have any tutorials on this, here are some external resources that may be useful: