
Running Jobs with Checkpointing

Checkpointing saves a snapshot of an application's running state so that the application can restart from the saved point if the job fails or reaches its time limit. Some applications already provide this feature for long-running computations. If you develop your own program, we encourage you to implement checkpointing in your code: write one function that saves the result variables to the file system at regular intervals, and another that reads those variables back in on restart.
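As a rough illustration of this save-and-restore pattern, here is a minimal Python sketch. It is not taken from any HPCC tool: the file name checkpoint.json, the function names, the state layout, and the stop_at parameter (used only to simulate an interrupted job) are all hypothetical, and the "computation" is a simple running sum standing in for real work.

```python
import json
import os
import tempfile

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical checkpoint file name


def save_checkpoint(state, path=CHECKPOINT_FILE):
    """Save state atomically: write a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(n_steps, checkpoint_every=100, path=CHECKPOINT_FILE, stop_at=None):
    """Sum 0..n_steps-1, checkpointing every `checkpoint_every` steps.

    `stop_at` is an assumption for demonstration only: it simulates the job
    being killed at that step without saving.
    """
    state = load_checkpoint(path)  # resume from the last saved point, if any
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state  # simulated interruption: nothing is saved here
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final state after a complete run
    return state
```

Calling run(1000, stop_at=250) and then run(1000) again resumes from step 200, the last checkpoint written before the simulated interruption, rather than from step 0. A shorter checkpoint interval loses less work on failure but spends more time writing to the file system.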

However, if the program you use does not and cannot include this feature, consider using "Distributed MultiThreaded CheckPointing" (DMTCP), which is installed on the HPCC nodes. DMTCP is a tool for transparently checkpointing the state of a distributed program spread across many machines, without modifying the user's program or the operating system kernel. For more details about DMTCP, please refer to its web site.

Below are examples of using DMTCP on the HPCC system:

Checkpoint with DMTCP

An example job script that performs checkpointing with DMTCP commands.

Powertools longjob by DMTCP

An introduction to using the longjob powertool for checkpointing on HPCC.