Running a Job with Checkpointing
Checkpointing saves a snapshot of an application's running state so that the application can restart from the saved point if the job fails or reaches its time limit. Some applications already provide this feature for long-running computations. Users who develop their own programs are encouraged to implement checkpointing as part of their code: one function writes the result variables to the file system at regular intervals, and another reads those variables back in on restart.
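As an illustration of the write-at-intervals and read-on-restart pattern described above, here is a minimal Python sketch. It is not taken from any HPCC tool. The file name `checkpoint.json`, the interval of 100 steps, the toy running-sum workload, and the `stop_after` argument (used only to simulate an interrupted job) are all assumptions made for the example.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name
CHECKPOINT_EVERY = 100               # hypothetical interval, in loop steps


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write the state to a temporary file, then rename it into place.

    The rename is atomic, so a job killed mid-write leaves the previous
    checkpoint intact instead of a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(total_steps, path=CHECKPOINT_PATH, every=CHECKPOINT_EVERY, stop_after=None):
    """Toy computation: accumulate the sum 0 + 1 + ... + (total_steps - 1).

    stop_after is a hypothetical hook that simulates the job being killed
    (for example at the time limit) before it finishes.
    """
    state = load_checkpoint(path)  # resume from the last snapshot, if any
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated interruption: nothing is saved here
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final snapshot once the work is done
    return state
```

Calling `run(1000)` a second time after an interrupted first run resumes from the most recent snapshot rather than from step 0. Only the work done since the last checkpoint is repeated, so the interval is a trade-off between I/O overhead and the amount of recomputation after a failure.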
However, if the program you use does not and cannot include this feature, consider using "Distributed MultiThreaded CheckPointing" (DMTCP), which is installed on HPCC nodes. DMTCP is a tool for transparently checkpointing the state of a distributed program spread across many machines, without modifying the user's program or the operating system kernel.
On the HPCC, you can use DMTCP directly or use the longjob powertool to automate this process for you. For more details about DMTCP, please refer to the DMTCP website.