Powertools longjob by DMTCP
The following are instructions for trying out the longjob powertool on the
HPCC system. First, start with a basic submission script. For
example, consider the following simple submission script:
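As a stand-in for a basic submission script, here is a minimal sketch. It is not the original example: the job name, the resource requests, the `work` directory, `input.dat`, and the `python3 myprogram.py` command are all hypothetical. Because the program it runs is only a placeholder, the snippet writes the script to `basic_job.sb` and merely syntax-checks it with `bash -n`.

```shell
# Write an illustrative job script to a file. Every name and value in it is a
# placeholder chosen for this sketch, not a site default.
cat > basic_job.sb <<'EOF'
#!/bin/bash --login
#SBATCH --job-name=myjob        # hypothetical job name
#SBATCH --time=24:00:00         # requested walltime
#SBATCH --ntasks=1
#SBATCH --mem=2G

# Start in the directory the job was submitted from.
cd "$SLURM_SUBMIT_DIR"

# Setup code that only needs to run once: copy inputs to a working directory.
mkdir -p work
cp input.dat myprogram.py work/
cd work

# Main command (hypothetical program).
python3 myprogram.py
EOF

# Parse the script without executing it.
bash -n basic_job.sb && echo "basic_job.sb parses cleanly"
```

On the cluster, a script like this would be submitted with `sbatch basic_job.sb`.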
To get longjob to work, you may need to make the following modifications:
- Change the walltime to less than 4 hours if you would like more nodes to be available to your job.
- Wrap all setup code that only needs to run once in an if statement that checks for the file "Files_Copied". This ensures that the setup code runs only the first time the script is run, because at that point no file named "Files_Copied" should exist.
- Add the longjob command before the command in the submission script that you want to checkpoint.
- Load the powertools module and turn on aliases, i.e. add the following lines of code to the script:

      shopt -s expand_aliases
      module load powertools

- Set the following environment variables as appropriate for your job:
  - JobScript – Name of the job script file that gets resubmitted. The default is the name of the first submitted job script.
  - DMTCP_Checkpoint_Time – Time (in seconds) that DMTCP needs to write a checkpoint. The default is 300 seconds (5 minutes).
  - DMTCP_CHECKPOINT_INTERVAL – Time (in seconds) between automatic checkpoints. The default is 4-8 hours. For a walltime of less than 4 hours, the default is a single checkpoint that starts ${DMTCP_Checkpoint_Time} + 1 minute before the end of the walltime.
  - DMTCP_CHECKPOINT_DIR – Name of the directory in which checkpoint images and log files are saved. The default is ckpt_${SLURM_JOB_NAME}. For a job array, the default is ckpt_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}. If two different jobs run the longjob command from the same directory, make sure the environment variables (or SLURM_JOB_NAME) are set differently for each job so that their image files are not saved in the same directory.
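The defaults described above can be worked through with a little shell arithmetic. This is a sketch of one reading of those defaults, not output from longjob itself: the 3-hour walltime and the fallback job name `myjob` are assumptions made for illustration.

```shell
# Assumed example values; only the variable names come from the list above.
walltime_seconds=$((3 * 3600))      # a 3-hour job, below the 4-hour threshold
DMTCP_Checkpoint_Time=300           # documented default: 5 minutes

# For walltimes under 4 hours, the single checkpoint starts
# DMTCP_Checkpoint_Time + 1 minute before the walltime ends.
lead_seconds=$((DMTCP_Checkpoint_Time + 60))
checkpoint_at=$((walltime_seconds - lead_seconds))
echo "checkpoint starts ${checkpoint_at} s into the job (${lead_seconds} s before the end)"

# Default checkpoint directory name, with and without a job array.
# "myjob" is a hypothetical fallback for running this outside SLURM.
SLURM_JOB_NAME=${SLURM_JOB_NAME:-myjob}
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    default_dir="ckpt_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}"
else
    default_dir="ckpt_${SLURM_JOB_NAME}"
fi
echo "default checkpoint directory: ${default_dir}"
```

With these numbers the checkpoint would begin 360 seconds before the walltime limit, so raising DMTCP_Checkpoint_Time moves the checkpoint earlier by the same amount.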
The following is a modified example script with the changes:
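A minimal sketch of a script with these changes applied might look like the following. It is not the original example: it assumes a hypothetical job that copies `input.dat` and `myprogram.py` into a `work` directory and then runs `python3 myprogram.py`, and it assumes the script itself creates the "Files_Copied" marker with `touch`. The resource requests and variable values are placeholders. Since `module` and `longjob` exist only on the cluster, the snippet writes the script to `longjob_job.sb` and just syntax-checks it with `bash -n`.

```shell
# Write an illustrative longjob-enabled script to a file (placeholder values).
cat > longjob_job.sb <<'EOF'
#!/bin/bash --login
#SBATCH --job-name=myjob        # hypothetical job name
#SBATCH --time=03:55:00         # walltime kept below 4 hours
#SBATCH --ntasks=1
#SBATCH --mem=2G

# Start in the directory the job was submitted from.
cd "$SLURM_SUBMIT_DIR"

# Run the one-time setup only if the marker file does not exist yet.
if [ ! -f Files_Copied ]; then
    mkdir -p work
    cp input.dat myprogram.py work/
    touch Files_Copied          # assumption: the script creates the marker itself
fi
cd work

# Turn on alias expansion and load the powertools module.
shopt -s expand_aliases
module load powertools

# Optional settings; the values are examples only.
export DMTCP_Checkpoint_Time=600               # allow 10 minutes per checkpoint
export DMTCP_CHECKPOINT_DIR="ckpt_myjob_run1"  # hypothetical directory name

# Checkpoint the main command by prefixing it with longjob.
longjob python3 myprogram.py
EOF

# Parse the script without executing it.
bash -n longjob_job.sb && echo "longjob_job.sb parses cleanly"
```

On the cluster, the script would be submitted once with `sbatch longjob_job.sb`.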
If everything works as expected, you should be able to submit the
above job script, and it will resubmit itself until the job completes.
Note that this is rough code that has not been completely tested and does
not work in all cases. For example, one case that could pose a problem is if
the main program gets caught in a loop and never exits. In this case, the job
will keep resubmitting itself indefinitely. Note also that output redirection
does not behave as you might expect. The commands
myprogram.py > myoutput.txt
and
longjob myprogram.py > myoutput.txt
are not the same (in the second, the redirection applies to longjob,
not to your program).
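The redirection point is general shell behavior and can be seen with any wrapper command. The sketch below uses a hypothetical `wrapper` function as a stand-in; it is not longjob and does no checkpointing. Whether the `bash -c` form shown at the end behaves the same way under the real longjob has not been verified here, so treat it as an assumption to test on a small job first.

```shell
# Hypothetical stand-in for a wrapper command; this is NOT the real longjob.
wrapper() {
    echo "[wrapper] starting: $*"   # the wrapper's own message
    "$@"                            # run the wrapped command
}

# The redirection belongs to the whole command line, so the file receives
# the wrapper's message as well as the program's output.
wrapper echo "program output" > wrapped_output.txt

# Redirecting inside a child shell applies only to the program itself;
# the wrapper's message goes to a separate file.
wrapper bash -c 'echo "program output" > program_output.txt' > wrapper_log.txt

echo "--- wrapped_output.txt ---"; cat wrapped_output.txt
echo "--- program_output.txt ---"; cat program_output.txt
```

In the first case the file mixes the wrapper's messages with the program's output, while in the second case `program_output.txt` contains only what the program wrote.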
If you have difficulty, please contact us.