Powertools longjob by DMTCP

The following instructions show how to try out the longjob powertool on the HPCC system. Start with a basic submission script. For example, consider the following simple submission script:

#!/bin/bash -login
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=1
#SBATCH --time=168:00:00
#SBATCH --mem=2gb
#SBATCH --constraint=intel16
#SBATCH --job-name=MyJob0
#SBATCH --mail-type=FAIL,END

module purge
module load GCC/7.3.0-2.30 OpenMPI/3.1.1

srcdir=${SLURM_SUBMIT_DIR}/bin/
WORK=/mnt/scratch/${USER}/KineticSN/${SLURM_JOBID}
mkdir -p ${WORK}

# Copy files to work directory
cp -r $srcdir/* $WORK/

#Move to the working directory
cd $WORK

#Run my program
./SimulationTest -scattering_flag 0 -weak_reaction_flag 0 -outputVisData 100
ret=$?

scontrol show job ${SLURM_JOBID}

exit $ret

To get longjob to work, the following modifications might need to be made:

  1. Change the walltime to less than 4 hours if you would like more nodes to be available to your job.
  2. Wrap all setup code that only needs to run once in an if statement that checks for the file "Files_Copied", and have the setup code create that file when it finishes. Because no file named "Files_Copied" exists the first time the script runs, the setup code runs only on the first submission.
  3. Add the longjob command in front of the command in the submission script that you want to checkpoint.
  4. Load the powertools module and turn on aliases, i.e. add the following lines to the script:

    shopt -s expand_aliases
    module load powertools
    
  5. Set the following environment variables as appropriate for your job:

    • JobScript  –  Name of the job script file that gets resubmitted. The default is the name of the job script that was first submitted.
    • DMTCP_Checkpoint_Time  –  Time (in seconds) that DMTCP needs to complete a checkpoint. The default is 5 minutes.
    • DMTCP_CHECKPOINT_INTERVAL  –  Time (in seconds) between automatic checkpoints. The default is 4-8 hours. For a walltime of less than 4 hours, the default is a single checkpoint taken ${DMTCP_Checkpoint_Time} + 1 minute before the end of the walltime.
    • DMTCP_CHECKPOINT_DIR  –  Name of the directory in which to save the checkpoint image and log files. The default is ckpt_${SLURM_JOB_NAME}. For a job array, the default is ckpt_${SLURM_JOB_NAME}_${SLURM_ARRAY_TASK_ID}. If two different jobs run the longjob command from the same directory, make sure this variable (or SLURM_JOB_NAME) is set differently for each job so that their image files are not saved in the same directory.
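
As an illustration of keeping checkpoint directories apart, the sketch below builds a directory name that follows the default naming described above and optionally appends an extra tag. The helper name ckpt_dir_name, the tag argument, and the fallback name "job" are assumptions made for this example; they are not part of longjob or DMTCP.

```shell
#!/bin/bash
# Hypothetical helper (not part of longjob): build a checkpoint directory name
# that follows the default naming scheme described above.
# Usage: ckpt_dir_name [tag]
ckpt_dir_name() {
    local tag="${1:-}"
    # Fall back to "job" when run outside of SLURM (assumption for this sketch).
    local name="ckpt_${SLURM_JOB_NAME:-job}"
    # Array tasks get their task ID appended so tasks do not share a directory.
    if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
        name="${name}_${SLURM_ARRAY_TASK_ID}"
    fi
    # An optional tag separates two jobs that share a job name and a directory.
    if [ -n "$tag" ]; then
        name="${name}_${tag}"
    fi
    printf '%s\n' "$name"
}

# Set the variable explicitly before the longjob line in the job script.
export DMTCP_CHECKPOINT_DIR="$(ckpt_dir_name)"
echo "Checkpoint files will go to: ${DMTCP_CHECKPOINT_DIR}"
```

Calling ckpt_dir_name with a tag (for example, ckpt_dir_name runB) is one way to give a second job in the same directory its own checkpoint location without renaming the job.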

The following is a modified example script with the changes:

#!/bin/bash -login
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=1
#SBATCH --time=04:00:00
#SBATCH --mem=2gb
#SBATCH --constraint=intel16
#SBATCH --job-name=MyJob0
#SBATCH --mail-type=FAIL,END

echo "This script is from ICER's longjob tutorial"

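module purge
module load GCC/7.3.0-2.30 OpenMPI/3.1.1

# Turn on alias expansion and load powertools so the longjob command is available
shopt -s expand_aliases
module load powertools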

# Change checkpointing environment variables if necessary:
# export DMTCP_Checkpoint_Time=60                     -- change checkpointing time
# export DMTCP_CHECKPOINT_INTERVAL=7200               -- change time interval between checkpoints
# export DMTCP_CHECKPOINT_DIR=ckpt_${SLURM_JOB_NAME}  -- change where to save checkpointing files

# Change to a directory other than ${SLURM_SUBMIT_DIR} if necessary:
# cd /mnt/scratch/${USER}/WorkPlace

if [ ! -f Files_Copied ]
then
    srcdir=${SLURM_SUBMIT_DIR}/bin/
    WORK=/mnt/scratch/${USER}/KineticSN/${SLURM_JOBID}
    mkdir -p ${WORK}

    # Copy files to work directory
    cp -r $srcdir/* $WORK/

    # Move to the working directory and mark the setup as done
    cd $WORK
    touch Files_Copied
fi

# Run the main simulation program under longjob
longjob ./SimulationTest -scattering_flag 0 -weak_reaction_flag 0 -outputVisData 100
ret=$?

exit $ret
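
A detail worth checking when adapting the run-once guard: a test such as [ ! -f Files_Copied ] looks for the marker relative to whatever directory the script is in when the test runs, so the marker has to live somewhere every resubmission will look. The following is a minimal sketch of the same idiom with the marker given as a full path. The helper name run_once and the temporary demo directory are assumptions for illustration only and are not part of longjob.

```shell
#!/bin/bash
# Hypothetical helper: run a setup command only if the marker file is absent.
# Usage: run_once /full/path/to/marker command [args...]
run_once() {
    local marker="$1"
    shift
    if [ ! -f "$marker" ]; then
        # Create the marker only when the setup command succeeds,
        # so that a failed setup is retried on the next submission.
        "$@" && touch "$marker"
    fi
}

# Demonstration in a throwaway directory (a stand-in for a real work directory).
demo_dir="$(mktemp -d)"
marker="${demo_dir}/Files_Copied"

run_once "$marker" sh -c "echo setup >> '${demo_dir}/setup.log'"   # first submission: runs
run_once "$marker" sh -c "echo setup >> '${demo_dir}/setup.log'"   # resubmission: skipped

setup_runs="$(wc -l < "${demo_dir}/setup.log" | tr -d ' ')"
echo "setup ran ${setup_runs} time(s)"
```

In a real job script the marker path would point at a location that stays the same across resubmissions, for example a fixed directory under scratch rather than one derived from the job ID.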

If everything works as expected, you should be able to submit the modified job script and it will resubmit itself until the job completes. Note that this is rough code that has not been completely tested and does not work in all cases. For example, a problem arises if the main program gets caught in a loop and never exits: in this case, the script will keep resubmitting itself indefinitely. Note also that output redirection does not behave as you might expect: myprogram.py > myoutput.txt and longjob myprogram.py > myoutput.txt are not the same, because in the second command the redirection applies to longjob, not to your program.
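
The redirection behavior can be reproduced with any wrapper command. The sketch below uses a stand-in shell function named wrapper (an assumption for illustration; it is not longjob and does no checkpointing) to show where the output ends up in each case. Whether longjob can checkpoint a program launched through bash -c as in the second case has not been verified here, so treat that form as something to test on a short job first.

```shell
#!/bin/bash
# Stand-in for a wrapper command such as longjob (hypothetical: it only prints
# a status line and then runs whatever command it was given).
wrapper() {
    echo "[wrapper] starting: $*"
    "$@"
}

redir_dir="$(mktemp -d)"

# Case 1: the redirection belongs to the wrapper, so the file captures
# the wrapper's own output as well as the program's output.
wrapper echo "program output" > "${redir_dir}/case1.txt"

# Case 2: the redirection is part of the command that the wrapper runs,
# so only the program's output lands in the file.
wrapper bash -c 'echo "program output" > "$1"' _ "${redir_dir}/case2.txt"

case1_lines="$(wc -l < "${redir_dir}/case1.txt" | tr -d ' ')"
case2_lines="$(wc -l < "${redir_dir}/case2.txt" | tr -d ' ')"
echo "case 1 file has ${case1_lines} line(s); case 2 file has ${case2_lines} line(s)"
```

If the program can write its own output file (for example through a command-line option), that avoids the question of shell redirection altogether.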

If you have difficulty, please contact us.