Checkpoint with DMTCP
Note
DMTCP does not run correctly when the stack size limit is "unlimited". To set the limit to a number n, run ulimit -s n. For example, ulimit -s 8192 sets the stack size limit to 8192 KB. You can also add this command to your .bashrc file. It should also be set in the job batch script, as seen below.
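As a minimal sketch of the check described in the note, the snippet below changes the stack limit only when it is currently "unlimited". The helper name ensure_finite_stack and its default of 8192 are illustrative choices, not part of DMTCP.

```shell
# ensure_finite_stack is a hypothetical helper, not a DMTCP command.
# It sets the stack limit to the given value (in KB, default 8192)
# only if the current limit is "unlimited", then prints the limit in effect.
ensure_finite_stack() {
    want_kb="${1:-8192}"
    if [ "$(ulimit -s)" = "unlimited" ]; then
        ulimit -s "$want_kb"
    fi
    ulimit -s
}

# Call it directly (not inside $(...)) so the limit applies to the current shell.
ensure_finite_stack 8192
```

A limit that is already finite is left untouched, so the snippet is safe to place in .bashrc or at the top of a batch script.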
DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints both single and multi-node computations. It supports MPI (various implementations), OpenMP, MATLAB, Python, Perl, R, and many programming languages and shell scripting languages. DMTCP allows one to checkpoint running programs to disk, restart calculations from a checkpoint, or even migrate the processes to another host by moving the checkpoint files prior to restarting.
Each computation you wish to checkpoint requires one DMTCP coordinator. First, the DMTCP coordinator process is started on one host. Then application binaries are started with the dmtcp_launch command, which causes them to connect to the coordinator upon startup. As threads are spawned, child processes are forked, remote processes are spawned via ssh, libraries are dynamically loaded, and so on, DMTCP tracks them transparently and automatically. To checkpoint, use the dmtcp_command command to ask the coordinator to start checkpointing. To restart from a checkpoint, use dmtcp_restart.
By default, DMTCP uses gzip to compress the checkpoint images. Compression can be turned off, which makes checkpointing faster and is especially helpful if your memory is dominated by incompressible data. gzip can add seconds for large checkpoint images. Without gzip, checkpoint and restart typically take less than one second.
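A sketch of turning compression off is shown below. It assumes the installed DMTCP version honors the DMTCP_GZIP environment variable and the --no-gzip option of dmtcp_launch; check dmtcp_launch --help on your system before relying on either.

```shell
# Assumption: the installed DMTCP honors DMTCP_GZIP (0 disables compression).
# Check "dmtcp_launch --help" on your system before relying on this.
export DMTCP_GZIP=0

# Assumed per-launch form (commented out; ./a.out is a placeholder program):
# dmtcp_launch --no-gzip ./a.out
```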
Using DMTCP
Running a program with checkpointing usually involves the following four steps (additional options may be needed for special cases):
- Start the DMTCP coordinator
$ dmtcp_coordinator --daemon --exit-on-last $@ 1>/dev/null 2>&1 # run the coordinator as a daemon in the background
- Launch program
$ dmtcp_launch ./a.out # launch ./a.out
- Trigger checkpointing. This generates a set of checkpoint image files (file type: .dmtcp) and a shell script for restarting.
$ dmtcp_command --bcheckpoint # checkpoint and block until it completes
- Restart: the DMTCP coordinator writes a script, dmtcp_restart_script.sh, along with a checkpoint file (file type: .dmtcp) for each client process. The simplest way to restart a previously checkpointed computation is:
$ ./dmtcp_restart_script.sh # restart using the script
- Alternatively, if all processes were on the same processor and there were no .dmtcp files prior to this checkpoint:
$ dmtcp_restart ckpt_*.dmtcp
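The choice between the two restart methods can be sketched as a small helper. The function name pick_restart_command is hypothetical and not part of DMTCP; it only inspects a directory and prints the command you would run.

```shell
# pick_restart_command is a hypothetical helper, not part of DMTCP.
# It prints the restart command matching the files found in directory $1
# (default: current directory) and returns 1 if there is nothing to restart.
pick_restart_command() {
    dir="${1:-.}"
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        # preferred: the restart script written by the coordinator
        echo "$dir/dmtcp_restart_script.sh"
    elif ls "$dir"/ckpt_*.dmtcp >/dev/null 2>&1; then
        # fallback: restart directly from the checkpoint images
        echo "dmtcp_restart $dir/ckpt_*.dmtcp"
    else
        echo "no checkpoint found in $dir" >&2
        return 1
    fi
}
```

The helper prints the command without running it, so it can be tried in any directory before DMTCP is involved.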
DMTCP example
The following sample script, longjob.sb, uses DMTCP to checkpoint a long job. In this way, the job can be run as a sequence of jobs with short walltimes. To obtain the complete example, run module load powertools; getexample dmtcp_longjob.
#!/bin/bash -login
## resource requests for task:
#SBATCH -J count-longjob # Job Name
#SBATCH --time=00:06:00 # Note that 6 min is not enough to complete the job. It is enough for checkpointing and resubmitting the job
#SBATCH -N 1 -c 1 --mem=20MB # requested resource
#SBATCH --constraint=lac # user could add other requests as usual.
echo "This script is from ICER's DMTCP tutorial"
# set a finite stack size limit so that DMTCP works
ulimit -s 8192
# current working directory should have the source code count.c
cd ${SLURM_SUBMIT_DIR}
# name of this script file. The script may be resubmitted multiple times until the job completes
export SLURM_JOBSCRIPT="longjob.sb"
######################## start dmtcp_coordinator #######################
fname=port.$SLURM_JOBID # to store port number
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1 # start the coordinator
h=`hostname` # get coordinator's host name
p=`cat $fname` # get coordinator's port number
export DMTCP_COORD_HOST=$h # save the coordinator's host info in an environment variable
export DMTCP_COORD_PORT=$p # save the coordinator's port info in an environment variable
#rm $fname
# uncomment the following lines to print out some information if desired
#echo "coordinator is on host $DMTCP_COORD_HOST "
#echo "port number is $DMTCP_COORD_PORT "
#echo " working directory: ${SLURM_SUBMIT_DIR} "
#echo " job script is $SLURM_JOBSCRIPT "
####################### BODY of the JOB ######################
# prepare work environment of the job
module swap GNU/6.4.0-2.28 GCC/4.9.2
# build the program if executable file does not exist
if [ ! -f count.exe ]
then
    cc count.c -o count.exe
fi
# run the program count.exe.
# To run interactively:
# $ ./count.exe n num.odd 1> num.even
# It counts up to the number n and generates 2 files:
# num.odd contains all the odd numbers;
# num.even contains all the even numbers.
# To run with DMTCP, use the dmtcp commands:
# on the first launch, use "dmtcp_launch";
# otherwise, use "dmtcp_restart".
# Set the checkpoint interval. After starting the program, this script waits
# for the interval (in seconds) and then starts the checkpoint.
export CKPT_WAIT_SEC=$(( 3 * 60 )) # checkpoint after the program has run for 3 min
# Launch or restart the execution
# (ls is used for the test because it handles any number of matching ckpt files)
if ! ls ckpt_*.dmtcp 1>/dev/null 2>&1 # if no ckpt file exists, this is the first run, use dmtcp_launch
then
    # first run: use dmtcp_launch to start the job and run it in the background
    dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --rm --ckpt-open-files ./count.exe 800 num.odd 1> num.even 10>&- 11>&- &
    # wait for the checkpoint interval before checkpointing
    sleep $CKPT_WAIT_SEC
    # start checkpointing
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files --bcheckpoint
    # kill the running job after checkpointing
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
    # resubmit the job
    sbatch $SLURM_JOBSCRIPT
else # it is a restart run
    # restart the job from the checkpoint files ckpt_*.dmtcp and run it in the background
    dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp 1> num.even &
    # wait for the checkpoint interval before checkpointing
    sleep $CKPT_WAIT_SEC
    # if the program is still running, checkpoint it and resubmit
    if dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -s 1>/dev/null 2>&1
    then
        # clean up old ckpt files before checkpointing
        rm -r ckpt_*.dmtcp
        # checkpoint the job
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files -bc
        # kill the running program and quit
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
        # resubmit this script to slurm
        sbatch $SLURM_JOBSCRIPT
    else
        echo "job finished"
    fi
fi
# show the job status info
scontrol show job $SLURM_JOB_ID
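
Each submission of the script advances the program by roughly one checkpoint interval before it is checkpointed and resubmitted, so the number of submissions can be estimated in advance. The sketch below is a planning aid only: the helper name estimate_submissions is hypothetical, the total run time is a figure you supply, and checkpoint, restart, and queue-wait overheads are ignored.

```shell
# estimate_submissions is a hypothetical planning helper, not part of DMTCP or Slurm.
# $1: estimated total run time of the program, in seconds (a value you supply)
# $2: checkpoint interval in seconds (CKPT_WAIT_SEC in the script above)
# Prints ceil($1 / $2), the approximate number of job submissions needed.
estimate_submissions() {
    total_sec="$1"
    interval_sec="$2"
    echo $(( (total_sec + interval_sec - 1) / interval_sec ))
}

# Example: a program needing about 10 minutes, checkpointed every 3 minutes.
estimate_submissions 600 180   # prints 4
```

If the estimate is large, increasing CKPT_WAIT_SEC (and the --time request along with it) reduces the number of resubmissions.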