To run DMTCP correctly, the stack size limit must not be
"unlimited". To set it to a number n, run ulimit -s n. For example,
ulimit -s 8192 sets the stack size limit to 8192 KB. You can also add
this command to your .bashrc file. It should also be set in the job batch script, as
seen below.
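As a minimal sketch of the check described above, the snippet below lowers the limit only when it is currently "unlimited". The variable name stack_limit is chosen here for illustration, and 8192 is simply the value used in the example above.

```shell
# Read the current stack size limit; lower it only if it is "unlimited".
stack_limit=$(ulimit -s)
if [ "$stack_limit" = "unlimited" ]; then
    ulimit -s 8192    # ulimit -s takes its value in kilobytes
fi
echo "stack size limit: $(ulimit -s)"
```

A limit that is already finite is left unchanged, so the snippet should be safe to place in .bashrc or at the top of a job script.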
DMTCP (Distributed MultiThreaded Checkpointing) transparently
checkpoints both single-node and multi-node computations. It
supports MPI (various implementations), OpenMP, MATLAB, Python, Perl, R,
and many other programming and shell scripting languages. DMTCP allows one to
checkpoint running programs to disk, restart calculations from a
checkpoint, or even migrate the processes to another host by
moving the checkpoint files prior to restarting.
Each computation you wish to checkpoint requires one DMTCP coordinator.
First, the DMTCP coordinator process is started on one host.
Then application binaries are started with the dmtcp_launch command, causing
them to connect to the coordinator upon startup. As threads are spawned,
child processes are forked, remote processes are spawned via ssh, libraries
are dynamically loaded, and so on, DMTCP tracks them transparently and
automatically. To checkpoint, use the dmtcp_command command to ask the
coordinator to start checkpointing. To restart from a checkpoint, use dmtcp_restart.
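Clients need the coordinator's host and port in order to connect. The job script later on this page stores them in the DMTCP_COORD_HOST and DMTCP_COORD_PORT environment variables after reading the port from a file. The sketch below wraps that step in a small function with a sanity check. The function name set_coordinator_env is hypothetical, and the sketch assumes the coordinator runs on the same host that calls the function.

```shell
# Hypothetical helper: export the coordinator's location from a port file.
# $1 is the file that the coordinator's port number was written to.
set_coordinator_env() {
    port_file=$1
    if [ ! -s "$port_file" ]; then
        echo "port file '$port_file' is missing or empty" >&2
        return 1
    fi
    # Assumption: the coordinator was started on this host.
    DMTCP_COORD_HOST=$(hostname 2>/dev/null || uname -n)
    DMTCP_COORD_PORT=$(cat "$port_file")
    export DMTCP_COORD_HOST DMTCP_COORD_PORT
}
```

A call such as set_coordinator_env "port.$SLURM_JOBID" returns a non-zero status when the file is missing or empty, so a script can stop before launching clients that could not find the coordinator.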
By default, DMTCP uses gzip to compress the checkpoint images. Compression
can be turned off, which makes checkpointing faster and is especially
helpful if your memory is dominated by incompressible data. gzip can add
seconds to checkpointing for large checkpoint images. Without gzip,
checkpoint and restart typically each take less than one second.
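The effect of incompressible data can be illustrated without DMTCP. The sketch below is not part of DMTCP: it gzips 1 MB of zeros and 1 MB of random bytes and prints the compressed sizes. The file names and sizes are arbitrary choices for illustration. To switch compression off for a real run, dmtcp_launch documents a --no-gzip option and a DMTCP_GZIP=0 environment variable; check dmtcp_launch --help for the installed version.

```shell
# Compare how well gzip compresses 1 MB of zeros versus 1 MB of random bytes.
tmpdir=$(mktemp -d)
head -c 1048576 /dev/zero    > "$tmpdir/zeros.bin"
head -c 1048576 /dev/urandom > "$tmpdir/random.bin"

# $(( ... )) strips any padding that wc may add around the number.
zeros_gz=$(( $(gzip -c "$tmpdir/zeros.bin" | wc -c) ))
random_gz=$(( $(gzip -c "$tmpdir/random.bin" | wc -c) ))

echo "zeros:  1048576 -> $zeros_gz bytes"
echo "random: 1048576 -> $random_gz bytes"
rm -r "$tmpdir"
```

The random file barely shrinks, so for data of that kind the time spent in gzip buys almost nothing.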
Using DMTCP
Running a program with checkpointing usually involves the following four
steps (option settings may be needed for special cases):
1. Start the DMTCP coordinator:
$ dmtcp_coordinator --daemon --exit-on-last $@ 1>/dev/null 2>&1 # run coordinator as a daemon in the background
2. Launch the program:
$ dmtcp_launch ./a.out # launch ./a.out
3. Trigger checkpointing. This generates a set of checkpoint image files
(file type: .dmtcp) and a shell script for restart:
$ dmtcp_command --bcheckpoint # checkpoint
4. Restart. At checkpoint time the DMTCP coordinator writes
a script, dmtcp_restart_script.sh, along with a checkpoint file
(file type: .dmtcp) for each client process. The simplest way to
restart a previously checkpointed computation is:
$ ./dmtcp_restart_script.sh # restart using the script
Alternatively, if all processes were on the same host, and there were no .dmtcp files prior to this checkpoint:
$ dmtcp_restart ckpt_*.dmtcp
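The choice between launching and the two restart methods can be sketched as a small decision function. The name pick_start_mode and the three labels it prints are hypothetical, chosen only for this illustration. The function inspects a directory and does not call DMTCP itself.

```shell
# Hypothetical helper: report how a job in directory $1 should be started.
pick_start_mode() {
    dir=${1:-.}
    if [ -x "$dir/dmtcp_restart_script.sh" ]; then
        echo "restart-script"     # run ./dmtcp_restart_script.sh
    elif ls "$dir"/ckpt_*.dmtcp >/dev/null 2>&1; then
        echo "restart-images"     # run dmtcp_restart ckpt_*.dmtcp
    else
        echo "launch"             # first run: use dmtcp_launch
    fi
}
```

Testing with ls rather than [ -f ckpt_*.dmtcp ] keeps the check valid when several checkpoint files match the pattern, as happens when more than one process is checkpointed.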
DMTCP example
The following sample script, longjob.sb, uses DMTCP to checkpoint
a long job so that the job can be run as a sequence of
short-walltime jobs. To obtain the complete example, run module load
powertools; getexample dmtcp_longjob.
#!/bin/bash -login
## resource requests for task:
#SBATCH -J count-longjob        # job name
#SBATCH --time=00:06:00         # 6 min is not enough to complete the job, but it is enough to checkpoint and resubmit
#SBATCH -N 1 -c 1 --mem=20MB    # requested resources
#SBATCH --constraint=lac        # other requests can be added as usual
# set a limited stack size so DMTCP can work
ulimit -s 8192
# the current working directory should contain the source code count.c
cd ${SLURM_SUBMIT_DIR}
# name of this script file; the script may be resubmitted several times until the job completes
export SLURM_JOBSCRIPT="longjob.sb"
######################## start dmtcp_coordinator #######################
fname=port.$SLURM_JOBID         # file to store the port number
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1   # start coordinator
h=`hostname`                    # get coordinator's host name
p=`cat $fname`                  # get coordinator's port number
export DMTCP_COORD_HOST=$h      # save coordinator's host in an environment variable
export DMTCP_COORD_PORT=$p      # save coordinator's port in an environment variable
#rm $fname
# uncomment the following lines to print some information if desired
#echo "coordinator is on host $DMTCP_COORD_HOST "
#echo "port number is $DMTCP_COORD_PORT "
#echo " working directory: ${SLURM_SUBMIT_DIR} "
#echo " job script is $SLURM_JOBSCRIPT "
####################### BODY of the JOB ######################
# prepare work environment of the job
module swap GNU/6.4.0-2.28 GCC/4.9.2
# build the program if the executable file does not exist
if [ ! -f count.exe ]
then
    cc count.c -o count.exe
fi
# run the program count.exe.
# To run interactively:
# $ ./count.exe n num.odd 1> num.even
# it will count to number n and generate 2 files:
# num.odd contains all the odd numbers;
# num.even contains all the even numbers.
# To run with DMTCP, use dmtcp commands.
# if first time launch, use "dmtcp_launch"
# otherwise use "dmtcp_restart"
# set checkpoint interval. After starting the program, this script waits
# for the interval (in seconds), then starts the checkpoint.
export CKPT_WAIT_SEC=$(( 3 * 60 ))    # checkpoint after the program has run for 3 min
# Launch or restart the execution
# (this test assumes at most one ckpt file, as produced by this single-process job)
if [ ! -f ckpt_*.dmtcp ]    # if no ckpt file exists, this is the first run
then
    # first run: use dmtcp_launch to start the job and run it in the background
    dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --rm --ckpt-open-files ./count.exe 800 num.odd 1> num.even 10>&- 11>&- &
    # wait for the checkpoint interval before checkpointing
    sleep $CKPT_WAIT_SEC
    # start checkpointing
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files --bcheckpoint
    # kill the running job after checkpointing
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
    # resubmit the job
    sbatch $SLURM_JOBSCRIPT
else    # it is a restart run
    # restart the job from the checkpoint files ckpt_*.dmtcp and run it in the background
    dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp 1>num.even &
    # wait for the checkpoint interval before checkpointing
    sleep $CKPT_WAIT_SEC
    # if the program is still running, checkpoint and resubmit
    if dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -s 1>/dev/null 2>&1
    then
        # clean up old ckpt files before checkpointing
        rm -r ckpt_*.dmtcp
        # checkpoint the job
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files -bc
        # kill the running program and quit
        dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
        # resubmit this script to slurm
        sbatch $SLURM_JOBSCRIPT
    else
        echo "job finished"
    fi
fi
# show the job status info
scontrol show job $SLURM_JOB_ID
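Because each submission runs the program for one checkpoint interval and then resubmits itself, the number of chained jobs can be estimated with a ceiling division. The sketch below uses the 3 minute interval and 6 minute walltime from the script. The total run time of 1000 seconds is a made-up figure for illustration, not a measurement of count.exe, and the estimate ignores the time spent restarting in each job.

```shell
# Estimate how many chained jobs a long run needs.
total_sec=1000                 # hypothetical total run time of the program, in seconds
ckpt_wait_sec=$(( 3 * 60 ))    # run time per job before checkpointing (CKPT_WAIT_SEC in the script)
walltime_sec=$(( 6 * 60 ))     # walltime requested with --time=00:06:00

# Ceiling division: the last job may run for less than a full interval.
jobs_needed=$(( (total_sec + ckpt_wait_sec - 1) / ckpt_wait_sec ))
echo "jobs needed: $jobs_needed"

# The wait interval must leave room for checkpointing and resubmission.
if [ "$ckpt_wait_sec" -ge "$walltime_sec" ]; then
    echo "warning: checkpoint interval does not fit in the walltime" >&2
fi
```

With these numbers the estimate is 6 jobs. Lengthening the interval reduces the number of jobs, but the interval has to stay well below the requested walltime so that the checkpoint can finish before Slurm ends the job.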