
Checkpoint with DMTCP

Note

To run DMTCP correctly, the stack size limit cannot be "unlimited". To set it to a number n, run ulimit -s n. For example, ulimit -s 8192 sets the stack size limit to 8192 KB. You can also add this command to your .bashrc file. It should also be set in the job batch script, as seen in the example script below.
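
A batch script can guard against an unlimited stack before any DMTCP command runs. The following is a minimal sketch; the function name ensure_finite_stack is hypothetical, and the 8192 default simply mirrors the example in the note.

```shell
# Hypothetical helper: lower the stack limit only when it is "unlimited".
ensure_finite_stack() {
    limit_kb="${1:-8192}"                  # value in KB, the unit used by ulimit -s
    if [ "$(ulimit -s)" = "unlimited" ]; then
        ulimit -s "$limit_kb"
    fi
}

ensure_finite_stack 8192
echo "stack size limit: $(ulimit -s) KB"
```

A limit that is already finite is left alone, so a smaller site default is not raised by accident.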

DMTCP (Distributed MultiThreaded Checkpointing) transparently checkpoints both single and multi-node computations. It supports MPI (various implementations), OpenMP, MATLAB, Python, Perl, R, and many programming languages and shell scripting languages. DMTCP allows one to checkpoint running programs to disk, restart calculations from a checkpoint, or even migrate the processes to another host by moving the checkpoint files prior to restarting.

Each computation you wish to checkpoint requires one DMTCP coordinator. First, the DMTCP coordinator process is started on one host. Then application binaries are started with the dmtcp_launch command, causing them to connect to the coordinator upon startup. As threads are spawned, child processes are forked, remote processes are spawned via ssh, libraries are dynamically loaded, etc., DMTCP transparently and automatically tracks them. To checkpoint, send a checkpoint request to the coordinator with dmtcp_command. To restart from a checkpoint, use dmtcp_restart.
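
Launched processes locate the coordinator by host and port. The example script later on this page passes both through the DMTCP_COORD_HOST and DMTCP_COORD_PORT environment variables after reading the port from a file written with --port-file. A minimal sketch of that step follows; the function name load_coordinator_address and its 10-second polling limit are assumptions made for illustration.

```shell
# Hypothetical helper: read the coordinator port from a port file and export
# the address for later dmtcp_launch / dmtcp_command / dmtcp_restart calls.
load_coordinator_address() {
    port_file="$1"
    tries=0
    # The coordinator may need a moment to write the file; poll briefly.
    while [ ! -s "$port_file" ] && [ "$tries" -lt 10 ]; do
        sleep 1
        tries=$((tries + 1))
    done
    [ -s "$port_file" ] || return 1        # give up: no port was written
    DMTCP_COORD_HOST="$(uname -n)"         # node name of this host
    DMTCP_COORD_PORT="$(cat "$port_file")"
    export DMTCP_COORD_HOST DMTCP_COORD_PORT
}
```

It would be called right after starting the coordinator on the same host, for example load_coordinator_address port.$SLURM_JOBID. The function returns a non-zero status if the port file never appears.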

By default, DMTCP uses gzip to compress the checkpoint images. Compression can be turned off, which makes checkpointing faster and is especially helpful when memory is dominated by incompressible data. gzip can add seconds to the checkpoint time for large images; without gzip, checkpoint and restart typically take less than one second.
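
A minimal sketch of turning compression off. It assumes the --no-gzip option and the DMTCP_GZIP environment variable listed in the dmtcp_launch help text; run dmtcp_launch --help to confirm both for your installed version. ./a.out is a placeholder program.

```shell
# Assumption: DMTCP_GZIP=0 disables gzip for processes launched from this shell.
export DMTCP_GZIP=0

# Assumption: --no-gzip is the per-command equivalent. Run it only if DMTCP
# and the placeholder program ./a.out are actually available.
if command -v dmtcp_launch >/dev/null 2>&1 && [ -x ./a.out ]; then
    dmtcp_launch --no-gzip ./a.out
fi
```

Uncompressed images take more disk space, so this trade is mainly worthwhile when checkpoint time matters more than storage.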

Using DMTCP

Running a program with checkpointing usually involves the following 4 steps (additional options may be needed in special cases):

  1. Start the DMTCP coordinator
    • $ dmtcp_coordinator --daemon --exit-on-last $@ 1>/dev/null 2>&1 # run coordinator as daemon in background
  2. Launch program
    • $ dmtcp_launch ./a.out # launch ./a.out
  3. Trigger checkpointing. This generates a set of checkpoint image files (file type: .dmtcp) and a shell script for restarting.
    • $ dmtcp_command --bcheckpoint # checkpointing
  4. Restart: the DMTCP coordinator will write a script, dmtcp_restart_script.sh, along with a checkpoint file (file type: .dmtcp) for each client process. The simplest way to restart a previously checkpointed computation is:
    • $ ./dmtcp_restart_script.sh # restart using script
    • Alternatively, if all processes were on the same processor, and there were no .dmtcp files prior to this checkpoint: $ dmtcp_restart ckpt_*.dmtcp
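
In a script, choosing between step 2 (launch) and step 4 (restart) comes down to whether checkpoint images already exist. A minimal sketch of that test, written so that it also works when several ckpt_*.dmtcp files are present; the function name has_checkpoint is hypothetical.

```shell
# Hypothetical helper: succeed if at least one checkpoint image exists in
# directory $1 (default: current directory). Works for any number of files.
has_checkpoint() {
    for f in "${1:-.}"/ckpt_*.dmtcp; do
        [ -e "$f" ] && return 0
    done
    return 1
}

if has_checkpoint; then
    echo "checkpoint found: restart with ./dmtcp_restart_script.sh"
else
    echo "no checkpoint: start with dmtcp_launch"
fi
```

A plain [ -f ckpt_*.dmtcp ] test fails with a "too many arguments" error as soon as the pattern matches more than one file. Looping over the pattern avoids that.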

DMTCP example

The following is the sample script longjob.sb that uses DMTCP for checkpointing a long job. In this way, the job can be run as a sequence of short walltime jobs. To obtain the complete example, run module load powertools; getexample dmtcp_longjob.

#!/bin/bash -login

## resource requests for task:
#SBATCH -J count-longjob                  # job name
#SBATCH --time=00:06:00                   # 6 min is not enough to complete the job, but it is enough for checkpointing and resubmitting
#SBATCH -N 1 -c 1 --mem=20MB              # requested resources
#SBATCH --constraint=lac                  # other requests can be added as usual

# set a finite stack size so that DMTCP can work
ulimit -s 8192

# the current working directory should contain the source code count.c
cd ${SLURM_SUBMIT_DIR}

# name of this script file; it may be resubmitted multiple times until the job completes
export SLURM_JOBSCRIPT="longjob.sb"


######################## start dmtcp_coordinator #######################

fname=port.$SLURM_JOBID                                                                 # file to store the port number
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1   # start the coordinator
h=$(hostname)                                                                           # get the coordinator's host name
p=$(cat $fname)                                                                         # get the coordinator's port number
export DMTCP_COORD_HOST=$h                                                  # save the coordinator's host in an environment variable
export DMTCP_COORD_PORT=$p                                                  # save the coordinator's port in an environment variable
#rm $fname




# uncomment the following lines to print out some information if desired
#echo "coordinator is on host $DMTCP_COORD_HOST "
#echo "port number is $DMTCP_COORD_PORT "
#echo " working directory: ${SLURM_SUBMIT_DIR} "
#echo " job script is $SLURM_JOBSCRIPT "




####################### BODY of the JOB ######################

# prepare the work environment of the job
module swap GNU/6.4.0-2.28 GCC/4.9.2

# build the program if the executable file does not exist
if [ ! -f count.exe ]
then
    cc count.c -o count.exe
fi




# run the program count.exe.
# To run interactively:
#    $ ./count.exe n num.odd 1> num.even
# It counts to the number n and generates 2 files:
# num.odd contains all the odd numbers;
# num.even contains all the even numbers.

# To run with DMTCP, use the dmtcp commands:
# on the first launch, use "dmtcp_launch";
# otherwise, use "dmtcp_restart".

# set the checkpoint interval. After dmtcp_launch, this script waits
# for the interval (in seconds) and then starts the checkpoint.
export CKPT_WAIT_SEC=$(( 3 * 60 ))            # checkpoint after the program has run for 3 min




# Launch or restart the execution
if ! ls ckpt_*.dmtcp 1>/dev/null 2>&1         # if no ckpt file exists, this is the first run; use dmtcp_launch
then

  # first run: use dmtcp_launch to start the job and run it in the background
  dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --rm --ckpt-open-files ./count.exe 800 num.odd 1> num.even 10>&- 11>&- &

  # wait for the checkpoint interval (in seconds) before checkpointing
  sleep $CKPT_WAIT_SEC



  # start checkpointing
  dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files --bcheckpoint


  # kill the running job after checkpointing
  dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit


  # resubmit the job
  sbatch $SLURM_JOBSCRIPT


else            # it is a restart run

  # restart job with checkpoint files ckpt_*.dmtcp and run in background
  dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp 1> num.even &


  # wait for a checkpoint interval to start checkpointing
  sleep $CKPT_WAIT_SEC


  # if program is still running, do the checkpoint and resubmit

  if dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -s 1>/dev/null 2>&1
  then   
    # clean up old ckpt files before starting the new checkpoint
    rm -r ckpt_*.dmtcp

    # checkpointing the job
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files -bc

    # kill the running program and quit
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit

    # resubmit this script to slurm
    sbatch $SLURM_JOBSCRIPT

  else

    echo "job finished"

  fi

fi

# show the job status info
scontrol show job $SLURM_JOB_ID
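
The checkpoint wait has to fit inside the requested walltime, with room left for writing the checkpoint and resubmitting. In the script above that is 3 minutes out of a 6 minute limit. A minimal sketch of the arithmetic follows; the helper names walltime_to_sec and ckpt_wait_sec and the 180 second margin are assumptions, not part of the original example.

```shell
# Hypothetical helper: convert an HH:MM:SS walltime string to seconds.
walltime_to_sec() {
    h="${1%%:*}"; rest="${1#*:}"
    m="${rest%%:*}"; s="${rest#*:}"
    # Drop one leading zero so values such as 08 are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Hypothetical helper: seconds to wait before checkpointing, leaving a margin
# (in seconds) for writing the checkpoint and resubmitting the job.
ckpt_wait_sec() {
    total="$(walltime_to_sec "$1")"
    margin="$2"
    echo $(( total - margin ))
}

CKPT_WAIT_SEC="$(ckpt_wait_sec 00:06:00 180)"
echo "$CKPT_WAIT_SEC"                      # prints 180
```

Deriving the wait from the walltime keeps the two values consistent when the --time request is changed. A suitable margin depends on the checkpoint size and on file system speed.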