Warning
This is a Lab Notebook that describes how to solve a specific problem at a specific point in time. Please keep this in mind as you read and use the content, and pay close attention to the date, version information, and other details.
1. Introduction to the Scavenger Queue
The scavenger queue is a specific queue on the HPCC that allows jobs to fill gaps in the scheduler to improve overall efficiency of the system. The drawback is that jobs scheduled via the scavenger queue can be stopped and restarted at any point. For many jobs this would be detrimental, but for jobs that can leverage checkpointing, we can restart without much loss of progress.
Check out the documentation on the scavenger queue at https://docs.icer.msu.edu/Scavenger_Queue/.
2. Checkpointing
Checkpointing is the process of periodically saving the state of a program. While it is especially useful when working with the scavenger queue, checkpointing can be useful in any scenario where it is important to not lose substantial progress if a program is interrupted for any reason.
The following example demonstrates a basic implementation of checkpointing in Python. This code uses the dill library to save and reload the current variables in Python. You can install the dill library in your home directory using one of the following commands on a development node (depending on how your Python environment is set up):
pip install --user dill
or
conda install dill
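If you want to confirm that the installation succeeded, you can try importing dill from the same environment. For example, the following one-line command (run on the development node) should print the installed version without any errors:
python -c "import dill; print(dill.__version__)"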
Note
The dill library works with most of the core Python variable types. However, some special objects will not work with dill. Refer to the dill package documentation for how best to use the library with your own workflows.
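If you are unsure whether a particular object in your workflow can be checkpointed, one quick test is to try serializing it with dill and catch any failure. The can_checkpoint helper below is just a sketch of this idea (it is not part of the dill API):
import dill

def can_checkpoint(obj):
    '''Return True if dill is able to serialize the given object.'''
    try:
        dill.dumps(obj)  # serialize to a byte string, then discard the result
        return True
    except Exception:
        return False

# Plain Python data structures generally serialize without trouble
print(can_checkpoint({"counts": [1, 2, 3], "label": "test"}))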
Once you have dill installed on the HPCC, you can run the following example (saved as counter.py) on a development node with the python counter.py command. At some point after the first 10 cycles, kill the program using Ctrl+C, then restart it. It should load its saved state and pick up where it left off.
import dill
import os
import sys
import time

def checkpointSave(name):
    '''Save the entire session to a checkpoint file'''
    with open(str(name), "wb") as file:
        dill.dump_session(filename=file)

def checkpointLoad(name):
    '''Load the entire session from a checkpoint file (if one exists)'''
    if os.path.exists(str(name)):
        print("\n Checkpoint Loading... \n")
        with open(str(name), 'rb') as file:
            dill.load_session(filename=file)
        # load_session restores the saved variables into this script's global
        # namespace, so 'data' now holds the checkpointed value.
        print("\n Loaded Data: ", globals().get('data'), "\n")

if __name__ == "__main__":
    # Optionally take an argument for the checkpoint filename
    if len(sys.argv) > 1:
        name = sys.argv[1]
    else:
        name = 'checkpoint.pkl'

    # Start from scratch, then load an existing checkpoint (if present),
    # which overwrites 'data' with the last saved value
    data = 0
    checkpointLoad(name)

    # Do "work"
    while data < 100000:
        data += 1
        if data % 10 == 0:  # save every 10 iterations
            checkpointSave(name)
        print("Data=", data)
        time.sleep(1)
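One refinement worth considering (a sketch, not part of the example above): if a scavenger job happens to be stopped in the middle of checkpointSave, the checkpoint file could be left half-written. A common pattern is to write the checkpoint to a temporary file first and then atomically rename it into place:
import os
import dill

def checkpointSaveAtomic(name):
    '''Save the session to a temporary file, then atomically rename it into place.'''
    tmp = str(name) + ".tmp"
    dill.dump_session(filename=tmp)  # dump_session also accepts a plain filename
    os.replace(tmp, str(name))       # atomic rename on the same filesystem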
3. Putting it all together
Now that the code has been tested (and is working) on a development node, let's write a SLURM submission script to run it on the scavenger queue.
The following is an example submission script (saved here, for example, as counter.sb) that can be used to run the job on the scavenger queue:
#!/bin/bash --login
#SBATCH --time 168:00:00
#SBATCH --qos=scavenger
python counter.py
Submit the script with the sbatch counter.sb command. The script requests an entire week of walltime (although we really only need about 28 hours for the 100,000 one-second iterations). Since the job runs in the scavenger queue, it may be stopped at any point to let regularly scheduled jobs use the resources. However, thanks to the checkpoint, when the job is automatically restarted it will pick up from the last saved state and continue on its way.
Of course, you will need to update your own code to save all of the important variables (recall from above that not all variable types are compatible with dill), but checkpointing your job will make both your job and the system as a whole more efficient and more robust to errors.
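If your session contains objects that dill cannot handle, one alternative (again, just a sketch) is to checkpoint only the variables you actually need by storing them in a dictionary and using dill.dump and dill.load instead of saving the whole session:
import os
import dill

def saveState(name, state):
    '''Save a dictionary of selected variables to a checkpoint file.'''
    with open(str(name), "wb") as f:
        dill.dump(state, f)

def loadState(name, default):
    '''Return the saved dictionary, or the provided defaults if no checkpoint exists.'''
    if os.path.exists(str(name)):
        with open(str(name), "rb") as f:
            return dill.load(f)
    return default

# Example usage: only 'data' is checkpointed; everything else is recreated on restart
state = loadState("checkpoint.pkl", {"data": 0})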
Originally written by Dr. Nathan Haut for CMSE401 and updated by Dr. Dirk Colbry as an ICER Lab Notebook, Michigan State University

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.