Running a Gaussian Job with Checkpointing

Running a large system with Gaussian usually takes a long time and many resources. It is a good idea to set up checkpointing so that the calculation can continue after an interruption caused by the walltime limit or a system malfunction. Checkpointing saves a snapshot of the running Gaussian state, so a later job can restart from where the previous one stopped. Users can also divide a long job into many short 4-hour jobs, since jobs with a walltime of 4 hours or less can use the buy-in nodes (55% of all nodes) on the HPCC.

To checkpoint a Gaussian run properly, a unified read-write file setting (%RWF) should be in the Link 0 section of the input file. An example, water.gjf, follows:

water.gjf

%NProcShared=2
%Mem=3GB
%RWF=water.rwf
%NoSave
%chk=water.chk
#P opt b3lyp/aug-cc-pVTZ

water molecules

0 1
O   -2.12123400  1.99409800 -1.27381200
H    1.52438600  0.53672100  0.67508800
H    1.76493000 -0.81527300 -0.18137000
O   -1.12977500 -0.31430400 -0.37860700
H   -1.76492800 -0.81528500  0.18137200
H   -1.52439700  0.53670800 -0.67510100
O    2.89125300 -1.69896600 -1.06351900
O    1.12976700 -0.31428900  0.37859300
H    2.99568600 -1.73945400 -2.01677200
H    3.39746100 -2.42787400 -0.69708600
O   -2.89123000 -1.69896400  1.06353700
H   -2.99563400 -1.73945600  2.01679300
H   -2.43456700  2.07972500 -2.17761600
H   -2.58174600  2.66131900 -0.75942800
H   -3.39743400 -2.42788000  0.69711700

The input file requests a geometry optimization of 5 water molecules with the very large basis set aug-cc-pVTZ. The whole calculation takes about 25 CPU hours. The %RWF setting specifies the file water.rwf, which is used for checkpointing in addition to the water.chk file. Because %RWF is placed before the %NoSave line, the rwf file is deleted if the calculation completes normally without any error. The water.chk file, which is named after %NoSave, is kept.

To run several restarts after the first run stops, we can build a restart input file, restart.gjf, simply as

restart.gjf

%NProcShared=2
%Mem=3GB
%RWF=water.rwf
%NoSave
%chk=water.chk
#P Restart

Since all information about the calculation is recorded in the rwf file, a route line containing only "Restart" is enough for Gaussian to continue the previous job. The restart input file can also be created with the commands:

grep '^%' water.gjf > restart.gjf
echo -e '#P Restart\n' >> restart.gjf

where we simply grep the lines starting with the "%" sign in water.gjf and put them in the Gaussian restart file, followed by the "#P Restart" line and a blank line.
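A restart only works if restart.gjf names the same rwf and chk files as the original input. The following is a minimal sketch of one way to check this before submitting; the function name same_link0 is hypothetical and not part of the original tutorial.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if both Gaussian input files carry
# identical Link 0 lines (the lines starting with "%").
same_link0() {
    [ "$(grep '^%' "$1")" = "$(grep '^%' "$2")" ]
}

# Run the check only when both files are present in the current directory.
if [ -f water.gjf ] && [ -f restart.gjf ]; then
    if same_link0 water.gjf restart.gjf; then
        echo "Link 0 sections match"
    else
        echo "Link 0 sections differ; the restart may not find the rwf and chk files" >&2
    fi
fi
```

The comparison is deliberately strict: any difference in %RWF, %chk, %Mem or %NProcShared is reported, which is more than a restart strictly needs but keeps the check simple.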

Now we need a job script to submit the Gaussian calculation. The script needs to keep submitting jobs to restart the previous calculation until it is completed. Here is a job script water.sb which can do the work:

water.sb

#!/bin/bash
#SBATCH --job-name=LongJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=5G
#SBATCH --time=04:00:00

echo "This script is from ICER's tutorial on checkpointing Gaussian"

module load Gaussian/g16 powertools
OutputFile="water-${SLURM_JOBID}.log"             # Gaussian output file name for each job

# How many seconds before end of job to submit another
BeforeEnd=300                                       # 5 minutes

# Background script that keeps submitting jobs until the calculation is completed
(sleep $((4*60*60 - BeforeEnd))                     # sleep until the time before end of job
js -j ${SLURM_JOBID}                                # print out resource usage
cat ${OutputFile} >> water.log                      # collect Gaussian outputs into one file
echo -e "\n\n====== Gaussian calculation on job ${SLURM_JOBID} stops. ======\n\n" >> water.log
echo "The Gaussian calculation has not completed. Submit another job to keep doing it."
sbatch water.sb                                     # submit another job
scancel ${SLURM_JOBID}  )&                          # job stops if g16 command is not finished

# Whether this is a restart job or not
if [ -f water.rwf ] && [ -f water.chk ]; then
   InputFile="restart.gjf"
else
   InputFile="water.gjf"
fi

g16 < ${InputFile} > ${OutputFile}

# The following commands are executed only if the g16 command completes.
# Print out resource usage
js -j $SLURM_JOB_ID           ### powertools command

cat ${OutputFile} >> water.log 
echo -e "\n\n====== Gaussian calculation is completed on job ${SLURM_JOBID}. ======\n\n" >> water.log

where a background script, enclosed in ( ... ) &, is added to keep submitting jobs.

Once the job starts, the background script runs at the same time as the foreground script. The background script first sleeps for 3 hours and 55 minutes. During this time, the foreground script runs the Gaussian calculation, or restarts the previous calculation if the checkpointing files water.rwf and water.chk exist. Five minutes before the end of the job, the background script wakes up, prints the resource usage, and appends the Gaussian output to water.log. If the g16 command has not completed by then, it submits another job with sbatch and stops the current job with scancel. If the g16 command finishes before the background script wakes up, the job executes the remaining commands after the g16 line and ends. No more jobs are submitted.
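The sleep time in the background script is the walltime minus the BeforeEnd margin. As a sketch of that arithmetic, the snippet below derives it from an HH:MM:SS string; the function walltime_to_seconds and the variables WallTime and SleepSeconds are illustrative names that do not appear in water.sb.

```shell
#!/bin/bash
# Convert an HH:MM:SS walltime string to seconds.
# awk treats fields such as "04" as the number 4, so leading zeros are safe.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

WallTime="04:00:00"   # assumed to match the #SBATCH --time value
BeforeEnd=300         # seconds reserved for resubmission, as in water.sb

SleepSeconds=$(( $(walltime_to_seconds "$WallTime") - BeforeEnd ))
echo "Background script sleeps for ${SleepSeconds} s"
printf 'That is %d h %d min\n' $(( SleepSeconds / 3600 )) $(( SleepSeconds % 3600 / 60 ))
```

For a 4-hour walltime this gives 14400 - 300 = 14100 seconds, the 3 hours and 55 minutes described above. If the --time value is changed, the expression inside the sleep command has to change with it.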

Since the rwf file usually takes a lot of disk space, it is suggested to run checkpointing jobs in scratch space so that your home or research space does not go over quota. Create a directory in your scratch space, copy all files (water.gjf, restart.gjf and water.sb) there, and submit the job script from that directory. Please check your job status frequently. Make sure to copy necessary files back to your home or research directory from time to time, since files on scratch are purged if they have not been modified for 45 days.

Note

The sleep time of the background script must be longer than the time needed for one cycle of the Gaussian calculation. Checkpointing is done between cycles, so a job that stops before finishing a cycle restarts from the same point as the previous run.
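One rough way to see whether each job gets through at least one cycle is to count the optimization steps in its log. The sketch below assumes that a Gaussian geometry optimization log reports each step on a line containing the text "Step number"; the helper count_opt_steps is a hypothetical name, and the pattern should be checked against your own output first.

```shell
#!/bin/bash
# Hypothetical helper: count geometry optimization steps recorded in one log.
# Assumption: each step is reported on a line containing "Step number".
count_opt_steps() {
    # grep -c prints 0 but returns non-zero when nothing matches,
    # so the exit status is ignored here.
    grep -c 'Step number' "$1" || true
}

# Report the step count for every per-job log in the current directory.
for log in water-*.log; do
    [ -e "$log" ] || continue      # skip the literal pattern when no log exists
    steps=$(count_opt_steps "$log")
    echo "${log}: ${steps} optimization step(s)"
    if [ "$steps" -eq 0 ]; then
        echo "  warning: no step was recorded in this job"
    fi
done
```

A job that reports zero steps suggests that the walltime is too short for one cycle of this system, in which case resubmitting would not make progress.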