
Submitting a TensorFlow job

We assume that you've installed a TensorFlow virtual environment on a GPU dev node (dev-intel16-k80). The Python script we are going to run in the SLURM job is named matmul.py, with the following content:

matmul.py

import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = sys.argv[1]  # Choose device from cmd line. Options: gpu or cpu
shape = (int(sys.argv[2]), int(sys.argv[2]))
if device_name == "gpu":
    device_name = "/gpu:0"  # the job requests one GPU, so only index 0 is visible
else:
    device_name = "/cpu:0"

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

print("\n" * 5)
print("Shape:", shape, "Device:", device_name)
print("Time taken:", datetime.now() - startTime)
print("\n" * 5)
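As a quick sanity check before submitting, the same computation can be reproduced in plain NumPy on the CPU (a minimal sketch; no TensorFlow or GPU required):

```python
import numpy as np
from datetime import datetime

shape = (1500, 1500)

start = datetime.now()
a = np.random.uniform(0, 1, size=shape)      # same uniform [0, 1) random matrix
result = (a @ a.T).sum()                     # sum of A x A^T, as in matmul.py
print(result)
print("Time taken:", datetime.now() - start)
```

Comparing this CPU timing against the GPU timing reported in the job output gives a rough idea of the speedup for a given matrix size.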

Now we write a SLURM script to submit the job to the cluster:

Submitting GPU TensorFlow job: test_matmul.sbatch

#!/bin/bash

#SBATCH --job-name=test_matmul
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=20G
#SBATCH --time=20
#SBATCH --output=%x-%j.SLURMout

echo $CUDA_VISIBLE_DEVICES

module purge
module load GCC/6.4.0-2.28  OpenMPI/2.1.2
module load CUDA/10.0.130 cuDNN/7.5.0.56-CUDA-10.0.130
module load Python/3.6.4
source ~/tf-1.13.1-env/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2 # disables the warning, doesn't enable AVX/FMA.

srun python matmul.py gpu 1500

To submit it, simply run

sbatch test_matmul.sbatch

The final result will be written to the file "test_matmul-<jobid>.SLURMout".
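In the --output directive, SLURM expands %x to the job name and %j to the numeric job id when it creates the output file. A quick sketch of the expansion (123456 is a hypothetical job id):

```shell
# --output=%x-%j.SLURMout: %x is the job name, %j the job id assigned at submission
jobname=test_matmul
jobid=123456
echo "${jobname}-${jobid}.SLURMout"   # -> test_matmul-123456.SLURMout
```

You can check on a pending or running job with squeue -u $USER, using the job id printed by sbatch.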