Submitting a TensorFlow job
After you've installed TF in your conda environment, you can submit a TF job to the cluster. To make use of GPU computing, you'll need to request a GPU node in the job script through the --gpus directive.
As an example, the TF Python code, matmul.tf2.py, is shown below:
import tensorflow as tf

# Log which device each operation is placed on
tf.debugging.set_log_device_placement(True)
# Allow TensorFlow to fall back to an existing device if the requested one is unavailable
tf.config.set_soft_device_placement(True)

with tf.device('/device:GPU:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

print(c)
Because tf.config.set_soft_device_placement is turned on, this code will still run even if it is assigned to a CPU-only node; the multiplication will simply be carried out on the CPU. The same setting also lets TensorFlow choose an existing device automatically if the one pinned in the code (here /device:GPU:2) is not present on the allocated node.
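If you want to confirm which devices TensorFlow can actually see on the allocated node, a short check like the one below can be run in the same environment. This is a minimal sketch, not part of the original example; it only uses tf.config.list_physical_devices.

import tensorflow as tf

# List the devices TensorFlow can see on this node.
# On a GPU node the first list should contain at least one entry;
# on a CPU-only node it will be empty.
print("GPUs:", tf.config.list_physical_devices('GPU'))
print("CPUs:", tf.config.list_physical_devices('CPU'))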
Now, let's write our SLURM job script, testTF.sbatch, which contains the following:
#!/bin/bash
# Job name:
#SBATCH --job-name=test_matmul
#
# Request GPU:
#SBATCH --gpus=v100:1
#
# Memory:
#SBATCH --mem-per-cpu=20G
#
# Wall clock limit (minutes or hours:minutes or days-hours):
#SBATCH --time=20
#
# Standard out and error:
#SBATCH --output=%x-%j.SLURMout
echo "This script comes from ICER's TensorFlow example"
# [Insert command to load your Conda] -- see https://docs.icer.msu.edu/Using_conda/
conda activate tf_Jul2024
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib:/mnt/home/user123/miniforge3/envs/tf_Jul2024/lib/python3.10/site-packages/tensorrt
python matmul.tf2.py
conda deactivate
To submit it, run sbatch testTF.sbatch from the command line. The final result will be written to the file test_matmul-<jobid>.SLURMout.
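For reference, a typical submit-and-check workflow from the command line looks like the sketch below; sbatch and squeue are standard SLURM commands, and <jobid> stands for the job ID that sbatch reports.

# Submit the job script to SLURM
sbatch testTF.sbatch

# Check the status of your jobs in the queue
squeue -u $USER

# After the job finishes, inspect the output file
# (replace <jobid> with the ID reported by sbatch)
cat test_matmul-<jobid>.SLURMout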