
Submitting a TensorFlow job

After you've installed TensorFlow in your conda environment, you can submit a TensorFlow job to the cluster. To make use of GPU computing, request a GPU in the job script with the --gpus directive.

As an example, the TensorFlow Python script, matmul.tf2.py, is shown below:

import tensorflow as tf
tf.debugging.set_log_device_placement(True)
tf.config.set_soft_device_placement(True)

with tf.device('/device:GPU:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
print(c)

Because tf.config.set_soft_device_placement is turned on, this code will still run even if the job is assigned a CPU-only node, or a node without a third GPU (the script pins the computation to /device:GPU:2). TensorFlow transparently falls back to an available device, so the multiplication is carried out on the CPU in that case.
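The matrices in matmul.tf2.py are small enough to verify by hand, which is useful for sanity-checking the job's output. A plain-Python sketch of the same product (no TensorFlow required):

```python
# Plain-Python check of the [2, 3] x [3, 2] product from matmul.tf2.py.

def matmul(a, b):
    """Multiply two matrices given as nested lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # shape [2, 3]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape [3, 2]

print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]
```

The TF job's output file should show the same values inside the printed tensor.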

Now, let's write our SLURM job script, testTF.sbatch, which contains the following:

#!/bin/bash

# Job name:
#SBATCH --job-name=test_matmul
#
# Request GPU:
#SBATCH --gpus=v100:1
#
# Memory:
#SBATCH --mem-per-cpu=20G
#
# Wall clock limit (minutes or hours:minutes or days-hours):
#SBATCH --time=20
#
# Standard out and error:
#SBATCH --output=%x-%j.SLURMout

export PATH=/mnt/home/user123/anaconda3/bin:$PATH # this is just an example PATH; use your own conda installation

conda activate tf_gpu_Feb2023 # again, activate your own TF env

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/lib/:/lib64/:$CONDA_PREFIX/lib/:$CONDA_PREFIX/lib/python3.9/site-packages/tensorrt_libs
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib

python matmul.tf2.py

conda deactivate

To submit the job, run sbatch testTF.sbatch from the command line. Once the job finishes, the output (including the result of the matrix multiplication) will be written to the file test_matmul-<jobid>.SLURMout.
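The output filename follows the pattern given by the --output directive: %x expands to the job name and %j to the job ID assigned at submission. A quick shell illustration of the expansion (the job ID 123456 is made up):

```shell
# Slurm expands %x to the job name and %j to the job ID in --output=%x-%j.SLURMout.
JOB_NAME=test_matmul   # from #SBATCH --job-name=test_matmul
JOB_ID=123456          # illustrative; Slurm assigns the real ID at submission

OUTPUT_FILE="${JOB_NAME}-${JOB_ID}.SLURMout"
echo "$OUTPUT_FILE"    # test_matmul-123456.SLURMout
```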