
Submitting a TensorFlow job

We assume that you've installed a TensorFlow virtual environment on a GPU dev node (dev-intel16-k80). The Python script we are going to run in the SLURM job is named matmul.py, with the following content:

matmul.py

import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = sys.argv[1]  # Choose device from cmd line. Options: gpu or cpu
shape = (int(sys.argv[2]), int(sys.argv[2]))
if device_name == "gpu":
    device_name = "/gpu:0"  # the job requests one GPU, so only index 0 is visible
else:
    device_name = "/cpu:0"

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

print("\n" * 5)
print("Shape:", shape, "Device:", device_name)
print("Time taken:", datetime.now() - startTime)
print("\n" * 5)
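As a quick sanity check before submitting, the same computation can be reproduced in plain NumPy on the CPU (a minimal sketch; no TensorFlow or GPU required):

```python
import numpy as np
from datetime import datetime

shape = (1500, 1500)

start = datetime.now()
a = np.random.uniform(0, 1, size=shape)      # same uniform [0, 1) random matrix
result = (a @ a.T).sum()                     # sum of A x A^T, as in matmul.py
print(result)
print("Time taken:", datetime.now() - start)
```

Comparing this CPU timing against the GPU timing reported in the job output gives a rough idea of the speedup for a given matrix size.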

Now we write a SLURM script to submit the job to the cluster:

Submitting GPU TensorFlow job: test_matmul.sbatch

#!/bin/bash

#SBATCH --job-name=test_matmul
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=20G
#SBATCH --time=20
#SBATCH --output=%x-%j.SLURMout

echo $CUDA_VISIBLE_DEVICES

module purge
module load GCC/6.4.0-2.28  OpenMPI/2.1.2
module load CUDA/10.0.130 cuDNN/7.5.0.56-CUDA-10.0.130
module load Python/3.6.4
source ~/tf-1.13.1-env/bin/activate
export TF_CPP_MIN_LOG_LEVEL=2 # disables the warning, doesn't enable AVX/FMA.

srun python matmul.py gpu 1500

To submit it, simply run

sbatch test_matmul.sbatch

The final result will be written to the file "test_matmul-<jobid>.SLURMout".
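In the --output directive, SLURM expands %x to the job name and %j to the numeric job id when it creates the output file. A quick sketch of the expansion (123456 is a hypothetical job id):

```shell
# --output=%x-%j.SLURMout: %x is the job name, %j the job id assigned at submission
jobname=test_matmul
jobid=123456
echo "${jobname}-${jobid}.SLURMout"   # -> test_matmul-123456.SLURMout
```

You can check on a pending or running job with squeue -u $USER, using the job id printed by sbatch.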