After you've installed TF in your conda environment, you can submit a TF job to the cluster. To make use of GPU computing, you'll need to request a GPU node in the job script through the --gres directive.
As an example, consider a small TF Python script, matmul.tf2.py, that multiplies two matrices on the GPU.
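A minimal sketch of what such a script could look like is shown below; the matrix sizes and the device-placement logging line are illustrative choices, not part of the original example:

# matmul.tf2.py (sketch; matrix sizes below are arbitrary)
import tensorflow as tf

# Allow TF to fall back to another device if the requested one is unavailable
tf.config.set_soft_device_placement(True)
# Optional: log which device each operation actually runs on
tf.debugging.set_log_device_placement(True)

# Ask for the first GPU; with soft placement on, this falls back to the CPU
with tf.device('/GPU:0'):
    a = tf.random.uniform((2000, 2000))
    b = tf.random.uniform((2000, 2000))
    c = tf.matmul(a, b)

print(c)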
Because soft device placement is enabled via tf.config.set_soft_device_placement(True), the script will still run even if the job is assigned a CPU-only node; in that case the multiplication step is simply carried out on the CPU.
Now, let's write our SLURM job script, testTF.sbatch, which contains the following:
#!/bin/bash
# Job name:
#SBATCH --job-name=test_matmul
#
# Request GPU:
#SBATCH --gres=gpu:v100:1
#
# Memory:
#SBATCH --mem-per-cpu=20G
#
# Wall clock limit (minutes or hours:minutes or days-hours):
#SBATCH --time=20
#
# Standard out and error:
#SBATCH --output=%x-%j.SLURMout

export PATH=/mnt/home/user123/anaconda3/bin:$PATH  # this is just an example PATH; use your own conda installation
conda activate tf_gpu_Feb2023  # again, activate your own TF env
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CONDA_PREFIX/lib/python3.9/site-packages/tensorrt
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib

python matmul.tf2.py

conda deactivate
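As a quick sanity check (not part of the original example), you could also have the Python script report whether TensorFlow sees any GPUs before doing the multiplication:

import tensorflow as tf
print("GPUs visible to TensorFlow:", tf.config.list_physical_devices('GPU'))

An empty list usually means the job landed on a node without a GPU, or that the CUDA libraries on the LD_LIBRARY_PATH were not picked up.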
To submit it, run sbatch testTF.sbatch from the command line. The final result will be written to the file test_matmul-<jobid>.SLURMout.
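For example, if the scheduler reports job ID 1234567 (the ID here is just illustrative):

$ sbatch testTF.sbatch
Submitted batch job 1234567
$ cat test_matmul-1234567.SLURMout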