Authorship

This guide was written by Siddak Marwaha (ICER student intern from MSU Astrophysics and Data Science, Spring 2023).

TensorFlow GPU Usage

Introduction

HPCC provides GPU resources for machine learning tasks. GPUs can accelerate the training and inference of deep learning models, allowing for faster experimentation and better performance. TensorFlow is a popular open-source machine learning framework that supports GPU acceleration. This guide will walk you through the steps of utilizing GPU resources on HPCC using TensorFlow.

Setup

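Ensure you have the latest TensorFlow GPU release installed. The following check prints the number of GPUs TensorFlow can see; a count of 0 means TensorFlow is not detecting a GPU: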

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

TensorFlow can run computations on several types of devices, including CPUs and GPUs. Each device is identified by a string name: /device:CPU:0 is the CPU of the machine, /GPU:0 is shorthand for the first GPU visible to TensorFlow, /job:localhost/replica:0/task:0/device:GPU:1 is the fully qualified name of the second visible GPU, and so on.

When an operation has both CPU and GPU implementations, TensorFlow places it on the GPU by default. For example, if both a CPU and a GPU are available, the operation runs on the GPU unless you explicitly request the CPU. If an operation has no GPU implementation, it falls back to the CPU, even if you requested the GPU.

Logging device placement

The following code sets a TensorFlow option to log the device used to run each operation, then creates two matrices (a and b) and multiplies them using TensorFlow's built-in matrix multiplication function (tf.matmul). The result of the multiplication is printed to the console. By setting the log option, we can see which device (CPU or GPU) is used to perform the computation:

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

When a GPU is used, the expected output is:

Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
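
Device placement logs grow quickly once a model has many operations. As a rough aid, the sketch below tallies logged operations per device from lines in the format shown above. The helper name summarize_placement is hypothetical, and the regular expression assumes every relevant line follows the "Executing op ... in device ..." pattern; other TensorFlow versions may format the log differently.

```python
import re
from collections import Counter

# Matches lines such as:
# Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
LOG_PATTERN = re.compile(r"Executing op (\S+) in device \S*device:(\w+:\d+)")

def summarize_placement(log_text):
    """Count logged operations per device (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            # Group 2 is the short device name, e.g. "GPU:0"
            counts[match.group(2)] += 1
    return counts

sample_log = """\
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
"""

print(dict(summarize_placement(sample_log)))  # {'GPU:0': 3}
```

A count under CPU:0 for operations you expected on a GPU is a hint to check device visibility or placement.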

Manual device placement

If you want to choose a specific device for an operation instead of letting TensorFlow select one automatically, use tf.device as a context manager (with tf.device(...)). All operations created inside the block run on the device you name. Operations outside the block follow the default placement, which prefers a GPU if one is available and configured properly.

Again, the following code enables device placement logging, so TensorFlow prints the assigned device for each operation. It then creates two constant tensors, a and b, on the CPU inside a with tf.device('/CPU:0') block. Finally, it multiplies a and b outside the block. Because no device is specified for tf.matmul, TensorFlow chooses one itself (the GPU, when one is available) and copies the tensors between devices as needed:

tf.debugging.set_log_device_placement(True)

# Place tensors on the CPU
with tf.device('/CPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Run on the GPU
c = tf.matmul(a, b)
print(c)

Limiting TF to certain GPUs

By default, TensorFlow reserves memory on every GPU that is visible to the process. If you want to use only certain GPUs, call tf.config.set_visible_devices to limit TensorFlow to them. GPUs that are not visible to TensorFlow are left untouched.

The following code checks whether any GPUs are available by listing the physical devices. If any are found, it restricts TensorFlow to the first GPU by making it the only visible GPU, then lists the logical devices to confirm the result. Visible devices must be set before the GPUs are initialized; if they are set too late, TensorFlow raises a RuntimeError, which the code catches and prints:

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)

Example output:

1 Physical GPUs, 1 Logical GPU
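
Another common way to limit GPU visibility is the CUDA_VISIBLE_DEVICES environment variable, which the CUDA runtime reads when it initializes. The sketch below is an illustration under assumptions, not a statement of HPCC policy: limit_visible_gpus is a hypothetical helper, the variable has to be set before TensorFlow initializes the GPUs (in practice, before import tensorflow), and on a scheduler-managed cluster the job system may already set it for you, in which case overriding it may not be appropriate.

```python
import os

def limit_visible_gpus(gpu_ids, env=os.environ):
    """Set CUDA_VISIBLE_DEVICES so only the given GPU IDs are visible.

    Hypothetical helper. To affect TensorFlow, call it on os.environ
    before TensorFlow is imported.
    """
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env["CUDA_VISIBLE_DEVICES"]

# Demonstrate on a plain dict so the real environment is left untouched
demo_env = {}
limit_visible_gpus([0], env=demo_env)
print(demo_env)  # {'CUDA_VISIBLE_DEVICES': '0'}
```

With this approach the selected GPUs are renumbered from 0 inside the process, so the first visible GPU is still addressed as GPU:0.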

Note

Physical vs Logical devices: physical devices are the actual hardware components, such as a GPU or CPU, that are present in the system. Logical devices are the virtual representations of those physical devices that TensorFlow exposes for computation. When TensorFlow is initialized on a machine with GPUs, it detects the available physical devices and, by default, creates one logical device for each visible physical device. A physical device can also be split into several logical devices, each with a subset of the physical device's memory. TensorFlow places operations on logical devices rather than directly on physical ones.

Using a single GPU on a multi-GPU system

If you have multiple GPUs, TensorFlow uses the one with the lowest ID by default. To use a different GPU, name it explicitly with tf.device. The following code attempts a matrix multiplication between two constant tensors on a non-existent device, /device:GPU:4, and calls tf.debugging.set_log_device_placement(True) to log where operations run. When soft device placement is disabled, requesting a device that does not exist raises a RuntimeError, which the code catches and prints. If the program runs without an error, the log shows which device the operation actually ran on.

tf.debugging.set_log_device_placement(True)

try:
  # Specify an invalid GPU device
  with tf.device('/device:GPU:4'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
except RuntimeError as e:
  print(e)

You can use tf.config.set_soft_device_placement(True) to instruct TensorFlow to automatically choose a supported device to run the operations in case the specified device is not available. This can help make your code more flexible and robust in case the availability of GPU devices changes over time.
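
The difference between strict and soft placement can be pictured with a toy model in plain Python. This is not TensorFlow code: resolve_device is a hypothetical function, and the fallback rule (take the first device in the available list) is an assumption made only for illustration.

```python
def resolve_device(requested, available, soft_placement):
    """Toy model of device resolution; not how TensorFlow is implemented."""
    if requested in available:
        return requested
    if soft_placement:
        # Assumption for illustration: fall back to the first available device
        return available[0]
    raise RuntimeError(f"{requested} unknown device")

available = ["GPU:0", "CPU:0"]

# Soft placement: the missing GPU:4 is replaced by an existing device
print(resolve_device("GPU:4", available, soft_placement=True))

# Strict placement: the same request fails
try:
    resolve_device("GPU:4", available, soft_placement=False)
except RuntimeError as e:
    print(e)
```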

Note

Eager vs Graph execution modes: since TensorFlow 2.0, eager execution is the default, and soft device placement is enabled by default in eager mode. Therefore, when the above code snippet runs in eager mode, you won't get an error even without calling tf.config.set_soft_device_placement(True). However, for complex model training, graph execution has the advantages of being faster, more flexible, and more robust. If you opt to use it, you will need to enable soft device placement.

The first two lines of the following code enable TensorFlow to choose a device to run operations on, and then log where each operation is executed:

tf.config.set_soft_device_placement(True)
tf.debugging.set_log_device_placement(True)

# Creates some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)

Using multiple GPUs

Developing machine learning models to work with multiple GPUs allows the model to use additional resources and potentially scale better. However, if you only have a single GPU available, you can still simulate multiple GPUs using virtual devices. This makes it easier to test and develop for multi-GPU setups without needing additional physical GPUs.

The following code creates two virtual GPUs with 1 GB of memory each. It first lists the physical GPUs with tf.config.list_physical_devices('GPU'). If any are found, it calls tf.config.set_logical_device_configuration() to create two logical devices on the first physical GPU, each with a memory limit of 1024 MB (1 GB). It then lists the logical GPUs with tf.config.list_logical_devices('GPU') and prints the number of physical and logical GPUs. Virtual devices must be configured before the GPUs are initialized; otherwise a RuntimeError is raised, and the code prints its message:

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Create 2 virtual GPUs with 1GB memory each
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
         tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

Output:

1 Physical GPU, 2 Logical GPUs

This output indicates that there is one physical GPU available on the system, and two logical GPUs have been created on that physical GPU using virtual devices. Each logical GPU has a memory limit of 1GB. The code successfully created the virtual devices without any errors.

Once there are multiple logical GPUs available to the runtime, you can utilize the multiple GPUs with tf.distribute.Strategy or with manual placement.

Using tf.distribute.Strategy

The best practice for using multiple GPUs is to use tf.distribute.Strategy. The following code sets up a mirrored strategy for training a neural network model on multiple GPUs. It first enables device placement logging, then lists the logical GPUs available to the runtime and creates a MirroredStrategy object over them. Inside the with strategy.scope() block, it defines the model architecture and compiles the model with a mean squared error loss and a stochastic gradient descent optimizer:

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))

When a model built this way is trained, each GPU runs a copy of the model on its own share of the input data, so the data is processed faster than on a single GPU. This approach is called "data parallelism".
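
Data parallelism can be illustrated without TensorFlow. The plain-Python sketch below is a conceptual model only, not how MirroredStrategy is implemented: it assumes a one-weight linear model y = w * x + b (the same shape as the Dense(1) layer above), splits a batch into equal shards, computes mean squared error gradients per shard, and averages them. The function mse_gradients and all data values are made up for the example.

```python
def mse_gradients(w, b, xs, ys):
    """Gradients of mean squared error for the model y = w * x + b."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2 * e for e in errors) / n
    return grad_w, grad_b

# Made-up batch and starting parameters
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, b = 0.5, 0.0
num_replicas = 2

# Each replica receives an equal shard of the batch
shard = len(xs) // num_replicas
per_replica = [
    mse_gradients(w, b, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
    for i in range(num_replicas)
]

# Average the per-replica gradients, then compare with the full-batch result
avg_w = sum(g[0] for g in per_replica) / num_replicas
avg_b = sum(g[1] for g in per_replica) / num_replicas
full_w, full_b = mse_gradients(w, b, xs, ys)

print(avg_w, avg_b)
print(full_w, full_b)
```

With equal-sized shards the averaged gradients match the full-batch gradients, which is why splitting the batch across replicas does not change the update the optimizer applies.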

Manual placement

tf.distribute.Strategy is a tool that allows you to replicate your model on multiple devices, which can improve performance. You can also achieve the same thing manually by building your model on each device. The following program demonstrates how to manually replicate computation across multiple GPUs. It creates copies of a matrix multiplication operation on each available GPU and then adds the results of those computations on the CPU to obtain the final result. It also uses tf.debugging.set_log_device_placement(True) to print the placement of each operation to the console for debugging purposes:

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
if gpus:
  # Replicate your computation on multiple GPUs
  c = []
  for gpu in gpus:
    with tf.device(gpu.name):
      a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
      b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
      c.append(tf.matmul(a, b))
  with tf.device('/CPU:0'):
    matmul_sum = tf.add_n(c)
  print(matmul_sum)

Example output (from a system with four GPUs):

Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:1
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:1
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:1
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:2
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:2
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:2
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:3
Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:3
Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:3
Executing op AddN in device /job:localhost/replica:0/task:0/device:CPU:0
tf.Tensor(
[[ 88. 112.]
 [196. 256.]], shape=(2, 2), dtype=float32)
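
The final tensor can be checked by hand: each GPU computes the same 2x2 product, and tf.add_n sums one copy per GPU. The plain-Python sketch below reproduces that arithmetic without TensorFlow, assuming four GPUs as in the log above. The functions matmul and add_n here are stand-ins written for this check, not the TensorFlow functions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_n(matrices):
    """Element-wise sum of equally shaped matrices."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*matrices)]

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

num_gpus = 4  # assumption: matches GPU:0 through GPU:3 in the example log
per_gpu = [matmul(a, b) for _ in range(num_gpus)]

print(per_gpu[0])      # [[22.0, 28.0], [49.0, 64.0]]
print(add_n(per_gpu))  # [[88.0, 112.0], [196.0, 256.0]]
```

On a system with a different number of GPUs, the summed result scales accordingly; with two logical GPUs it would be twice the single product.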