# Using BLAS and FlexiBLAS to speed up linear algebra operations

## What is BLAS?

### Overview
BLAS (Basic Linear Algebra Subprograms) is used to perform linear algebra operations (like vector addition, matrix multiplication, etc.) on computers. Since these operations are the basic building blocks of most numerical algorithms, their performance can dramatically affect the speed of most programs.
### Implementations
BLAS is actually a "specification", meaning it is only a list of possible operations and a common syntax for calling them. There are then different "implementations" of BLAS that determine how the operations are carried out, and these are generally highly optimized. Commonly used implementations include:

- OpenBLAS
- BLIS
- the Intel Math Kernel Library (IMKL)

Each of these implementations approaches optimizing linear algebra operations differently and can achieve different performance in different situations. There is also a "reference implementation" of BLAS provided by Netlib, the organization that maintains the BLAS specification. The reference implementation is meant to demonstrate how a BLAS implementation should behave rather than to be fast, and is not recommended in practice.
BLAS implementations are usually installed as a static (`.a` file) or dynamic (`.so` file) library that other programs can then "link to" when they are compiled.
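For example, on the HPCC (which uses EasyBuild-style modules), loading a BLAS module makes these library files visible. The `$EBROOTOPENBLAS` variable below is an EasyBuild convention and is an assumption here, not something specific to BLAS:

```bash
module load OpenBLAS
# $EBROOTOPENBLAS points at the install root on EasyBuild-based systems;
# the lib directory typically contains both libopenblas.a and libopenblas.so
ls "$EBROOTOPENBLAS/lib"
```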
### How BLAS is used in practice
Code can be written that interfaces with BLAS directly, but it is more common to use software that calls BLAS for you. For example, the `numpy` package in Python performs its linear algebra operations with a BLAS implementation when one is available. Similarly, most linear algebra operations in R and MATLAB call a BLAS implementation behind the scenes.

Note that even if you're not doing linear algebra operations directly, there's a very good chance that code you're using does somewhere along the way!
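If you want to check which BLAS your tools were built against, both `numpy` and R can report it themselves (the exact output format varies by version):

```bash
# numpy prints its build configuration, including the BLAS it links to
python -c "import numpy; numpy.show_config()"

# R's sessionInfo() includes the BLAS/LAPACK libraries in use
Rscript -e "sessionInfo()"
```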
## What is FlexiBLAS?

FlexiBLAS is not a BLAS implementation itself; it is a thin wrapper that makes it easy to switch between implementations without recompiling your code.
### Compiling with BLAS
When a piece of code uses BLAS, it usually needs to link to an implementation. This example links `mycode.c` on the HPCC against the OpenBLAS dynamic library:

```bash
module load GCC OpenBLAS
gcc -lopenblas mycode.c -o mycode
```
You could also compile the same code with the Intel Math Kernel Library (IMKL); note the extra options, which are specified by Intel:

```bash
module load GCC imkl
gcc -m64 -lmkl_rt -Wl,--no-as-needed -lpthread -lm -ldl mycode.c -o mycode
```

The compiled code is then fixed to use that implementation of BLAS.
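One way to confirm which library an executable is linked against is `ldd`, which lists its shared-library dependencies (a quick check, assuming a dynamically linked build):

```bash
# Filter the executable's shared-library dependencies for BLAS
ldd ./mycode | grep -i blas
```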
### Compiling with FlexiBLAS
FlexiBLAS lets you swap between multiple implementations. You compile the code once with FlexiBLAS as the "implementation", which is really just a layer in the middle that forwards calls to the real implementation in the background:

```bash
module load GCC FlexiBLAS
gcc -lflexiblas mycode.c -o mycode
```
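Running the same `ldd` check on this build should show the FlexiBLAS wrapper library rather than a concrete backend (exact library names vary by system):

```bash
# The executable depends on the FlexiBLAS wrapper, not a specific BLAS
ldd ./mycode | grep -i flexiblas
```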
## Using FlexiBLAS
When code is compiled with FlexiBLAS, you can swap BLAS implementations (or, as FlexiBLAS calls them, "backends") at runtime. For more information about FlexiBLAS not covered here, try running `flexiblas help` or see the FlexiBLAS documentation.
### Listing available BLAS backends
When ICER installs FlexiBLAS and the various BLAS implementations, we tell FlexiBLAS which backends are available. You can list them with `flexiblas list`:

```
System-wide:

System-wide (config directory):
  BLIS
    library = libflexiblas_blis.so
    comment =
  IMKL
    library = libflexiblas_imkl_gnu_thread.so
    comment =
  IMKL_SEQ
    library = libflexiblas_imkl_sequential.so
    comment =
  NETLIB
    library = libflexiblas_netlib.so
    comment =
  OPENBLAS
    library = libflexiblas_openblas.so
    comment =

User config:

Host config:

Environment config:
```
The default backend is set by ICER to be `OPENBLAS`.
### Swapping BLAS backends at runtime
Assuming the `mycode` executable has been compiled with FlexiBLAS, running it as-is will use the default backend (OpenBLAS):

```bash
./mycode
```

To use a different backend, set the environment variable `FLEXIBLAS` to that backend's name:

```bash
FLEXIBLAS=IMKL ./mycode
```
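The inline assignment above applies only to that single command. To use a backend for everything started from the current shell session, export the variable instead (standard shell behavior, nothing FlexiBLAS-specific):

```bash
# Every FlexiBLAS-linked program run from this shell now uses IMKL
export FLEXIBLAS=IMKL
./mycode
```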
**Make sure the module containing that implementation is loaded.** FlexiBLAS can't tell whether a backend is actually installed and working until it tries to run it. In the example above using `FLEXIBLAS=IMKL`, you would first need to load the `imkl` module that provides the required libraries:

```bash
module load imkl
```

If the BLAS backend you request is not loaded, FlexiBLAS will fall back to the (very slow) reference implementation.
You can also pass in the location of a BLAS implementation's dynamic library file:

```bash
FLEXIBLAS=myblas.so ./mycode
```
### Setting a default FlexiBLAS backend
If you prefer to always use a backend other than OpenBLAS, you can set your local configuration with:

```bash
flexiblas default <IMPLEMENTATION>
```

For example,

```bash
flexiblas default IMKL
```

will always use the Intel Math Kernel Library. Note that you still need to load the `imkl` module so that FlexiBLAS can find the library (see the note above)!
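Putting the two pieces together, a typical job-script pattern might look like the sketch below. The `flexiblas default` command only needs to be run once, since it is stored in your user configuration, while the `module load` is needed in every new session:

```bash
# One-time setup: make IMKL your account-wide default backend
flexiblas default IMKL

# Needed in every session or job script that runs FlexiBLAS-linked code
module load imkl
./mycode
```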
## Benchmarking BLAS Implementations with FlexiBLAS

### Using Python
As mentioned above, `numpy`'s linear algebra operations are performed using BLAS. ICER has installed `numpy` with FlexiBLAS, so you can choose your BLAS backend.
#### Benchmark script
The results below were run on `dev-amd20` and computed using the following script:

```bash
#!/bin/bash

module purge
module load SciPy-bundle imkl BLIS

BACKENDS="OPENBLAS BLIS IMKL IMKL_SEQ NETLIB"

for BACKEND in $BACKENDS; do
    echo $BACKEND
    export FLEXIBLAS=$BACKEND
    time python -c "import numpy as np; x = np.random.randn(5000, 5000); x @ x.T"
done
```
This script computes the cross product (the matrix times its transpose) of a 5000-by-5000 matrix of normally distributed random values. Note that we need to load the `imkl` and `BLIS` modules in addition to `SciPy-bundle` to use those backends with FlexiBLAS. `FlexiBLAS` and its default backend `OpenBLAS` are loaded as dependencies of `SciPy-bundle` and do not need to be specified.
#### Results

The output from one run of the script is copied below:
```
OPENBLAS

real    0m6.410s
user    0m6.159s
sys     0m0.170s

BLIS

real    0m6.733s
user    0m6.587s
sys     0m0.132s

IMKL

real    0m1.606s
user    0m7.572s
sys     0m0.338s

IMKL_SEQ

real    0m6.633s
user    0m6.475s
sys     0m0.116s

NETLIB

real    1m2.583s
user    1m2.442s
sys     0m0.120s
```
The `real` time is the time you would wait for the code to run, whereas the `user` time is the total time the process spent running on the CPUs. Note that the `real` time can be less than the `user` time when code runs in parallel (as in the case of `IMKL`: roughly 7.6 s of CPU time finished in about 1.6 s of wall time, indicating several threads working at once).
The `real` time shows that for this specific cross-product example, `IMKL` is fastest; `OPENBLAS`, `BLIS`, and `IMKL_SEQ` (the non-parallel version of `IMKL`) are comparable; and `NETLIB` is much slower.
Note that this is only one small example with no extra configuration; results will vary with the type of operation and the hardware used. We encourage you to explore the benefits of each backend, learn how each can be configured for your situation, and experiment with different options.
For example, `BLIS` supports parallelization similar to `IMKL`'s via an extra environment variable, producing a similar result:

```bash
export BLIS_NUM_THREADS=8 && export FLEXIBLAS=BLIS && time python -c "import numpy as np; x = np.random.randn(5000, 5000); x @ x.T"
```

```
real    0m2.010s
user    0m7.302s
sys     0m0.202s
```
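OpenBLAS can usually be parallelized the same way through its own `OPENBLAS_NUM_THREADS` environment variable (an OpenBLAS convention rather than a FlexiBLAS feature; we have not benchmarked it here):

```bash
export OPENBLAS_NUM_THREADS=8 && export FLEXIBLAS=OPENBLAS && time python -c "import numpy as np; x = np.random.randn(5000, 5000); x @ x.T"
```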
### Using R

As mentioned above, R's linear algebra operations are performed using BLAS. ICER has installed R with FlexiBLAS, so you can choose your BLAS backend.
#### Benchmark script
The results below were run on `dev-amd20` and computed using the following script:

```bash
#!/bin/bash

module purge
module load R imkl BLIS

BACKENDS="OPENBLAS BLIS IMKL IMKL_SEQ NETLIB"

for BACKEND in $BACKENDS; do
    echo $BACKEND
    FLEXIBLAS=$BACKEND Rscript -e "system.time({x <- replicate(5e3, rnorm(5e3)); tcrossprod(x)})"
done
```
This script computes the cross product (the matrix times its transpose) of a 5000-by-5000 matrix of normally distributed random values. Note that we need to load the `imkl` and `BLIS` modules in addition to `R` to use those backends with FlexiBLAS. `FlexiBLAS` and its default backend `OpenBLAS` are loaded as dependencies of `R` and do not need to be specified.
#### Results

The output from one run of the script is copied below:
```
OPENBLAS
   user  system elapsed
  7.512   0.497   8.009
BLIS
   user  system elapsed
  7.728   0.307   8.035
IMKL
   user  system elapsed
  9.158   0.451   3.119
IMKL_SEQ
   user  system elapsed
  7.765   0.312   8.078
NETLIB
   user  system elapsed
 50.242   0.481  50.732
```
The `elapsed` time is the time you would wait for the code to run, whereas the `user` time is the total time the process spent running on the CPUs. Note that the `elapsed` time can be less than the `user` time when code runs in parallel (as in the case of `IMKL`).
The `elapsed` time shows that for this specific cross-product example, `IMKL` is fastest; `OPENBLAS`, `BLIS`, and `IMKL_SEQ` (the non-parallel version of `IMKL`) are comparable; and `NETLIB` is much slower.
Note that this is only one small example with no extra configuration; results will vary with the type of operation and the hardware used. We encourage you to explore the benefits of each backend, learn how each can be configured for your situation, and experiment with different options.
For example, `BLIS` supports parallelization similar to `IMKL`'s via an extra environment variable, producing a similar result:

```bash
BLIS_NUM_THREADS=8 FLEXIBLAS=BLIS Rscript -e "system.time({x <- replicate(5e3, rnorm(5e3)); tcrossprod(x)})"
```

```
   user  system elapsed
  8.769   0.369   3.381
```