# Using BLAS and FlexiBLAS to speed up linear algebra operations

## What is BLAS?

### Overview
BLAS (Basic Linear Algebra Subprograms) is used to perform linear algebra operations (like vector addition, matrix multiplication, etc.) on computers. Since these operations are the basic building blocks of most numerical algorithms, their performance can dramatically affect the speed of most programs.
### Implementations
BLAS is actually a "specification", meaning it is only a list of possible operations and a common syntax for calling them. There are then different "implementations" of BLAS that determine how the operations are carried out, and these are generally highly optimized. Commonly used implementations include:

- OpenBLAS
- BLIS
- the Intel Math Kernel Library (IMKL)

Each of these implementations approaches optimizing linear algebra operations differently and can achieve different performance in different situations. There is also a "reference implementation" of BLAS provided by Netlib, the organization that maintains the BLAS specification. The reference implementation is meant to demonstrate how a BLAS implementation should behave rather than to be fast, and is not recommended in practice.
BLAS implementations are usually installed as a static (`.a` file) or dynamic (`.so` file) library that other programs can then "link to" when they are compiled.
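For example, on the HPCC (which uses EasyBuild-style modules), loading a BLAS module makes these library files visible. The `$EBROOTOPENBLAS` variable below is an EasyBuild convention and is an assumption here, not something specific to BLAS:

```bash
module load OpenBLAS
# $EBROOTOPENBLAS points at the install root on EasyBuild-based systems;
# the lib directory typically contains both libopenblas.a and libopenblas.so
ls "$EBROOTOPENBLAS/lib"
```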
### How BLAS is used in practice
Code can be written that interfaces with BLAS directly, but it is more common to use software that calls BLAS for you. For example, the `numpy` package in Python performs its linear algebra operations with a BLAS implementation when one is available. Similarly, most linear algebra operations in R and MATLAB call a BLAS implementation behind the scenes.

Note that even if you're not doing linear algebra operations directly, there's a very good chance that code you're using does somewhere along the way!
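If you want to check which BLAS your tools were built against, both `numpy` and R can report it themselves (the exact output format varies by version):

```bash
# numpy prints its build configuration, including the BLAS it links to
python -c "import numpy; numpy.show_config()"

# R's sessionInfo() includes the BLAS/LAPACK libraries in use
Rscript -e "sessionInfo()"
```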
## What is FlexiBLAS?

FlexiBLAS is not a BLAS implementation itself; it is a thin wrapper that makes it easy to switch between implementations without recompiling your code.
### Compiling with BLAS
When a piece of code uses BLAS, it usually needs to link to an implementation. This example links `mycode.c` on the HPCC against the OpenBLAS dynamic library:

```bash
module load GCC OpenBLAS
gcc -lopenblas mycode.c -o mycode
```
You could also compile the same code with the Intel Math Kernel Library (IMKL); note the extra options, which are specified by Intel:

```bash
module load GCC imkl
gcc -m64 -lmkl_rt -Wl,--no-as-needed -lpthread -lm -ldl mycode.c -o mycode
```

The compiled code is then fixed to use that implementation of BLAS.
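One way to confirm which library an executable is linked against is `ldd`, which lists its shared-library dependencies (a quick check, assuming a dynamically linked build):

```bash
# Filter the executable's shared-library dependencies for BLAS
ldd ./mycode | grep -i blas
```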
### Compiling with FlexiBLAS
FlexiBLAS lets you swap between multiple implementations. You compile the code once with FlexiBLAS as the "implementation", which is really just a layer in the middle that forwards calls to the real implementation in the background:

```bash
module load GCC FlexiBLAS
gcc -lflexiblas mycode.c -o mycode
```
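Running the same `ldd` check on this build should show the FlexiBLAS wrapper library rather than a concrete backend (exact library names vary by system):

```bash
# The executable depends on the FlexiBLAS wrapper, not a specific BLAS
ldd ./mycode | grep -i flexiblas
```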
## Using FlexiBLAS
When code is compiled with FlexiBLAS, you can swap BLAS implementations (or, as FlexiBLAS calls them, "backends") at runtime. For more information about FlexiBLAS not covered here, try running `flexiblas help` or see the FlexiBLAS documentation.
### Listing available BLAS backends
When ICER installs FlexiBLAS and the various BLAS implementations, we tell FlexiBLAS which backends are available. You can list them with `flexiblas list`:

```
System-wide:

System-wide (config directory):
  BLIS
    library = libflexiblas_blis.so
    comment =
  IMKL
    library = libflexiblas_imkl_gnu_thread.so
    comment =
  IMKL_SEQ
    library = libflexiblas_imkl_sequential.so
    comment =
  NETLIB
    library = libflexiblas_netlib.so
    comment =
  OPENBLAS
    library = libflexiblas_openblas.so
    comment =

User config:

Host config:

Environment config:
```
The default backend is set by ICER to be `OPENBLAS`.
### Swapping BLAS backends at runtime
Assuming the `mycode` executable has been compiled with FlexiBLAS, running it as-is will use the default backend (OpenBLAS):

```bash
./mycode
```

To use a different backend, set the environment variable `FLEXIBLAS` to that backend's name:

```bash
FLEXIBLAS=IMKL ./mycode
```
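The inline assignment above applies only to that single command. To use a backend for everything started from the current shell session, export the variable instead (standard shell behavior, nothing FlexiBLAS-specific):

```bash
# Every FlexiBLAS-linked program run from this shell now uses IMKL
export FLEXIBLAS=IMKL
./mycode
```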
**Make sure the module containing that implementation is loaded.** FlexiBLAS can't tell whether a backend is actually installed and working until it tries to run it. In the example above using `FLEXIBLAS=IMKL`, you would first need to load the `imkl` module that provides the required libraries:

```bash
module load imkl
```

If the BLAS backend you request is not loaded, FlexiBLAS will fall back to the (very slow) reference implementation.
You can also pass in the location of a BLAS implementation's dynamic library file:

```bash
FLEXIBLAS=myblas.so ./mycode
```
### Setting a default FlexiBLAS backend
If you prefer to always use a backend other than OpenBLAS, you can set your local configuration with:

```bash
flexiblas default <IMPLEMENTATION>
```

For example,

```bash
flexiblas default IMKL
```

will always use the Intel Math Kernel Library. Note that you still need to load the `imkl` module so that FlexiBLAS can find the library (see the note above)!
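Putting the two pieces together, a typical job-script pattern might look like the sketch below. The `flexiblas default` command only needs to be run once, since it is stored in your user configuration, while the `module load` is needed in every new session:

```bash
# One-time setup: make IMKL your account-wide default backend
flexiblas default IMKL

# Needed in every session or job script that runs FlexiBLAS-linked code
module load imkl
./mycode
```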
## Benchmarking BLAS Implementations with FlexiBLAS

### Using Python
As mentioned above, `numpy`'s linear algebra operations are performed using BLAS. ICER has installed `numpy` with FlexiBLAS, so you can choose your BLAS backend.
#### Benchmark script
The results below were run on `dev-amd20` and computed using the following script:

```bash
#!/bin/bash

module purge
module load SciPy-bundle imkl BLIS

BACKENDS="OPENBLAS BLIS IMKL IMKL_SEQ NETLIB"

for BACKEND in $BACKENDS; do
    echo $BACKEND
    export FLEXIBLAS=$BACKEND
    time python -c "import numpy as np; x = np.random.randn(5000, 5000); x @ x.T"
done
```
This script computes the cross product (the matrix times its transpose) of a 5000-by-5000 matrix of normally distributed random values. Note that we need to load the `imkl` and `BLIS` modules in addition to `SciPy-bundle` to use those backends with FlexiBLAS. `FlexiBLAS` and its default backend `OpenBLAS` are loaded as dependencies of `SciPy-bundle` and do not need to be specified.
#### Results

The output from one run of the script is copied below:
```
OPENBLAS

real    0m6.410s
user    0m6.159s
sys     0m0.170s

BLIS

real    0m6.733s
user    0m6.587s
sys     0m0.132s

IMKL

real    0m1.606s
user    0m7.572s
sys     0m0.338s

IMKL_SEQ

real    0m6.633s
user    0m6.475s
sys     0m0.116s

NETLIB

real    1m2.583s
user    1m2.442s
sys     0m0.120s
```
The `real` time is the time you would wait for the code to run, whereas the `user` time is the total time the process spent running on the CPUs. Note that the `real` time can be less than the `user` time when code runs in parallel (as in the case of `IMKL`: roughly 7.6 s of CPU time finished in about 1.6 s of wall time, indicating several threads working at once).
The `real` time shows that for this specific cross-product example, `IMKL` is fastest; `OPENBLAS`, `BLIS`, and `IMKL_SEQ` (the non-parallel version of `IMKL`) are comparable; and `NETLIB` is much slower.
Note that this is only one small example with no extra configuration; results will vary with the type of operation and the hardware used. We encourage you to explore the benefits of each backend, learn how each can be configured for your situation, and experiment with different options.
For example, `BLIS` supports parallelization similar to `IMKL`'s via an extra environment variable, producing a similar result:

```bash
export BLIS_NUM_THREADS=8 && export FLEXIBLAS=BLIS && time python -c "import numpy as np; x = np.random.randn(5000, 5000); x @ x.T"
```

```
real    0m2.010s
user    0m7.302s
sys     0m0.202s
```
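OpenBLAS can usually be parallelized the same way through its own `OPENBLAS_NUM_THREADS` environment variable (an OpenBLAS convention rather than a FlexiBLAS feature; we have not benchmarked it here):

```bash
export OPENBLAS_NUM_THREADS=8 && export FLEXIBLAS=OPENBLAS && time python -c "import numpy as np; x = np.random.randn(5000, 5000); x @ x.T"
```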
### Using R

As mentioned above, R's linear algebra operations are performed using BLAS. ICER has installed R with FlexiBLAS, so you can choose your BLAS backend.
#### Benchmark script
The results below were run on `dev-amd20` and computed using the following script:

```bash
#!/bin/bash

module purge
module load R imkl BLIS

BACKENDS="OPENBLAS BLIS IMKL IMKL_SEQ NETLIB"

for BACKEND in $BACKENDS; do
    echo $BACKEND
    FLEXIBLAS=$BACKEND Rscript -e "system.time({x <- replicate(5e3, rnorm(5e3)); tcrossprod(x)})"
done
```
This script computes the cross product (the matrix times its transpose) of a 5000-by-5000 matrix of normally distributed random values. Note that we need to load the `imkl` and `BLIS` modules in addition to `R` to use those backends with FlexiBLAS. `FlexiBLAS` and its default backend `OpenBLAS` are loaded as dependencies of `R` and do not need to be specified.
#### Results

The output from one run of the script is copied below:
```
OPENBLAS
   user  system elapsed
  7.512   0.497   8.009
BLIS
   user  system elapsed
  7.728   0.307   8.035
IMKL
   user  system elapsed
  9.158   0.451   3.119
IMKL_SEQ
   user  system elapsed
  7.765   0.312   8.078
NETLIB
   user  system elapsed
 50.242   0.481  50.732
```
The `elapsed` time is the time you would wait for the code to run, whereas the `user` time is the total time the process spent running on the CPUs. Note that the `elapsed` time can be less than the `user` time when code runs in parallel (as in the case of `IMKL`).
The `elapsed` time shows that for this specific cross-product example, `IMKL` is fastest; `OPENBLAS`, `BLIS`, and `IMKL_SEQ` (the non-parallel version of `IMKL`) are comparable; and `NETLIB` is much slower.
Note that this is only one small example with no extra configuration; results will vary with the type of operation and the hardware used. We encourage you to explore the benefits of each backend, learn how each can be configured for your situation, and experiment with different options.
For example, `BLIS` supports parallelization similar to `IMKL`'s via an extra environment variable, producing a similar result:

```bash
BLIS_NUM_THREADS=8 FLEXIBLAS=BLIS Rscript -e "system.time({x <- replicate(5e3, rnorm(5e3)); tcrossprod(x)})"
```

```
   user  system elapsed
  8.769   0.369   3.381
```