Using BLAS and FlexiBLAS to speed up linear algebra operations
What is BLAS?
Overview
BLAS (or Basic Linear Algebra Subprograms) is used to perform linear algebra operations (like vector addition, matrix multiplication, etc.) on computers. Since these operations are the basic building blocks of most algorithms, how they are implemented can dramatically impact the speed of most programs.
Implementations
BLAS is actually a "specification", meaning it is only a list of possible operations and a common syntax for calling them. There are then different "implementations" of BLAS that determine how the operations are carried out, and these are generally highly optimized. Commonly used implementations include OpenBLAS, the Intel Math Kernel Library (IMKL), and BLIS.
Each of these implementations approaches optimizing linear algebra operations differently and can obtain different performance results in different situations. There is also a "reference implementation" of BLAS provided by Netlib, the organization that maintains the BLAS specification. This reference implementation is meant to demonstrate how a BLAS implementation should work rather than to be performant, and it is not recommended in practice.
BLAS implementations are usually installed as a static (.a file) or dynamic (.so file) library that other programs can then "link to" when they are compiled.
How BLAS is used in practice
Code can be written that interfaces with BLAS directly, but it's more common to use a piece of code that calls BLAS for you. For example, the numpy package in Python performs its linear algebra operations with a BLAS implementation when available. Similarly, most linear algebra operations in R and MATLAB call a BLAS implementation behind the scenes.
Note that even if you're not doing linear algebra operations directly, there's a very good chance that code you're using does somewhere along the way!
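As a small illustration, an ordinary numpy matrix product is dispatched to BLAS behind the scenes (for double-precision matrices, the dgemm routine) whenever numpy is built against a BLAS implementation:

```python
import numpy as np

# Two small matrices; for float64 inputs, numpy hands the
# multiplication off to the BLAS dgemm routine it was built against
a = np.arange(6.0).reshape(2, 3)
b = np.arange(6.0).reshape(3, 2)

c = a @ b
print(c)  # [[10. 13.] [28. 40.]]
```

The same dispatch happens for much larger matrices, which is where the choice of BLAS implementation starts to matter.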
What is FlexiBLAS?
FlexiBLAS is not a BLAS implementation itself, but an easy way to switch between BLAS implementations without recompiling your code.
Compiling with BLAS
When a piece of code uses BLAS, it usually needs to link to an implementation. For example, you could link mycode.c on the HPCC to the OpenBLAS dynamic library.
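A minimal sketch of such a compile line (the module name and library flag follow common conventions on the HPCC, but may differ for your setup):

```shell
# Make OpenBLAS's libraries visible to the compiler and linker
module load OpenBLAS

# Compile and link mycode.c against the OpenBLAS dynamic library (libopenblas.so)
gcc mycode.c -o mycode -lopenblas
```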
You could also compile the same code with the Intel Math Kernel Library (or IMKL); note that this requires some extra options, which are documented by Intel.
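A hedged sketch of an equivalent IMKL compile line; the exact linker options depend on the MKL version and threading layer you want, and Intel's Link Line Advisor is the authoritative reference:

```shell
# Make Intel MKL's libraries visible (the module is named imkl on the HPCC)
module load imkl

# Link against MKL's single dynamic library; other option sets are possible
gcc mycode.c -o mycode -lmkl_rt
```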
The compiled code is then fixed to use that implementation of BLAS.
Compiling with FlexiBLAS
FlexiBLAS lets you swap between multiple implementations without recompiling. You compile the code once against FlexiBLAS as the "implementation", which is really just a layer in the middle that forwards calls to the real implementation in the background.
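A sketch of compiling against FlexiBLAS instead (assuming the FlexiBLAS module provides libflexiblas.so):

```shell
module load FlexiBLAS

# Link against the FlexiBLAS wrapper; the real BLAS backend is chosen at runtime
gcc mycode.c -o mycode -lflexiblas
```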
Using FlexiBLAS
When code is compiled with FlexiBLAS, you can swap BLAS implementations (or as FlexiBLAS calls them, "backends") at runtime. For more information about FlexiBLAS not covered here, try running flexiblas help or see the FlexiBLAS documentation.
Listing available BLAS backends
When ICER installs FlexiBLAS and different BLAS implementations, we tell FlexiBLAS which backends are available. These can be listed with the flexiblas list command.
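The command comes from the FlexiBLAS CLI; on the HPCC, the printed list includes backends such as OPENBLAS, IMKL, IMKL_SEQ, BLIS, and NETLIB (the names used later on this page), along with the library file each maps to:

```shell
# Print every backend FlexiBLAS has been configured with
flexiblas list
```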
The default backend is set by ICER to be OPENBLAS.
Swapping BLAS backends at runtime
Assuming the mycode executable has been compiled with FlexiBLAS, running it as-is will use the default backend (OpenBLAS).
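That is, simply running the executable:

```shell
# No configuration set, so FlexiBLAS dispatches to the default backend (OpenBLAS)
./mycode
```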
To use a different backend, set the environment variable FLEXIBLAS to that backend's name.
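For example, to use IMKL for a single run (backend names match those printed by flexiblas list):

```shell
# Select the IMKL backend for this invocation only
FLEXIBLAS=IMKL ./mycode
```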
Make sure the module containing that implementation is loaded
FlexiBLAS can't tell whether a backend is actually installed and working until it tries to run. In the above example using FLEXIBLAS=IMKL, you would need to manually load the imkl module that provides the correct libraries beforehand.
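Putting the two steps together:

```shell
# Load the module that provides MKL's libraries, then select the IMKL backend
module load imkl
FLEXIBLAS=IMKL ./mycode
```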
If the BLAS backend you request is not loaded, FlexiBLAS will fall back to the (very slow) reference implementation.
You can also pass in the location of a BLAS implementation's dynamic library file instead of a backend name.
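For example (the path here is purely illustrative):

```shell
# Point FlexiBLAS directly at a BLAS shared library instead of a named backend
FLEXIBLAS=/path/to/libsomeblas.so ./mycode
```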
Setting a default FlexiBLAS backend
If you prefer to use a BLAS backend other than OpenBLAS at all times, you can set your local configuration with the flexiblas default command.
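The general form, which writes the choice to your per-user FlexiBLAS configuration file, is:

```shell
# Persist a default backend in your user-level FlexiBLAS configuration
flexiblas default <backend_name>
```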
For example, setting the default backend to IMKL means every FlexiBLAS-linked program you run will use the Intel Math Kernel Library. Note that you still need to load the imkl module so that FlexiBLAS can find the library (see the note above)!
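Concretely:

```shell
# Make IMKL the default backend for every FlexiBLAS-linked program you run
flexiblas default IMKL
```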
Benchmarking BLAS Implementations with FlexiBLAS
Using Python
As mentioned above, numpy's linear algebra operations are performed using BLAS. ICER has installed numpy with FlexiBLAS, so you can choose your BLAS backend.
Benchmark script
The results below are from the dev-intel14-k20 development node, computed using a script of the following form.
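A sketch along those lines, using the module names from the text (the loop structure, matrix setup, and timing details are assumptions):

```shell
# Load numpy (via SciPy-bundle) plus the extra BLAS backends
module load SciPy-bundle imkl BLIS

# Time the same numpy cross product under each FlexiBLAS backend
for backend in NETLIB OPENBLAS BLIS IMKL_SEQ IMKL; do
    echo "$backend"
    FLEXIBLAS=$backend time python -c '
import numpy as np
a = np.random.standard_normal((5000, 5000))
a @ a.T
'
done
```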
This script computes the cross product (the matrix times its transpose) of a 5000-by-5000 matrix of normally distributed random values. Note that we need to load the imkl and BLIS modules in addition to SciPy-bundle to use them with FlexiBLAS. FlexiBLAS and its default backend OpenBLAS are loaded as dependencies of SciPy-bundle and do not need to be specified.
Results
The timings from one run of the script are summarized below.
The real time is the time that you would wait for the code to run, whereas the user time is the time that the process was running on the CPUs. Note that the real time can be less than the user time when code is run in parallel (as in the case of IMKL).
The real time shows that for this specific cross-product example, IMKL is fastest; OPENBLAS, BLIS, and IMKL_SEQ (the non-parallel version of IMKL) are comparable; and NETLIB is much slower.
Note that this is only one small example with no extra configuration; results will vary by the type of operation and hardware used. We encourage you to explore the benefits of each backend and how each may be configured for your situation, and to experiment with different options.
For example, BLIS allows for parallelization similar to IMKL when an extra environment variable is set on the command line, producing a similar result.
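For example (the thread count and script name here are illustrative):

```shell
# BLIS_NUM_THREADS controls BLIS's own multithreading
FLEXIBLAS=BLIS BLIS_NUM_THREADS=4 python benchmark.py
```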
Using R
As mentioned above, R's linear algebra operations are performed using BLAS. ICER has installed R with FlexiBLAS, so you can choose your BLAS backend.
Benchmark script
The results below are from the dev-intel14-k20 development node, computed using a script of the following form.
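A sketch along those lines for R, using the module names from the text (tcrossprod(a) computes a %*% t(a); the loop structure and timing details are assumptions):

```shell
# Load R plus the extra BLAS backends
module load R imkl BLIS

# Time the same cross product under each FlexiBLAS backend
for backend in NETLIB OPENBLAS BLIS IMKL_SEQ IMKL; do
    echo "$backend"
    FLEXIBLAS=$backend Rscript -e '
        a <- matrix(rnorm(5000 * 5000), nrow = 5000)
        print(system.time(tcrossprod(a)))
    '
done
```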
This script computes the cross product (the matrix times its transpose) of a 5000-by-5000 matrix of normally distributed random values. Note that we need to load the imkl and BLIS modules in addition to R to use them with FlexiBLAS. FlexiBLAS and its default backend OpenBLAS are loaded as dependencies of R and do not need to be specified.
Results
The timings from one run of the script are summarized below.
The elapsed time is the time that you would wait for the code to run, whereas the user time is the time that the process was running on the CPUs. Note that the elapsed time can be less than the user time when code is run in parallel (as in the case of IMKL).
The elapsed time shows that for this specific cross-product example, IMKL is fastest; OPENBLAS, BLIS, and IMKL_SEQ (the non-parallel version of IMKL) are comparable; and NETLIB is much slower.
Note that this is only one small example with no extra configuration; results will vary by the type of operation and hardware used. We encourage you to explore the benefits of each backend and how each may be configured for your situation, and to experiment with different options.
For example, BLIS allows for parallelization similar to IMKL when an extra environment variable is set on the command line, producing a similar result.
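For example (the thread count and script name here are illustrative):

```shell
# BLIS_NUM_THREADS controls BLIS's own multithreading
FLEXIBLAS=BLIS BLIS_NUM_THREADS=4 Rscript benchmark.R
```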
