BLAST/BLAST+ with Multiple Processors

Overview

It is possible to run BLAST or BLAST+ on the HPCC in multi-threaded mode. This allows users to leverage multiple processors to complete their BLAST searches, thereby decreasing compute time.

To load BLAST or BLAST+ on the HPCC:

# Loading BLAST
module purge
module load BLAST/2.2.26-Linux_x86_64

# Loading BLAST+
module purge
module load icc/2017.4.196-GCC-6.4.0-2.28 impi/2017.3.196 BLAST+/2.8.1-Python-2.7.14

Multi-Threading vs. MPI

Multi-threaded BLAST runs enable the user to launch multiple worker threads on a single node. However, because standard BLAST and BLAST+ do not use distributed memory, you cannot accomplish multi-threaded runs across multiple nodes. Therefore, users executing multi-threaded BLAST or BLAST+ runs should not reserve more than one node, as this will reserve hardware resources that cannot be used.

Job Submission Guidelines

First, we need to differentiate between traditional NCBI BLAST and BLAST+. Traditional NCBI BLAST utilizes the "-a #" flag to specify the number of processors to use for the job (default is 1). BLAST+ uses the "-num_threads #" flag to specify the number of worker threads to use. Depending upon which type of BLAST you use, you will need to adjust your job submission script parameters accordingly.

Traditional BLAST

Using the "-a" flag in BLAST will specify the number of processors to use. In your job submission script, reserve a number of cores equal to the value given to the "-a" flag. For example, if you used a command like:

blastall -p blastp -d swissprot -i prot.fasta -o test1.blast -e 0.001 -a 4

You should specify something like the following in your SLURM job submission script:

#SBATCH --cpus-per-task=4
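
To keep the "-a" value and the reservation from drifting apart, the processor count can be read from the job environment instead of being typed twice. The following is a minimal sketch, not a site-provided script. The helper name "legacy_blast_cpus" is hypothetical, and the sketch assumes SLURM exports SLURM_CPUS_PER_TASK because "--cpus-per-task" is set.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4

# Hypothetical helper: value to pass to "-a".
# Falls back to 1 (the BLAST default) when SLURM_CPUS_PER_TASK is not set.
legacy_blast_cpus() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Print the command first so it can be checked in the job log.
cmd="blastall -p blastp -d swissprot -i prot.fasta -o test1.blast -e 0.001 -a $(legacy_blast_cpus)"
echo "$cmd"

# Run it only when blastall is on PATH (that is, after the module load).
if command -v blastall >/dev/null 2>&1; then
    $cmd
fi
```

With this pattern, changing the "--cpus-per-task" line is the only edit needed to rescale the job.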

BLAST+

In contrast, BLAST+ uses the "-num_threads" flag to specify the number of worker threads to create. In order to specify the correct number of cores for the job, you will need to ADD ONE to the number of threads specified. This is to account for the number of worker threads, PLUS the main process thread. So if you used an equivalent BLAST+ command like:

blastp -task blastp -db swissprot -query prot.fasta -out test1.blast -evalue 0.001 -num_threads 4

You should use the following in your SLURM script:

#SBATCH --cpus-per-task=5
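
The same rule can be applied in reverse: derive the "-num_threads" value from the number of reserved cores by subtracting one for the main process thread. This is a sketch under that assumption. "blastplus_threads" is a hypothetical helper name, and the fallback of 2 cores when SLURM_CPUS_PER_TASK is unset is an arbitrary choice for illustration.

```shell
# Hypothetical helper: worker threads for "-num_threads", given reserved cores.
# $1 = reserved cores (defaults to SLURM_CPUS_PER_TASK, then to 2).
blastplus_threads() {
    cpus="${1:-${SLURM_CPUS_PER_TASK:-2}}"
    threads=$((cpus - 1))      # leave one core for the main process thread
    if [ "$threads" -lt 1 ]; then
        threads=1              # never ask for fewer than one worker thread
    fi
    echo "$threads"
}

# With 5 reserved cores this prints: -num_threads 4
printf '%s\n' "-num_threads $(blastplus_threads 5)"
```

Inside a job script, the flag could then be written as -num_threads "$(blastplus_threads)" so that it follows the "--cpus-per-task" setting.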

BLASTDB

The BLASTDB environment variable tells BLAST or BLAST+ where to find the databases to search. On the HPCC, we offer select BLAST-ready data sets for this purpose in a common read-only area. BLAST data sets can be accessed at:

/mnt/research/common-data/Bio/blastdb

If you are using the FASTA sequences instead of nucleotide data sets, you need to augment the path above as follows:

/mnt/research/common-data/Bio/blastdb/FASTA

For cluster jobs, you will need to set the value of BLASTDB in your job submission script, for example:

export BLASTDB=/mnt/research/common-data/Bio/blastdb:/mnt/research/common-data/Bio/blastdb/FASTA:$BLASTDB
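
Because BLASTDB can hold several colon-separated directories, it can be useful to confirm which one actually contains a given database before submitting a job. The sketch below is one way to do that. "find_blastdb" is a hypothetical helper, not part of BLAST, and it only checks for files named after the database followed by a dot and an extension.

```shell
# Hypothetical helper: print the first BLASTDB directory that holds
# files for the database named in $1. Returns 1 if none is found.
find_blastdb() {
    db="$1"
    old_ifs="$IFS"
    IFS=':'
    for dir in $BLASTDB; do        # split the search path on colons
        IFS="$old_ifs"
        [ -n "$dir" ] || continue  # skip empty entries such as a trailing ":"
        for f in "$dir/$db".*; do
            if [ -e "$f" ]; then
                echo "$dir"
                return 0
            fi
        done
    done
    IFS="$old_ifs"
    return 1
}
```

For example, "find_blastdb swissprot" would print the directory containing the swissprot files, or print nothing and return a non-zero status if the database is not on the path.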

A Word About Memory

In either case (BLAST or BLAST+), the memory you request for the job (for example, 4gb) is shared among all of your threads. Plan accordingly.
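
As a rough planning aid, the average memory available to each thread can be estimated by dividing the job's memory request by the thread count. This is only arithmetic under the assumption of an even split. Actual usage per thread depends on the database and query, and "mem_per_thread_mb" is a hypothetical helper name.

```shell
# Hypothetical helper: average memory (in MB) per thread.
# $1 = total memory requested for the job in MB, $2 = number of threads.
mem_per_thread_mb() {
    total_mb="$1"
    threads="$2"
    echo $((total_mb / threads))   # integer division, rounds down
}

# 4gb (4096 MB) across 4 threads prints 1024
mem_per_thread_mb 4096 4
```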

BLAST data preparation

Data downloaded from the NCBI website or prepared by users can, in most cases, be easily converted for use with BLAST. This brief tutorial illustrates a basic scenario in which the user downloads a set of FASTA sequences from the NCBI website and prepares them for BLAST searches.

Download

The simplest way to do this is to note the link of the FASTA file, and use either the "wget" or "curl" command. For example:

wget ftp://ftp.ncbi.nih.gov/repository/UniGene/Triticum_aestivum/Ta.seq.all.gz

or

curl -O ftp://ftp.ncbi.nih.gov/repository/UniGene/Triticum_aestivum/Ta.seq.all.gz

This will download the file "Ta.seq.all.gz" into the current directory. Now unzip the file:

gunzip Ta.seq.all.gz

This will leave a file called "Ta.seq.all" in your directory.
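
Before building indices, a quick sanity check is to count the sequences in the unzipped file. FASTA header lines begin with ">", so counting those lines gives the number of records. The helper name "count_fasta_records" is hypothetical and shown only as a sketch.

```shell
# Hypothetical helper: count FASTA records (header lines start with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report the record count if the downloaded file is present.
if [ -f Ta.seq.all ]; then
    count_fasta_records Ta.seq.all
fi
```

A count of 0 suggests the download was truncated or the file is not in FASTA format.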

Preparing the Indices

To prepare the BLAST indices for nucleotides:

formatdb -i Ta.seq.all -p F

The command above will produce several files, such as:

Ta.seq.all.nhr
Ta.seq.all.nin
Ta.seq.all.nsq

If you want to produce protein indices instead of, or in addition to, nucleotide indices, run:

formatdb -i Ta.seq.all -p T

This will produce the files:

Ta.seq.all.phr
Ta.seq.all.pin
Ta.seq.all.psq

You can verify whether your BLAST formatting was successful by checking the "formatdb.log" file, which should now be present in your directory.
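
In addition to reading the log, the presence of the index files themselves can be checked from a script. The sketch below assumes a single-volume database with the three index files listed earlier (header, index, and sequence). "check_blast_index" is a hypothetical helper, not a BLAST command.

```shell
# Hypothetical helper: check that the three index files exist and are non-empty.
# $1 = prefix shared by the index files
# $2 = n for nucleotide indices, p for protein indices
check_blast_index() {
    base="$1"
    kind="$2"
    for ext in "${kind}hr" "${kind}in" "${kind}sq"; do
        if [ ! -s "$base.$ext" ]; then
            echo "missing or empty: $base.$ext" >&2
            return 1
        fi
    done
    echo "ok: $base ($kind)"
}
```

Pass the prefix shared by the index files as the first argument and "n" or "p" as the second. A non-zero return status means at least one index file is missing or empty.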