BLAST/BLAST+ with Multiple Processors
Overview
BLAST and BLAST+ can both be run on the HPCC in multi-threaded mode. This allows users to spread a BLAST search across multiple processors, thereby decreasing compute time.
To load BLAST or BLAST+ on the HPCC:
Multi-Threading vs. MPI
Multi-threaded BLAST runs enable the user to launch multiple worker threads on a single node. However, because standard BLAST and BLAST+ do not use distributed memory, you cannot accomplish multi-threaded runs across multiple nodes. Therefore, users executing multi-threaded BLAST or BLAST+ runs should not reserve more than one node, as this will reserve hardware resources that cannot be used.
Job Submission Guidelines
First, we need to differentiate between traditional NCBI BLAST and BLAST+. Traditional NCBI BLAST utilizes the "-a #" flag to specify the number of processors to use for the job (default is 1). BLAST+ uses the "-num_threads #" flag to specify the number of worker threads to use. Depending upon which type of BLAST you use, you will need to adjust your job submission script parameters accordingly.
Traditional BLAST
Using the "-a" flag in BLAST will specify the number of processors to use. To reserve the appropriate quantity of resources in your job submission script, you will need to reserve a number of cores equal to the value specified by the "-a" flag. For example, if you used a command like:
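The sketch below shows what such a call might look like with "blastall". The program ("blastn"), the database name ("nt"), the file names, and the thread count of 7 are all placeholders chosen for illustration, not values taken from this guide.

```shell
# Placeholder values: adjust the thread count, database, and file names.
THREADS=7

if command -v blastall >/dev/null 2>&1; then
    # -p program, -d database, -i query file, -o output file, -a processors
    blastall -p blastn -d nt -i query.fasta -o results.txt -a "$THREADS"
else
    echo "blastall not found; load the BLAST module first" >&2
fi
```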
You should specify something like the following in your SLURM job submission script:
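A sketch of a matching resource request, assuming the command used "-a 7" (an illustrative value). The directive layout is one common SLURM form and is an assumption, as is the 4gb figure, which is borrowed from the memory note later in this guide.

```shell
#!/bin/bash
# Multi-threaded BLAST cannot span nodes, so request exactly one node.
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Cores requested equal the value passed to -a (7 is a placeholder).
#SBATCH --cpus-per-task=7
#SBATCH --mem=4gb

# SLURM sets SLURM_CPUS_PER_TASK inside the job; fall back to 7 elsewhere.
THREADS="${SLURM_CPUS_PER_TASK:-7}"
echo "blastall should be given -a ${THREADS}"
```

Reading the thread count from SLURM_CPUS_PER_TASK keeps the "-a" value and the reservation in step if the request is changed later.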
BLAST+
In contrast, BLAST+ uses the "-num_threads" flag to specify the number of worker threads to create. In order to specify the correct number of cores for the job, you will need to ADD ONE to the number of threads specified. This is to account for the number of worker threads, PLUS the main process thread. So if you used an equivalent BLAST+ command like:
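The sketch below shows what such a BLAST+ call might look like with "blastn". The database name ("nt"), the file names, and the value of 7 worker threads are placeholders for illustration only.

```shell
# Placeholder values: adjust the thread count, database, and file names.
NUM_THREADS=7

if command -v blastn >/dev/null 2>&1; then
    # -db database, -query query file, -out output file
    blastn -db nt -query query.fasta -out results.txt -num_threads "$NUM_THREADS"
else
    echo "blastn not found; load the BLAST+ module first" >&2
fi
```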
You should use the following in your SLURM script:
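The add-one rule can be sketched as a small calculation that prints the directives to place in the job script. The value of 7 worker threads and the directive layout are assumptions made for the example.

```shell
# Worker threads passed to -num_threads (placeholder value).
NUM_THREADS=7

# One extra core for the main process thread, per the rule above.
CPUS_PER_TASK=$((NUM_THREADS + 1))

echo "#SBATCH --nodes=1"
echo "#SBATCH --ntasks=1"
echo "#SBATCH --cpus-per-task=${CPUS_PER_TASK}"
```

With 7 worker threads this requests 8 cores on a single node.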
BLASTDB
The BLASTDB environment variable tells BLAST or BLAST+ where to find the databases that can be searched. On the HPCC, we offer select BLAST-ready data sets for this purpose in a common read-only area. BLAST data sets can be accessed at:
If you are using the FASTA sequences instead of nucleotide data sets, you need to augment the path above as follows:
For cluster jobs, you will need to set the value of BLASTDB in your job submission script, for example:
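A minimal sketch of setting the variable in a job script. The path shown is a hypothetical placeholder and must be replaced with the actual data set location on the HPCC.

```shell
# Hypothetical placeholder path; substitute the real data set location.
export BLASTDB="/path/to/blast/databases"

# Confirm the value that BLAST or BLAST+ will see.
echo "BLASTDB is set to: $BLASTDB"
```

Because the variable is exported, any BLAST command started later in the same script inherits it.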
A Word About Memory
In either case (BLAST or BLAST+) your requested memory (in the examples above, 4gb) will be divided amongst all of your task threads. Plan accordingly.
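As a rough illustration of that division, the arithmetic below splits a 4gb request evenly across 8 threads. The thread count is a placeholder, and the even split is an assumption made for the arithmetic, since threads in one process actually share the allocation.

```shell
MEM_MB=4096   # 4gb expressed in megabytes
THREADS=8     # placeholder: e.g. 7 worker threads plus the main thread

PER_THREAD_MB=$((MEM_MB / THREADS))
echo "Average memory per thread: ${PER_THREAD_MB} MB"
```

If that average looks too small for the database being searched, raise the memory request rather than the thread count.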
BLAST data preparation
Data downloaded from the NCBI website, or prepared by users, can in most cases be easily converted for use with BLAST. This brief tutorial illustrates a fairly basic scenario in which the user wants to download a set of FASTA sequences from the NCBI website and prepare them for BLAST-ing.
Download
The simplest way to do this is to note the link of the FASTA file and download it with either the "wget" or the "curl" command.
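Both download options can be sketched as follows. The link itself is not reproduced in this guide, so the script expects it in a variable named FASTA_URL, a name chosen here for illustration.

```shell
# Paste the link copied from the NCBI site into FASTA_URL before running.
FASTA_URL="${FASTA_URL:-}"

if [ -z "$FASTA_URL" ]; then
    echo "Set FASTA_URL to the link of the .gz file first" >&2
elif command -v wget >/dev/null 2>&1; then
    # wget saves the file under its remote name by default.
    wget "$FASTA_URL"
else
    # curl needs -O to save the file under its remote name.
    curl -O "$FASTA_URL"
fi
```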
This will download the file "Ta.seq.all.gz" into the current directory. Now unzip the file, for example with "gunzip Ta.seq.all.gz".
This will leave a file called "Ta.seq.all" in your directory.
Preparing the Indices
To prepare the BLAST indices for nucleotides:
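A sketch of the formatting step, assuming the legacy "formatdb" tool (suggested by the "formatdb.log" file mentioned at the end of this tutorial) and the unzipped file "Ta.seq.all" from the previous step.

```shell
INPUT="Ta.seq.all"

if command -v formatdb >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    # -i input FASTA file, -p F marks the input as nucleotide
    formatdb -i "$INPUT" -p F
else
    echo "formatdb or $INPUT not available; nothing was formatted" >&2
fi
```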
Nucleotide formatting will produce several files, such as "Ta.seq.all.nhr", "Ta.seq.all.nin", and "Ta.seq.all.nsq".
If you want to produce protein indices instead of, or in addition to, nucleotide indices, run:
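A sketch of the protein variant, again assuming the legacy "formatdb" tool. The input file name is reused from the nucleotide step purely for illustration; the file must actually contain protein sequences for the resulting indices to be meaningful.

```shell
# Placeholder: point this at a FASTA file of protein sequences.
INPUT="Ta.seq.all"

if command -v formatdb >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    # -p T marks the input as protein
    formatdb -i "$INPUT" -p T
else
    echo "formatdb or $INPUT not available; nothing was formatted" >&2
fi
```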
In this case, formatting will produce files with the extensions ".phr", ".pin", and ".psq".
You can verify whether your BLAST formatting was successful by looking at the "formatdb.log" file, which should now be present in your directory.
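Alongside reading the log, a quick check that the index files exist can be sketched as below. The helper name "check_blast_index" is hypothetical, and the extensions assume the usual "formatdb" output (".nhr", ".nin", ".nsq" for nucleotide; ".phr", ".pin", ".psq" for protein).

```shell
# check_blast_index PREFIX KIND  (KIND is n for nucleotide, p for protein)
# Returns 0 when all three index files exist, 1 otherwise.
check_blast_index() {
    prefix=$1
    kind=$2
    missing=0
    for ext in hr in sq; do
        if [ ! -f "${prefix}.${kind}${ext}" ]; then
            echo "missing: ${prefix}.${kind}${ext}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Show the log if present, then check the nucleotide index files.
if [ -f formatdb.log ]; then
    cat formatdb.log
fi
if check_blast_index Ta.seq.all n; then
    echo "nucleotide index looks complete"
fi
```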