Assembly of PacBio long reads with Canu
Introduction
Canu is a de novo assembler for long reads, such as those generated by PacBio or Oxford Nanopore technologies. It consists of three steps: read correction, read trimming, and contig assembly.
As of September 2021, the latest version, 2.2, is installed on the HPCC. You can load it with:
```
module purge
module spider canu        # check the exact module name/version on your system
module load canu/2.2
```
Then, simply running `canu` with no arguments prints the help information. For example, at the bottom of the help text, we learn that canu supports three types of raw input data:
```
  -pacbio       <files>     PacBio CLR reads
  -nanopore     <files>     Oxford Nanopore reads
  -pacbio-hifi  <files>     PacBio HiFi reads
```
While canu can automate job submission through SLURM, we don't recommend this method. Instead, please specify `useGrid=false` in the `canu` command to disable grid support, and write a job script manually, treating canu as an ordinary program.
An example using PacBio reads
The PacBio reads we will be assembling are the same as the ones used in the canu tutorial, which can be downloaded using the following command:
```
curl -L -o pacbio.fastq https://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq
```
By default, the canu pipeline will correct the reads, trim the reads, and then assemble the reads into contigs. Minimally, you can run canu on a dev-node as follows (after first loading the necessary modules):
```
module purge
module load canu/2.2
# genomeSize and maxThreads values below are examples; adjust for your data
/usr/bin/time -v canu -p ecoli -d ecoli-pacbio genomeSize=4.8m \
    useGrid=false maxThreads=8 \
    -pacbio pacbio.fastq > runCanu_2021-09-14.log 2>&1 &
```
Above:

- `pacbio.fastq` is the input file, containing the raw, unprocessed reads. Coupled with `-pacbio`, canu knows which technology generated these reads.
- `-p`: sets the file name prefix of intermediate and output files; it is mandatory.
- `-d`: sets the assembly directory for canu to run in. If not supplied, canu runs in the current directory. It is not possible to run two different assemblies in the same directory.
- `genomeSize`: in bases, with common suffixes allowed, such as 4.7m or 2.8g. Canu uses it to determine the coverage of the input reads.
- `useGrid=false`: makes canu run on the local machine.
- `maxThreads`: the maximum number of threads that each task can use.
- Finally, we put `/usr/bin/time -v` in front of the canu command to record resource usage, which is shown at the end of the log file `runCanu_2021-09-14.log`. For example, `Maximum resident set size (kbytes): 4113216` tells us that the maximum memory used during the run was about 4 GB, and `Percent of CPU this job got: 476%` tells us we used about 5 CPUs on average.
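Those two log figures can be converted into human-friendly units with a few lines of Python. This is a small sketch: `parse_time_v` is a hypothetical helper, and the sample log lines follow GNU time's `-v` output format.

```python
import re

def parse_time_v(log_text):
    """Extract peak memory (GB) and average CPU count from GNU `time -v` output."""
    mem_kb = int(re.search(r"Maximum resident set size \(kbytes\): (\d+)", log_text).group(1))
    cpu_pct = int(re.search(r"Percent of CPU this job got: (\d+)%", log_text).group(1))
    return mem_kb / 1024**2, cpu_pct / 100  # kbytes -> GB, percent -> CPUs

# Sample lines as they appear at the end of the log file
log = """Percent of CPU this job got: 476%
Maximum resident set size (kbytes): 4113216"""

mem_gb, cpus = parse_time_v(log)
print(f"{mem_gb:.1f} GB, {cpus:.1f} CPUs")  # prints "3.9 GB, 4.8 CPUs"
```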
If we want to run canu on the HPCC cluster, we can write a job script accordingly:
```
#!/bin/bash --login
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --job-name=canu_ecoli

module purge
module load canu/2.2

cd $SLURM_SUBMIT_DIR
/usr/bin/time -v canu -p ecoli -d ecoli-pacbio genomeSize=4.8m \
    useGrid=false maxThreads=8 \
    -pacbio pacbio.fastq > runCanu_2021-09-14.log 2>&1

scontrol show job $SLURM_JOB_ID
```

(The resource requests above are examples; in particular, match `--cpus-per-task` to `maxThreads`.)
The canu command is exactly the same as the one we ran on the dev-node, except that the trailing `&` is removed when it is within a job script.
The primary output file for most users is the assembled contigs. In this example, it is `ecoli-pacbio/ecoli.contigs.fasta` under your current working directory. Refer to this page when you want to learn more about the output, such as the various statistics of the reads analyzed, as reported in the `ecoli.report` file.
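For a quick look at the assembly without extra tools, basic contig statistics can be computed with a short Python sketch (the `contig_stats` helper is hypothetical; the FASTA path follows the example above):

```python
def contig_stats(fasta_path):
    """Return (number of contigs, total assembled length, N50) for a FASTA file."""
    lengths, current = [], 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):       # header starts a new contig
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    lengths.sort(reverse=True)
    total = sum(lengths)
    running, n50 = 0, 0
    for n in lengths:                      # N50: length at which half the total is reached
        running += n
        if running * 2 >= total:
            n50 = n
            break
    return len(lengths), total, n50
```

For example, `contig_stats("ecoli-pacbio/ecoli.contigs.fasta")` would return the number of contigs, the total assembled length in bases, and the N50.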
Notes
- To adjust default parameters, you need to consult the canu parameter reference.
- The three steps (error correction, trimming and assembly) can be individually run. See this example.
- If your data is PacBio HiFi reads (i.e., CCS reads with predicted accuracy >= Q20, or 99%), you may want to use the option `-pacbio-hifi` rather than `-pacbio`. Canu will skip read correction and trimming in this case.