Assembly of PacBio long reads with Canu

Introduction

Canu performs de novo assembly of long reads, such as those generated by PacBio or Oxford Nanopore technologies. It runs in three steps: read correction, read trimming, and contig assembly.

As of September 2021, the latest release, version 2.2, is installed on the HPCC. You can load it with:

module purge
module load GCCcore/8.3.0 Java/11 Perl/5.30.0 gnuplot/5.2.8
export PATH=/opt/software/canu/canu-2.2/bin:$PATH
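Before going further, it can help to confirm that the shell actually finds canu after the PATH change. The sketch below defines a small helper, require_tool, which is a hypothetical name of our own and not part of canu or the module system. It only reports where a command resolves on PATH.

```shell
# require_tool is a hypothetical helper, not part of canu or the HPCC modules.
# It prints the resolved location of a command, or fails with a message on stderr.
require_tool() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found at: $(command -v "$tool")"
    else
        echo "$tool not found on PATH" >&2
        return 1
    fi
}

# Usage, after the module and PATH lines above:
#   require_tool canu && canu -version
```

If the helper reports that canu is not found, re-check the module load and export lines before continuing.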

Once it is loaded, running canu with no arguments prints its help text. For example, the bottom of the help text shows that canu supports three types of raw input data:

[technology]
-pacbio      <files>
-nanopore    <files>
-pacbio-hifi <files>

Although canu can submit its own jobs through SLURM, we don't recommend this method. Instead, specify useGrid=false in the canu command to disable grid support, and write the job script yourself, treating canu as an ordinary program.

An example using PacBio reads

The PacBio reads we will be assembling are the same as the ones used in the canu tutorial, which can be downloaded using the following command:

curl -L -o pacbio.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq
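After the download, a quick sanity check is to count the reads and estimate the coverage, since canu relates the input reads to the genome size you give it. The sketch below is one way to do that with awk. The helper name fastq_coverage is hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per record and a genome size given in plain bases.

```shell
# fastq_coverage is a hypothetical helper. It assumes a plain (uncompressed)
# FASTQ file with four lines per record, and a genome size given in bases.
fastq_coverage() {
    awk -v genome="$2" '
        NR % 4 == 2 { reads++; bases += length($0) }   # sequence lines only
        END {
            printf "reads=%d bases=%d coverage=%.1fx\n", reads, bases, bases / genome
        }' "$1"
}

# Usage: fastq_coverage pacbio.fastq 4800000
```

The reported coverage is simply total bases divided by genome size, so it is only as good as the genome size estimate.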

By default, the canu pipeline corrects the reads, trims them, and then assembles them into contigs. A minimal run on a dev-node looks like this (load the necessary modules first):

module purge
module load GCCcore/8.3.0 Java/11 Perl/5.30.0 gnuplot/5.2.8
export PATH=/opt/software/canu/canu-2.2/bin:$PATH

/bin/time -v canu -p ecoli -d ecoli-pacbio genomeSize=4.8m useGrid=false maxThreads=10 -pacbio pacbio.fastq > runCanu_2021-09-14.log 2>&1 &

Above,

  • pacbio.fastq is the input file, considered as raw and unprocessed reads. Coupled with -pacbio, canu knows which technology has generated these reads.
  • -p: set the file name prefix of intermediate and output files; it's mandatory.
  • -d: set assembly directory name for canu to run in. If not supplied, it'll run in the current directory. It is not possible to run two different assemblies in the same directory.
  • genomeSize: the estimated genome size in bases, with common SI suffixes allowed, such as 4.7m or 2.8g. canu uses it to determine coverage in the input reads.
  • useGrid=false: make canu run on the local machine.
  • maxThreads: the maximum number of threads that each task can use.
  • Finally, we put /bin/time -v in front of the canu command to get resource usage, which is written at the end of the log file runCanu_2021-09-14.log. For example,
    • Maximum resident set size (kbytes): 4113216 tells us that the peak memory used during the run is about 4 GB.
    • Percent of CPU this job got: 476% tells us that the run used just under 5 CPUs on average.
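The two resource figures above can be pulled out of the log and converted automatically. The sketch below assumes the log ends with the report written by GNU time -v, as in the command above. The helper name summarize_time_log is hypothetical.

```shell
# summarize_time_log is a hypothetical helper. It assumes the log contains the
# report written by GNU time -v, as produced by the command above.
summarize_time_log() {
    awk -F': ' '
        # Peak memory is reported in kbytes; convert to GiB.
        /Maximum resident set size/ { printf "peak_mem_gib=%.2f\n", $2 / 1024 / 1024 }
        # CPU usage is a percentage, where 100% corresponds to one full core.
        /Percent of CPU this job got/ { sub(/%/, "", $2); printf "avg_cores=%.2f\n", $2 / 100 }
    ' "$1"
}

# Usage: summarize_time_log runCanu_2021-09-14.log
```

For the values quoted above, this reports peak_mem_gib=3.92 and avg_cores=4.76. These numbers are a reasonable starting point for the memory and CPU requests in a job script, with some headroom added.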

To run canu on the HPCC cluster as a batch job, we can write a job script accordingly:

#!/bin/bash

#SBATCH --job-name=canu_ecoli
#SBATCH --cpus-per-task=10
#SBATCH --mem=10G
#SBATCH --time=2:00:00
#SBATCH --output=%x-%j.SLURMout

echo "This script is from ICER's Canu example"

module purge
module load GCCcore/8.3.0 Java/11 Perl/5.30.0 gnuplot/5.2.8
export PATH=/opt/software/canu/canu-2.2/bin:$PATH

/bin/time -v canu -p ecoli -d ecoli-pacbio genomeSize=4.8m useGrid=false maxThreads=10 -pacbio pacbio.fastq > runCanu_2021-09-14.log 2>&1

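The canu command is exactly the same as the one we ran on the dev-node, except that the trailing & is removed. Inside a job script the command must run in the foreground; otherwise the script would finish immediately and the job would end before canu completes.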

The primary output file for most users is the assembled contigs. In this example, it is ecoli-pacbio/ecoli.contigs.fasta under your current working directory. Refer to this page when you want to learn more about the output, such as the various statistics of the reads analyzed, as reported in the ecoli.report file.
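A quick way to inspect the assembled contigs is to count them and sum their lengths, then compare the total with the genomeSize you supplied. The sketch below does this with awk. The helper name contig_stats is hypothetical, and it assumes a plain, uncompressed FASTA file.

```shell
# contig_stats is a hypothetical helper for a plain (uncompressed) FASTA file.
contig_stats() {
    awk '
        /^>/ { contigs++; next }        # a header line starts a new contig
             { total += length($0) }    # every other line is sequence
        END  { printf "contigs=%d total_bases=%d\n", contigs, total }
    ' "$1"
}

# Usage: contig_stats ecoli-pacbio/ecoli.contigs.fasta
```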

Notes

  • To adjust default parameters, consult the canu parameter reference.
  • The three steps (error correction, trimming and assembly) can be run individually. See this example.
  • If your data is PacBio HiFi reads (i.e. CCS reads with predicted accuracy >= Q20 or 99%), you may want to use the option -pacbio-hifi rather than -pacbio. Canu will skip read correction and trimming in this case.
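For HiFi data, the command differs from the PacBio example only in the read-type option. The sketch below prints such a command for review rather than running it. The helper name canu_hifi_cmd, the input file name hifi.fastq and the -hifi directory suffix are placeholders of our own; the other options are carried over from the example above.

```shell
# canu_hifi_cmd is a hypothetical helper that only prints a command line.
# It does not run canu. The file name hifi.fastq is a placeholder.
canu_hifi_cmd() {
    prefix="$1"
    genome_size="$2"
    reads="$3"
    echo "canu -p $prefix -d $prefix-hifi genomeSize=$genome_size useGrid=false maxThreads=10 -pacbio-hifi $reads"
}

# Print the command for an E. coli sized genome.
canu_hifi_cmd ecoli 4.8m hifi.fastq
```

The printed line can be pasted into a job script in place of the -pacbio command, after checking that the resource requests still fit your data.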