
Trinity for RNA-seq de novo assembly

Loading module

For example, to load Trinity 2.6.6, run:

module purge
module load icc/2017.4.196-GCC-6.4.0-2.28 impi/2017.3.196 Trinity/2.6.6

Most basic run (transcript assembly)

A typical Trinity command for assembling strand-specific paired-end RNA-seq data would look like:

A typical run of Trinity

Trinity \
  --seqType fq \
  --max_memory 2G \
  --left reads.left.fq \
  --right reads.right.fq \
  --SS_lib_type RF \
  --CPU 10

This writes the output files to a new directory, trinity_out_dir, in the working directory. The assembled transcripts are in the file Trinity.fasta. For more details, see https://github.com/trinityrnaseq/trinityrnaseq/wiki.
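As a quick sanity check on the assembly, the number of assembled transcripts can be counted from the FASTA headers. The sketch below assumes the default output location trinity_out_dir/Trinity.fasta described above; the function name count_transcripts is only illustrative.

```shell
# Count sequences in a FASTA file: each record starts with a ">" header line.
count_transcripts() {
    grep -c '^>' "$1"
}

# Example (path assumes the default output directory from the run above):
# count_transcripts trinity_out_dir/Trinity.fasta
```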

When you submit the above command as a job to the cluster, you need to request 10 CPUs in the sbatch script with the following lines (in addition to your other sbatch directives):

sbatch code snippet

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
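To keep Trinity's --CPU value in step with the Slurm request, the thread count can be read from the environment instead of being typed in two places. This is only a sketch: SLURM_CPUS_PER_TASK is set by Slurm inside a job when --cpus-per-task is given, while the fallback of 1 and the variable name NCPU are assumptions made for illustration.

```shell
# Inside a job, Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested.
# Fall back to 1 thread when the variable is unset (e.g. on a login node).
NCPU="${SLURM_CPUS_PER_TASK:-1}"
echo "Trinity will be run with ${NCPU} threads"

# Then pass the value to Trinity in place of a hard-coded number:
#   Trinity ... --CPU "${NCPU}"
```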

Transcript quantification

Trinity provides many utility scripts for post-assembly analysis, such as quality assessment, transcript quantification and differential expression tests. Some of them depend on external software tools that are not bundled with Trinity and must be installed separately. For example, the transcript quantification step requires one of RSEM, eXpress, kallisto or salmon (cf. https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification). All four are available on the HPCC. As instructed by Trinity, "the tools should be available via your PATH setting". So, in the next example, where we use RSEM to align reads to the assembled transcripts and then quantify transcript abundance, we first set the PATH variable so that Trinity can find RSEM automatically.

Using RSEM for transcript quantification

# Assuming
#    1) you've loaded Trinity module already and
#    2) your current working directory is trinity_out_dir generated from the previous assembly step. 

export PATH=/opt/software/RSEM/1.3.1-GCCcore-6.4.0/usr/local/bin:$PATH

/opt/software/Trinity/2.6.6/util/align_and_estimate_abundance.pl --seqType fq --transcripts Trinity.fasta \
    --est_method RSEM \
    --left ../reads.left.fq \
    --right ../reads.right.fq \
    --SS_lib_type RF \
    --aln_method bowtie \
    --trinity_mode \
    --prep_reference \
    --thread_count 10 \
    --output_dir RSEM_out
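If the script cannot find RSEM or the aligner, a check like the one below can confirm what is on PATH before the job is submitted. It is a minimal sketch: the helper name check_tools is hypothetical, and the tool names in the example (rsem-calculate-expression and bowtie) are the executables we assume the RSEM and bowtie route needs.

```shell
# Report any command from the argument list that is not found on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (tool names are assumptions for the RSEM + bowtie route):
# check_tools rsem-calculate-expression bowtie
```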

The RSEM computation generates two primary output files containing estimated abundances in the subdirectory RSEM_out as specified in the command above: RSEM.isoforms.results (transcript level) and RSEM.genes.results (gene level).
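For a first look at the results, the most abundant transcripts can be listed by TPM. The sketch below assumes the standard RSEM column layout, in which TPM is the sixth tab-separated column of RSEM.isoforms.results; the function name top_tpm is only illustrative.

```shell
# Print the N transcripts with the highest TPM as "TPM<TAB>transcript_id".
# Assumes a one-line header and TPM in column 6 (standard RSEM layout).
# N defaults to 10 when the second argument is omitted.
top_tpm() {
    awk -F '\t' 'NR > 1 { print $6 "\t" $1 }' "$1" | sort -rn | head -n "${2:-10}"
}

# Example: top_tpm RSEM_out/RSEM.isoforms.results 5
```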

More utilities

Please consult https://github.com/trinityrnaseq/trinityrnaseq/wiki for details.

Note that a few R packages are needed for differential expression analysis (https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression). These have been installed in R/4.0.2, which can be loaded with:

module purge; module load GCC/8.3.0 OpenMPI/3.1.4 R/4.0.2

Version note

The latest version is 2.9.1. After loading it, you may load R 4.0.2 for DE analysis.

module purge
module load GCC/8.3.0 OpenMPI/3.1.4 R/4.0.2 Trinity/2.9.1
module load R/4.0.2