Trinity for RNA-seq de novo assembly
Loading module
To load Trinity 2.15.1, for example, run:
module purge
module load Trinity/2.15.1-foss-2023a
Most basic run (transcript assembly)
A typical Trinity command for assembling strand-specific paired-end RNA-seq data would look like:
A typical run of Trinity
Trinity \
--seqType fq \
--max_memory 2G \
--left reads.left.fq \
--right reads.right.fq \
--SS_lib_type RF \
--CPU 10
This will generate output files in a new directory, trinity_out_dir,
in the working directory. Among them, the assembled transcripts are in
the file "Trinity.fasta". For more details, see
https://github.com/trinityrnaseq/trinityrnaseq/wiki.
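A quick sanity check after the run is to count the assembled transcripts, since every FASTA record starts with a header line beginning with ">". The following is a minimal sketch; the function name count_fasta_records is illustrative, and the file path is passed as an argument because the exact output location may differ between Trinity versions:

```shell
# Count FASTA records by counting header lines (those starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, count_fasta_records trinity_out_dir/Trinity.fasta would print the number of assembled transcripts.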
When you submit the Trinity assembly command as a job to the cluster, request 10 CPUs in the sbatch script to match --CPU 10, using the following lines (in addition to your other sbatch directives):
sbatch code snippet
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
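To keep the Trinity --CPU value from drifting away from the Slurm request, the job script can read the granted CPU count from the environment. This is a minimal sketch, assuming the job is submitted with --cpus-per-task so that Slurm sets SLURM_CPUS_PER_TASK; the variable name CPUS and the fallback value are illustrative choices:

```shell
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested.
# Fall back to 1 (an arbitrary, illustrative default) when run outside a job.
CPUS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running Trinity with --CPU ${CPUS}"
```

Passing --CPU "$CPUS" to Trinity in place of a hard-coded number then keeps the two settings in step.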
Transcript quantification
Trinity provides many utility scripts for post-assembly analysis,
such as quality assessment, transcript quantification and differential
expression tests. Some of them rely on external software tools that
must be installed separately (that is, they are not bundled with
Trinity). For example, the transcript quantification step needs one of
RSEM, eXpress, kallisto and salmon (cf. https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification).
We have made all four available on the HPCC. As instructed by
Trinity, "the tools should be available via your PATH setting". So, in
the next example, where we use RSEM to align reads to the assembled
transcripts and then quantify transcript abundance, we first make sure
that RSEM is on the PATH so that Trinity can find it automatically.
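Before launching the quantification, it can help to confirm that the required executables are actually visible on the PATH. The following is a minimal sketch; the helper name missing_tools is illustrative, and the list of commands checked (RSEM's rsem-prepare-reference and rsem-calculate-expression, plus bowtie and bowtie-build for the bowtie aligner) is an assumption about what this particular RSEM run needs. Loading the corresponding modules is typically what puts them on the PATH; module names are site-specific and are not shown here.

```shell
# Print each of the given commands that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the RSEM executables and the bowtie aligner (assumed requirements).
missing=$(missing_tools rsem-prepare-reference rsem-calculate-expression bowtie bowtie-build)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
fi
```

If nothing is reported, the tools should be discoverable by the Trinity utility script.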
Using RSEM for transcript quantification
# Assumptions:
# 1) the Trinity module is already loaded, and
# 2) the current working directory is trinity_out_dir, generated by the previous assembly step.
$EBROOTTRINITY/trinityrnaseq-v2.15.1/util/align_and_estimate_abundance.pl \
--seqType fq \
--transcripts Trinity.fasta \
--est_method RSEM \
--left ../reads.left.fq \
--right ../reads.right.fq \
--SS_lib_type RF \
--aln_method bowtie \
--trinity_mode \
--prep_reference \
--thread_count 10 \
--output_dir RSEM_out
The RSEM computation generates two primary output files containing
estimated abundances in the subdirectory RSEM_out, as specified in the
command above: RSEM.isoforms.results (transcript level) and
RSEM.genes.results (gene level).
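To get a first look at the results, the most highly expressed transcripts can be listed by TPM. This is a minimal sketch, assuming the standard tab-separated RSEM layout in which column 1 is transcript_id and column 6 is TPM, with one header line; the function name top_tpm and the default of 10 rows are illustrative:

```shell
# Print the n transcripts with the highest TPM from an RSEM isoforms table.
top_tpm() {
    results_file=$1
    n=${2:-10}
    # Skip the header, keep transcript_id (col 1) and TPM (col 6),
    # then sort numerically by TPM in descending order.
    awk -F'\t' 'NR > 1 { print $1 "\t" $6 }' "$results_file" \
        | sort -t "$(printf '\t')" -k2,2nr \
        | head -n "$n"
}
```

For example, from within trinity_out_dir, top_tpm RSEM_out/RSEM.isoforms.results 5 would print the five transcripts with the highest TPM.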
More utilities
Please consult https://github.com/trinityrnaseq/trinityrnaseq/wiki for details.