Warning
This is a Lab Notebook entry that describes how to solve a specific problem at a specific time. Please keep this in mind as you read and use the content, and pay close attention to the date, version information, and other details.
Change to modules in SLURM jobs
Summary: The way that modules are loaded in SLURM jobs is changing slightly. SLURM jobs will now require you to load modules within your SLURM script before you use them. Previously, this was only a best practice recommended by ICER.
What do I need to do?:
```mermaid
graph TD
A[Start] --> B[Do you only use the default modules?];
B -->|Yes| C([<b>You don't need to do anything!</b>]);
B -->|No| D[Where do you load modules for your job?];
D -->|In the batch script| E([<b>You don't need to do anything!</b>]);
D -->|On the development node| F([<b>Add <code>module load</code> lines to your SLURM script.</b>]);
```
If you use a workflow manager like Nextflow or Snakemake and are using non-default modules, please see the recommendations below.
Why is this happening?
With the new module system, ICER is able to build software adapted to the specific types of nodes in the HPCC. For example, our intel18 nodes have capabilities like AVX-512 that are not available in the amd20 nodes.
Previously, ICER would build one version of the software that worked on every type of node. Now, we have the capability to build multiple versions of software, each adapted to the unique capabilities of our hardware generations. However, this means that when you load one of these "node-adapted" modules on a development node, that same module gets used in the SLURM job no matter where in the HPCC the job runs. This leads to "illegal instruction" errors when software built for newer capabilities runs on a node without those capabilities.
By making this change, SLURM will load modules from the collection adapted to the node the job is running on, no matter what development node was used to submit that job. This means that your code will run as quickly and efficiently as possible on the nodes that SLURM assigns it.
While there are other solutions (like constraining your job to the same type of node that you are submitting from), this solution is the most flexible, is in line with our previous recommendations, and gives you access to the largest collection of nodes at once, reducing queue times.
What exactly is being changed?
ICER is changing the way that the module system and SLURM interact. Currently, SLURM inherits the entire environment of the development node you submit your job from, including all loaded modules and all changes to the module path (the location where modules are found).
In the new configuration, SLURM jobs will start by resetting all loaded modules inherited from the development node back to the appropriate defaults for your assigned compute node and changing the module path accordingly (see above).
What do I need to do?
If you already load all modules in your SLURM scripts before you use them (as is recommended by ICER), you don't need to make any changes!
However, if you load non-default modules on the development nodes and then use those modules in your SLURM scripts, please add those `module load` commands to your script before you use the programs those modules provide. Additionally, if you make any changes to your module path using the `module use` command (e.g., because you are loading modules that are not provided by ICER), make sure you do this before you load modules in your SLURM scripts as well.
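For instance, the beginning of a job script that uses a personally installed module might be ordered like this (the resource requests, module path, and module name below are hypothetical):

```bash
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=2GB

# Extend the module path first so custom modules can be found
# (this path is hypothetical)
module use /mnt/home/$USER/modules

# Then load modules, including any custom ones (name is hypothetical)
module load mytool/1.0

# Only now use the programs those modules provide
mytool --input data.txt
```

The key point is the order: `module use` before `module load`, and both before any commands that depend on the loaded software.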
Example
Suppose that in a typical session, I have a SLURM script that looks like the following (the resource requests and Stata invocation are illustrative):

script.sb

```bash
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=4GB

# Run a Stata do-file in batch mode
stata -b do analysis.do
```
The `stata` command comes from the Stata module. Before I submit this script, I log in and load modules like:

```bash
module purge
module load Stata/18-MP
sbatch script.sb
```
This will no longer work, because when the SLURM job starts, the Stata module will be unloaded and replaced by the default modules (which do not include Stata). The fix is to add the `module purge` and `module load Stata/18-MP` lines to the beginning of the SLURM script like:

script_fixed.sb

```bash
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=4GB

# Load required modules before using them
module purge
module load Stata/18-MP

# Run a Stata do-file in batch mode
stata -b do analysis.do
```
You no longer have to load the modules before submitting the job.
Special considerations for Nextflow and Snakemake
Nextflow and Snakemake are two workflow managers that can submit jobs to SLURM for you. Since they build the SLURM scripts, you will need to take extra measures to ensure that they load the required modules in the steps where they are used.
Nextflow
In Nextflow, add the modules you need to the process definition using the `module` directive. For examples and more information, please see Nextflow's documentation.
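For instance, a process that needs the Stata module from the example above could declare it like this (the process name, input, and script are illustrative):

```groovy
process runStata {
    // Load the required module on whichever node SLURM assigns the task to
    module 'Stata/18-MP'

    input:
    path dofile

    script:
    """
    stata -b do ${dofile}
    """
}
```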
Snakemake
In Snakemake, add the modules you need to the rule using the `envmodules` key and run Snakemake with the `--use-envmodules` flag. For examples and more information, please see Snakemake's documentation.
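For instance, a rule that needs the Stata module from the example above could declare it like this (the rule name, file names, and shell command are illustrative); remember to run with `snakemake --use-envmodules`:

```python
rule run_stata:
    input:
        "analysis.do"
    output:
        "analysis.log"
    envmodules:
        # Loaded on the compute node when Snakemake is run with --use-envmodules
        "Stata/18-MP"
    shell:
        "stata -b do {input}"
```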