Frequently Asked Questions (FAQ)

This page lists many of our frequently asked questions. Please search for keywords related to an issue by using Ctrl+F (on Windows/Linux) or Cmd+F (on Mac), or scroll through the list of questions in the table of contents to the right.

If you don't see an answer to your question, please contact us.

Logging in and accessing the HPCC

What is my HPCC user name/password?

If you are affiliated with MSU, then your MSU NetID is your user name, and your NetID password is your HPCC password. These are the same credentials you use for all MSU online services. An HPCC account must be requested by an MSU faculty member at https://contact.icer.msu.edu/account

There are two ways you can be blocked after entering an incorrect password too many times. First, authentication on the HPCC is tied directly to MSU, so after too many incorrect attempts you may need to request a password reset at https://netid.msu.edu/netid/password/index.html. Second, the HPCC blocks hosts that make too many failed SSH connection attempts. Users who can log into other MSU resources (Spartan365, D2L, EBS) but are unable to connect to the HPCC should submit a ticket through our contact forms. Be sure to include your external IP address; you can check it with Google.

I used to be able to connect to the HPCC server, but now I can't. Why?

There can be multiple reasons for this, such as system downtime (so please check the ICER blog first). Another common reason is account expiry. The HPCC periodically disables the accounts of users who are no longer affiliated with the university or registered for a class for which the instructor has created temporary student accounts. To re-activate your HPCC account, please have your PI submit a sponsoring form at https://contact.icer.msu.edu/sponsoredrenewal

I get a "Permission denied" error, but I put in the right password. What's wrong?

If you are attempting to connect to the rsync.hpcc.msu.edu server, note that it requires an SSH key pair. See our documentation for how to generate a key pair here. Otherwise, see the question above.

I get an error like "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!"

The following error can occur when the HPCC upgrades a development node or changes its identifying information (called "host keys"):

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
SHA256:rydhZ58BeXsXgWQisWlbH6E0IFG+2+LSxC9a7OfZBro.
Please contact your system administrator.
Add correct host key in <location>/.ssh/known_hosts to get rid of this message.
Offending RSA key in <location>/.ssh/known_hosts:6
You can use following command to remove the offending key:
ssh-keygen -R dev-amd20 -f <location>/.ssh/known_hosts
Host key for dev-amd20 has changed and you have requested strict checking.
Host key verification failed.

To fix this error, please see this page.

Can I use the HPCC through a web browser?

Yes, we provide Open OnDemand, a web portal for easy web access to the HPCC. Check out this tutorial.

Limits and usage

Are there any limits per user on using the HPCC resources?

Dev node limits

Each process on a dev-node is limited to 2 CPU hours. If you are running a multi-threaded program, the wall time limit would be (roughly) 2 hours divided by the number of threads.

Limits on storage

Each user has a limit of 100 GB of storage and 1 million files in their home directories.
PIs can request additional storage in research spaces. By default, each research space is given 250 GB, but a PI can request a total of 3 TB for free (divided among their various research spaces). Beyond 3 TB, the cost is $89 per TB per year for MSU users.
For scratch space (i.e. /mnt/scratch/<your_user_name>), 50 TB is the maximum; more may be requested via the contact forms with center director approval.

Limits on cluster usage

The longest wall time you can request is 7 days.
The maximum number of CPU cores you can use at any one time is 1040 (see SLURM variable QOSMaxCpuPerUserLimit), unless you have a larger buy-in and your PI has requested that your buy-in account only run on the buy-in nodes.
You can have at most 1000 jobs queued and 520 jobs running at any one time (except in the scavenger queue).
Non-buyin users have a maximum of 500,000 CPU hours per year.
Non-buyin users have a maximum of 10,000 GPU hours per year.
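
To get a feel for how quickly the yearly allowance can be consumed, the sketch below multiplies cores by wall-time hours. It assumes CPU hours are counted as cores times elapsed hours; job_cpu_hours is a hypothetical helper for illustration, not an HPCC command.

```bash
# Hypothetical helper: CPU hours consumed by one job,
# assuming CPU hours = number of cores * wall-time hours.
job_cpu_hours() {
    cores=$1
    hours=$2
    echo $(( cores * hours ))
}

# Largest single job allowed by the limits above:
# 1040 cores for 7 days (168 hours).
job_cpu_hours 1040 168   # prints 174720
```

Under that assumption, one such job would use roughly 35% of a non-buyin user's 500,000 CPU hours for the year.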

Can you tell me more about the dev-node limit?

When you connect to any of the HPCC's dev-nodes, you will see the following message:

processes on development nodes are limited to two hours of CPU time.

The two-hour CPU time limit applies to each process you run on that dev-node. If one process uses more than 2 hours of CPU time, only that process will be killed; you can still connect to that dev-node and run another process. CPU time accumulates across cores: if your process uses 100% CPU (1 core), it will be terminated after two hours of wall time; if it uses 200% CPU (2 cores), it will be terminated after one hour, and so on.
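
The arithmetic can be sketched as a small shell function. dev_node_wall_minutes is a hypothetical helper, not an HPCC command, and it assumes the process keeps every core fully busy the whole time, so treat the result as a rough estimate.

```bash
# Hypothetical helper: approximate wall-clock minutes before a dev-node
# process reaches the 2-hour CPU-time limit, assuming it keeps the given
# number of cores fully busy. Uses integer division.
dev_node_wall_minutes() {
    cores=$1
    limit_minutes=120   # 2 CPU hours
    echo $(( limit_minutes / cores ))
}

dev_node_wall_minutes 1   # prints 120
dev_node_wall_minutes 2   # prints 60
dev_node_wall_minutes 8   # prints 15
```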

How do I check my CPU or GPU time usage?

Run the command SLURMUsage for both CPU and GPU.

NOTE: This usage does not include time from jobs that ran on buyin nodes.

If you would like to get full usage data, including all buyin usage, you can run sreport to get the information for specific date ranges: sreport job SizesByAccount Users=$USER start=2023-01-01 end=now -t hour. This report is broken down by job size and all columns should be summed for the total usage in hours.
A detailed accounting report can be generated with sacct -X --duplicates -u $USER -S 2023-01-01 -E 2023-04-03 -o jobid,ncpus,elapsedraw,CPUTimeRaw. Save this output to a file, sum the CPUTimeRaw column, and divide by 3600 to get total CPU hours. CPUTimeRaw is reported in seconds and is equal to ncpus * elapsedraw.
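
One possible way to total the saved sacct output is sketched below. sum_cpu_hours is a hypothetical helper, and it assumes the report was saved with the extra sacct options -n (no header) and -P (fields separated by "|"), which are not part of the command above. Because CPUTimeRaw is reported in seconds, the sum is divided by 3600 to get hours.

```bash
# Hypothetical helper: total CPU hours from a saved sacct report.
# Assumes the file was produced with something like
#   sacct -X --duplicates -n -P -u $USER -S 2023-01-01 -E 2023-04-03 \
#         -o jobid,ncpus,elapsedraw,CPUTimeRaw > usage.txt
# so each line looks like: jobid|ncpus|elapsedraw|CPUTimeRaw
sum_cpu_hours() {
    # Field 4 is CPUTimeRaw in seconds; convert the total to hours.
    awk -F'|' '{ total += $4 } END { printf "%.2f\n", total / 3600 }' "$1"
}

# Example with made-up records:
sample=$(mktemp)
printf '%s\n' '1001|4|3600|14400' '1002|2|1800|3600' > "$sample"
sum_cpu_hours "$sample"   # prints 5.00
rm -f "$sample"
```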

How do I check HPCC node usage?

Users can see this information by simply running the node_status command on any dev node. We also offer a web-based dashboard at https://icer.msu.edu/dashboard.

Storage and files

Quota

Quota issues writing to research spaces

Many users have reported problems copying or transferring files to their research space. Although the research space has plenty of room left, they get the following error message:

failed to ... Disk quota exceeded

This problem may occur because your primary group is not set to match the research space, or because the folders you copy or transfer files to have incorrect group ownership or lack the set-group-ID bit. Please read the instructions for using a research space, in particular, point 5.
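
As a quick diagnostic, the two folder conditions described above can be checked from the shell. The sketch below is a hypothetical helper (check_research_dir is not an HPCC tool) and assumes GNU stat, as found on typical Linux systems.

```bash
# Hypothetical helper: report whether a directory is owned by the
# expected group and has the set-group-ID bit set.
# Usage: check_research_dir <directory> <expected_group>
check_research_dir() {
    dir=$1
    want=$2
    have=$(stat -c '%G' "$dir")   # owning group name (GNU stat)
    if [ "$have" != "$want" ]; then
        echo "group is $have, expected $want"
    elif [ ! -g "$dir" ]; then
        echo "set-group-ID bit is not set"
    else
        echo "ok"
    fi
}

# Example (placeholders, substitute your own path and group):
#   check_research_dir /path/to/research/folder my_research_group
```

If the check reports a problem, chgrp and chmod g+s are the standard tools for changing group ownership and setting the bit; the research space instructions describe the recommended procedure.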

Begin by checking your storage usage with the quota command. Then compare with the results from running the file-count powertool:

module load powertools
file-count

For more detailed information including a count of files in each subdirectory, use

file-count --detail

If you find that you are over quota, please delete files, move them to another location (like a research space or if they are temporary, scratch space), or move them off of the HPCC. If this resolves the issue, then you may consider keeping your files in a different location.

If you don't see a change in your quota, if the numbers reported by quota and file-count differ greatly, or if your quota shows unrealistic numbers like negative or extremely large file counts, please contact ICER, as this is likely the result of an unresolved issue with one of the HPCC's storage systems.

Sometimes exceeding your quota can stop you from being able to log in or access systems like OnDemand because they require writing a small file. If this is the case, please contact ICER.

Why are my files in the scratch space gone?

Files in scratch are automatically purged if the last changed time is older than 45 days. Note that the scratch spaces are not intended for long-term storage. Files saved in scratch have no back-up.
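
To see which files may be approaching the purge, something like the sketch below can help. list_purge_candidates is a hypothetical helper, not an HPCC tool. It assumes GNU find and uses modification time (-mtime) as a stand-in for the timestamp the purge checks, which may differ, so treat the output as an approximation.

```bash
# Hypothetical helper: print regular files under a directory whose last
# modification is more than a given number of days old (default 45).
# Assumption: modification time approximates the timestamp used by the
# purge; the purge itself may consult a different one.
list_purge_candidates() {
    dir=$1
    days=${2:-45}
    find "$dir" -type f -mtime +"$days"
}

# Example:
#   list_purge_candidates /mnt/scratch/$USER
```

Note that find counts whole 24-hour periods, so -mtime +45 matches files that are at least 46 days old.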

How do I copy files from/to my Microsoft OneDrive/Google Drive?

Rclone is currently installed on the HPCC. This software supports research in the cloud and helps HPCC users sync files and directories between MSU’s HPCC and their cloud storage, including OneDrive and Google Drive. Please refer to our Rclone documentation.

What is HPCC's data protection policy?

All of the HPCC's shared storage systems are protected against individual drive and storage node failure (using RAID and highly available, active-active servers).

We maintain an offsite disaster recovery system for users' home and research directories. We do not archive users' scratch spaces nor the persistent 'nodr' space.

Our goal is to maintain hourly snapshots for the last 24 hours and 60 days of file history on the disaster recovery servers. However, when there is a significant amount of data written, there may be a delay in copying updated data to the disaster recovery servers. Users who have hard backup requirements should consider using MSU's Data Storage Finder.

Users may request older versions of their files via the contact forms.

Does HPCC offer a cheaper long-term archiving plan?

We do not. However, MSU offers the Data Storage Finder (https://data-storage-finder.tech.msu.edu, on-campus only). There are several possible options for data archiving.

I am using cloud storage. What is the HPCC's IP address so I can limit access?

Requests from the HPCC will come from the 35.9.12.0/24 and 35.12.240.0/24 subnets. The rsync gateway is part of the 35.12.240.0/24 range.
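
If you want to check addresses seen in your cloud provider's access logs against these ranges, a prefix match is enough, because membership in a /24 subnet depends only on the first three octets. The sketch below is a hypothetical helper (is_hpcc_address is not an HPCC command) and does not verify that its input is a well-formed IPv4 address.

```bash
# Hypothetical helper: report whether an IPv4 address falls inside one
# of the two /24 subnets listed above, by matching the first three octets.
is_hpcc_address() {
    case $1 in
        35.9.12.*|35.12.240.*) echo yes ;;
        *) echo no ;;
    esac
}

is_hpcc_address 35.9.12.17     # prints yes
is_hpcc_address 35.12.240.200  # prints yes
is_hpcc_address 35.9.120.17    # prints no
```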

Submitting jobs and running code

I have a buyin account. Do I need to specify it when I submit jobs?

No, unless your PI has requested that it be opt-in instead of the default. When you submit a job without specifying an account, your default account is used. You can check your default account using the "buyin_status -l" command; for buyin users, the default is their buyin account. We recommend you read this if you have purchased buyin nodes.

Do you support running GPU jobs?

Yes. There are three GPU dev-nodes and a series of compute nodes in the cluster; see Cluster resources.

What does the message "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions" mean after my job is submitted?

Once a job is submitted, the scheduler adds it to its calculations and keeps updating the job's status as the system runs. The status reflects the current state of the scheduler, so this message will change once the scheduler has found a place to put the job. There are always some nodes that are down or drained due to normal maintenance; the important part is "reserved for jobs in higher priority partitions", which simply indicates that the scheduler has not yet found a time to run the job.

Why did I get an "Illegal Instruction" error?

This is usually because a program was compiled on a newer CPU architecture (e.g., amd24) but then run on an older one (e.g., intel16). Our system has a range of CPUs, and the newest versions support new instructions not available on the older CPUs. If you receive this error using one of ICER's software modules, make sure that your SLURM script starts with

#!/bin/bash --login

This will ensure that your script undergoes the proper setup to use the software that matches the CPUs where your jobs are running.

For more advice and troubleshooting steps, see our Lab Notebook on Architecture Specific Compilation.

Why do I get a `module: not found` error in my slurm output?

Job scripts that start with #!/bin/sh will result in errors like

/var/lib/slurmd/jobXXXXXXXX/slurm_script: XX: module: not found.
Please change this line to #!/bin/bash --login.

/bin/sh is a symbolic link in modern Linux distributions and does not always point to the same shell. It is safer to call the bash shell explicitly by starting your script with #!/bin/bash --login.
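
Before submitting, you can confirm that a job script starts with the recommended line. The sketch below is a hypothetical helper (check_job_shebang is not an HPCC tool) that compares only the first line of the file against #!/bin/bash --login.

```bash
# Hypothetical helper: check that the first line of a job script is
# exactly the recommended shebang.
check_job_shebang() {
    first_line=$(head -n 1 "$1")
    if [ "$first_line" = '#!/bin/bash --login' ]; then
        echo "ok"
    else
        echo "first line is '$first_line'; expected '#!/bin/bash --login'"
    fi
}

# Example (my_job.sb is a placeholder file name):
#   check_job_shebang my_job.sb
```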

What does "OOM" mean in an error message about my job?

If you see errors in your job that contain the characters "OOM", "oom" or "oom_kill", it means your job ran Out Of Memory. You should try requesting more memory in your job specification. Check our SLURM job guide for information on specifying memory.
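
As an illustration, memory is requested in the header of a job script with the SLURM --mem option, which sets the memory required per node. The values below are placeholders chosen for this sketch, not recommendations; pick them based on what your program actually needs.

```bash
#!/bin/bash --login
#SBATCH --time=01:00:00       # wall-time limit (placeholder value)
#SBATCH --cpus-per-task=1     # cores for the task (placeholder value)
#SBATCH --mem=8G              # memory per node; raise this after an OOM failure

echo "Job started on $(hostname)"
# The commands that run your program go here.
```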

Software and modules

I want to install software packages. What should I do?

The HPCC has a lot of software installed already. Search for the software you want to install using module spider <software name>, then follow the instructions provided by the output to use module load and use the software. See our documentation on this subject here. We have additional documentation on the module system here.

If the software is not present, you can submit a ticket. However, we encourage users to install software on their own, if possible. The HPCC has provided numerous versions of compilers and libraries which should accommodate the vast majority of software across different fields.

If you are thinking of requesting the system-wide installation of a piece of software, we strongly recommend you check the following factors when submitting a request for software installation:

How popular is the software? If it is not popular, are there other users on the HPCC who would also be using it? If you are the only one using it, we recommend installing it in your home directory.
What type of license agreement does the software have? Some software licenses may restrict use even when they are free. Examples include software with export control, specific end-user license agreement, etc. When software licenses restrict use, we typically recommend the user directly make an agreement with the software provider to obtain and install it in their home directory. If it will be used by a group of people, HPCC system administrators can help with setting up the group access in compliance with the license agreement.
Is the software well maintained and up-to-date? If the software you wish to install is legacy software or is not being well maintained, chances are its installation will require older versions of its dependencies as well. The effort to install this software may then be greater than the effort required to find up-to-date software with the same, similar, or even better functionality. It may be time to consider transitioning to newer software.
Is the software available through EasyBuild with an ICER supported toolchain? ICER uses a tool called EasyBuild to install software that is provided via "recipe" files or EasyConfigs. See the entire list of available EasyConfigs. Note that software is installed with a "toolchain" (see our EasyBuild tutorial for more information). ICER only officially supports software installed under the following toolchains or subtoolchains (including gfbf, gompi, iimpi, iimkl):
foss/2022b
foss/2023a
foss/2023b
intel/2022b
intel/2023a
intel/2023b

Why did my "module load" command output errors?

There are many reasons that errors occur when you try loading a module. However, the most common cause is that you have forgotten to run module purge. Sometimes, module spider can also fail to find the module. Most likely it's because your personal module cache is out of date. To clear it, run rm ~/.cache/lmod/spider*.

What should I do when I cannot load modules?

See How to find and load software modules.

The `module` command is missing in VS Code

Try running source /etc/profile on login. See also this issue on VS Code's GitHub.

What is powertools?

The powertools module is a collection of software tools and examples that allows researchers to better utilize HPC systems. Powertools was created to help advanced users use the HPCC more effectively. To learn more about powertools, run the command powertools.

OnDemand

The OnDemand job composer doesn't work

The OnDemand server is in the process of being upgraded to match the new operating system. Until then, the job composer is not functional. Please contact us for help writing and submitting job scripts in the meantime.

When I use the OnDemand Interactive Desktop, I get the error "The panel encountered a problem while loading 'IndicatorAppletCompleteFactory::IndicatorAppletComplete'. Do you want to delete the applet from your configuration?"

Click the "Delete" option, and this error will not return in the future.

I can't open Firefox from an Interactive Desktop in OnDemand

Run mv ~/.mozilla ~/.mozilla_backup.

Python and Conda

How do I use Python on the HPCC?

There are two methods: users can install their own version of Python with Conda or use the versions of Python installed on the HPCC system. See here.

I have a Python conflict. What should I do to resolve it?

Upon login to a dev-node, a default module list will load automatically. Since Python/3.6.4 is included in the list, it can interfere with a user's conda environment. As a consequence, your program may not be able to find packages installed in your conda environment even if it has been activated. In other words, the program still picks up Python/3.6.4 in the module system. The solution is to run module unload Python before activating the conda environment.

How do I deactivate the Conda base environment?

Many users have reported that after a local installation of Conda on the HPCC, their login prompt changes to something starting with (base) -bash-4.2$. This is because conda activates the default environment, base, upon startup. To disable this behavior, which often results in conflicts with system defaults, users can run the following command:

conda config --set auto_activate_base False

I tried to start a Jupyter Notebook through OnDemand, but my job will not start or will not recognize my Conda environment

All Conda environments used with the Jupyter Notebook OnDemand app must have Jupyter installed. Without this, the OnDemand job status will stay stuck on

Your session is currently starting... Please be patient as this process can take a few minutes.

before moving to

For debugging purposes, this card will be retained for 6 more days

without giving the chance to start the notebook. Depending on the setup, the job may start, but the environment will not be properly recognized and the app will fall back to the default version installed on the HPCC.

To install Jupyter in your Conda environment on the command line, activate it first by running

module load Conda/3
conda activate <environment_name>

and then run

conda install jupyter

I tried to plot with matplotlib in Python, but got the error "No module named '_tkinter'"

If you use the default python module (Python/3.11.3-GCCcore-12.3.0) on a dev-node, you need to load the Tkinter module before using python in order to proceed without errors. Run: module load Tkinter/3.11.3-GCCcore-12.3.0

What is the correct Python shebang so that the module system Python is used?

#!/usr/bin/env python is the recommended shebang for Python scripts that you are running directly on the HPCC using our installed Python modules. Then run your script with python script.py, on the command line or in job submission scripts.

R and RStudio Server

When I start RStudio Server, all I see is a gray blank screen.

Open a command line and run

mv ~/.local/share/rstudio ~/.local/share/rstudio.backup

This will move your RStudio configuration files to a backup. Note that this will likely reset your RStudio session, so you may need to reopen previous projects and files, and could lose any unsaved work. See this Lab Notebook for more details.

I can't install the R package "Matrix" or other packages that need it like "ggplot2"

Recent versions of the R package Matrix (which is a dependency for many other packages including ggplot2) are incompatible with versions of R earlier than 4.4.0 (i.e., all of the versions installed on the new operating system). We recommend using the R-bundle-CRAN module instead of R, since it includes a pre-installed version of Matrix. If you need to install it yourself using install.packages, use the command

install.packages("https://cran.r-project.org/src/contrib/Archive/Matrix/Matrix_1.6-5.tar.gz", repos=NULL, type="source")

Getting help

Can you keep me posted on the current status of the HPCC?

Yes. Users are encouraged to follow the HPCC Announcements blog to keep updated on the status of HPCC (such as scheduled downtimes and urgent notices).

We do not access your directories to view files or test your code. Please attach your files to your reply to the ticket email.