Frequently Asked Questions (FAQ)

This page lists many of our frequently asked questions. Please search for keywords related to an issue by using Ctrl+F (on Windows/Linux) or Cmd+F (on Mac), or scroll through the list of questions in the table of contents to the right.

If you don't see an answer to your question, please contact us.

Logging in and accessing the HPCC

What is my HPCC user name/password?

If you are affiliated with MSU, then your MSU NetID is your user name, and your NetID password is your HPCC password. This is the same as those for all the MSU online services. An HPCC account must be requested by an MSU faculty member at https://contact.icer.msu.edu/account

There are two ways you can be blocked by entering an incorrect password too many times. The authentication on the HPCC is directly tied to MSU. If you attempt an incorrect password too many times, you may need to request a password reset at https://netid.msu.edu/netid/password/index.html. The HPCC also maintains blocks from hosts with too many failed attempted SSH connections. Users that can log into other MSU resources (Spartan365, D2L, EBS) but are unable to connect to the HPCC should submit a ticket on our contact forms. Be sure to include your external IP address; you can check it with Google.

I used to be able to connect to the HPCC server, but now I can't. Why?

There can be multiple reasons for this, such as system downtime (so please check the ICER blog first). Another common reason is account expiry. The HPCC periodically disables users who are no longer affiliated with the university or registered with a class for which the instructor has created temporary student accounts. To re-activate your HPCC account, please have your PI submit a sponsoring form at https://contact.icer.msu.edu/sponsoredrenewal

I get a "Permission denied" error, but I put in the right password. What's wrong?

If you are attempting to connect to the rsync.hpcc.msu.edu server, this requires an SSH key pair. See our documentation for how to generate a key pair here. Otherwise, see the question above.

Can I use the HPCC through a web browser?

Yes, we provide Open OnDemand, a web portal for easy web access to the HPCC. Check out this tutorial.

Limits and usage

Are there any limits per user on using the HPCC resources?

Dev node limits

Each process on a dev-node is limited to 2 CPU hours. If you are running a multi-threaded program, the wall time limit would be (roughly) 2 hours divided by the number of threads.

Limits on storage

Each user has up to 1 TB of storage for free and 1 million files, for each of the home and research directories. Beyond 1 TB, the cost is $89 per TB per year for MSU users.
For scratch space (i.e. /mnt/scratch/<your_user_name>), 50 TB is the maximum; more may be requested via the contact forms with center director approval.
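As a rough illustration of the storage pricing above, the sketch below estimates the yearly cost of a home or research directory. The function name yearly_storage_cost is made up for this example, and it assumes usage is given in whole TB with every TB beyond the free 1 TB billed at the full $89 rate; the actual billing granularity may differ.

```shell
#!/bin/bash
# Estimate the yearly cost (in dollars) of a home or research directory
# for MSU users.
# Assumptions: usage is given in whole TB, and each TB beyond the free
# 1 TB is billed at the full yearly rate.
yearly_storage_cost() {
    local total_tb=$1
    local free_tb=1 rate=89
    if [ "$total_tb" -le "$free_tb" ]; then
        echo 0
    else
        echo $(( (total_tb - free_tb) * rate ))
    fi
}

yearly_storage_cost 1   # within the free allocation
yearly_storage_cost 5   # 4 billable TB
```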

Limits on cluster usage

the longest wall time you can request is 7 days;
the maximum number of CPU cores you can use is 1040 at any one time (see SLURM variable QOSMaxCpuPerUserLimit), unless you have a larger buy-in and your PI has requested that your buy-in account only run on the buy-in nodes;
a maximum of 1000 jobs can be queued and 520 can be running at any one time (except in the scavenger queue);
non-buyin users have a maximum of 500,000 CPU hours per year.

How does the dev-node limit work?

When you connect to any of the HPCC's dev-nodes, you will see the following message:

processes on development nodes are limited to two hours of CPU time.

The two hour CPU time limit is for each process you run on that dev-node. If one process uses CPU time greater than 2 hours, then only that process will be killed. You can, however, still connect to that dev-node, and run another process. Additionally, if your process uses 100% CPU (1 core), it will be terminated in two hours. If your process uses 200% CPU (2 cores), it will be terminated in one hour, and so on.
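The relationship between CPU usage and wall time described above can be sketched as a small calculation. The function name dev_node_wall_minutes is made up for this example, and it assumes every thread stays at 100% CPU for the whole run, so real processes will often last longer than the estimate.

```shell
#!/bin/bash
# Estimate the wall-clock minutes a dev-node process can run before it
# reaches the 2-hour CPU-time limit, given how many cores it keeps busy.
# Assumption: every thread runs at 100% CPU the whole time.
dev_node_wall_minutes() {
    local threads=$1
    local cpu_limit_minutes=120   # 2 CPU hours, from the login message
    echo $(( cpu_limit_minutes / threads ))
}

dev_node_wall_minutes 1   # one busy core
dev_node_wall_minutes 2   # two busy cores
dev_node_wall_minutes 8   # eight busy cores
```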

How do I check my CPU or GPU time usage?

Run the command SLURMUsage for both CPU and GPU.

NOTE: This time usage does not include time that was submitted to a buyin node.

If you would like to get full usage data, including all buyin usage, you can run sreport to get the information for specific date ranges: sreport job SizesByAccount Users=$USER start=2023-01-01 end=now -t hour. This report is broken down by job size and all columns should be summed for the total usage in hours.
A detailed accounting report can be generated with sacct -X --duplicates -u $USER -S 2023-01-01 -E 2023-04-03 -o jobid,ncpus,elapsedraw,CPUTimeRaw. Save this output to a file and sum the CPUTimeRaw column. CPUTimeRaw is equal to ncpus * elapsedraw and is reported in CPU-seconds, so divide the sum by 3600 to get total CPU hours.
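One possible way to total the CPUTimeRaw column is with awk. The sketch below is not an official ICER tool: the function name sum_cpu_hours and the sample rows are made up for illustration, and it assumes pipe-delimited input without a header (which sacct produces when the -n and -P options are added).

```shell
#!/bin/bash
# Sum the CPUTimeRaw column (4th field, in CPU-seconds) of pipe-delimited
# sacct output read from stdin and print the total in CPU hours.
sum_cpu_hours() {
    awk -F'|' '{ total += $4 } END { printf "%.2f\n", total / 3600 }'
}

# Made-up sample rows in the form jobid|ncpus|elapsedraw|CPUTimeRaw,
# standing in for real sacct output.
sample='1001|4|3600|14400
1002|1|1800|1800
1003|8|900|7200'

printf '%s\n' "$sample" | sum_cpu_hours
```

On the HPCC, the sample data would be replaced by piping the sacct command above, with -n -P added, into sum_cpu_hours.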

How do I check HPCC node usage?

Users can see this information by simply running the node_status command on any dev node. We also offer a web-based dashboard at https://icer.msu.edu/dashboard.

Storage and files

Quota

Quota issues writing to research spaces

Many users have reported problems copying or transferring files to their research space. Although the research space still has plenty of free space, they get the following error message:

failed to ... Disk quota exceeded

This problem may occur because your primary group is not set to match the research space, or because the folders you copy or transfer files to have incorrect group ownership or lack the set-group-ID bit. Please read the instructions for using a research space, in particular, point 5.
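To show what the set-group-ID bit looks like, here is a minimal sketch that sets and checks it on a throwaway directory. It uses only standard commands (mktemp, chmod, ls); the temporary directory stands in for a folder in your research space, and the linked research space instructions remain the authoritative procedure.

```shell
#!/bin/bash
# Demonstrate the set-group-ID bit on a throwaway directory. On the HPCC
# the target would be a folder inside your research space instead.
demo_dir=$(mktemp -d)

# Files created inside a directory with the setgid bit inherit the
# directory's group rather than the creator's primary group.
chmod g+s "$demo_dir"

# An "s" or "S" in the group permission triplet confirms the bit is set.
ls -ld "$demo_dir"

if [ -g "$demo_dir" ]; then
    echo "setgid is set"
fi
```

If a folder's group is wrong, chgrp can change it to the research group before applying chmod g+s; the correct group name depends on your research space.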

Begin by checking your storage usage with the quota command. Then compare with the results from running the file-count powertool:

module load powertools
file-count

For more detailed information including a count of files in each subdirectory, use

file-count --detail

If you find that you are over quota, please delete files, move them to another location (like a research space or if they are temporary, scratch space), or move them off of the HPCC. If this resolves the issue, then you may consider keeping your files in a different location or asking for more space in your home directory.

If you don't see a change in your quota, the numbers reported by quota and file-count differ greatly, or your quota shows unrealistic numbers such as negative or extremely large file counts, please contact ICER, as this is likely the result of an unresolved issue with one of the HPCC's storage systems.

Sometimes exceeding your quota can stop you from being able to log in or access systems like OnDemand because they require writing to a small file. If this is the case, please contact ICER.

Why are my files in the scratch space gone?

Files in scratch are automatically purged if the last changed time is older than 45 days. Note that the scratch spaces are not intended for long-term storage. Files saved in scratch have no back-up.
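To spot files that may be approaching the purge window, something like the following sketch can help. The function name list_old_files is made up for this example, and it checks modification time as an assumption; the purge itself may key on a different timestamp. The demo runs on a temporary directory and relies on GNU touch for the -d option.

```shell
#!/bin/bash
# List regular files under a directory that were last modified more than
# a given number of days ago.
# Assumption: modification time (-mtime) is used here; the purge itself
# may key on a different timestamp.
list_old_files() {
    local dir=$1 days=$2
    find "$dir" -type f -mtime +"$days"
}

# Self-contained demo on a temporary directory (touch -d needs GNU coreutils).
demo_dir=$(mktemp -d)
touch -d "50 days ago" "$demo_dir/old.dat"
touch "$demo_dir/new.dat"

list_old_files "$demo_dir" 45
```

On the HPCC, running list_old_files /mnt/scratch/$USER 30 would list files that have gone untouched for over 30 days, leaving time to move them elsewhere.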

How do I copy files from/to my Microsoft OneDrive/Google Drive?

Rclone is currently installed on the HPCC. This software lets HPCC users sync files and directories between MSU’s HPCC and their cloud storage, including OneDrive and Google Drive. Please refer to the Rclone documentation.

What is HPCC's data protection policy?

All of the HPCC's shared storage systems are protected against individual drive and storage node failure (using RAID and highly available, active-active servers).

We maintain an offsite disaster recovery system for users' home and research directories. We do not archive users' scratch spaces nor the persistent 'nodr' space.

Our goal is to maintain hourly snapshots for the last 24 hours and 60 days of file history on the disaster recovery servers. However, when there is a significant amount of data written, there may be a delay in copying updated data to the disaster recovery servers. Users who have hard data protection requirements should consider using MSU's Data Storage Finder.

Users may request older versions of their files via the contact forms.

Does HPCC offer a cheaper long-term archiving plan?

We do not. However, MSU offers the Data Storage Finder (https://data-storage-finder.tech.msu.edu, on-campus only). There are several possible options for data archiving.

Submitting jobs and running code

I have a buyin account, do I need to specify it when I submit jobs?

No, unless your PI has requested that it be opt-in instead of the default. When submitting a job without specifying an account, your default account is used. You can check your default account using the "buyin_status -l" command; a buyin user's default is their buyin account. We recommend you read this if you have purchased buyin nodes.

Do you support running GPU jobs?

Yes. There are three GPU dev-nodes and a series of compute nodes in the cluster; see Cluster resources.

What does the message "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions" mean after my job is submitted?

Once a job is submitted the scheduler adds it to the calculations and continues to update the status of the job as the system works. The status for a job will reflect the current state of the scheduler, so you will see this message update once the scheduler has found a place to put the job. There are always some nodes which are down or drained in the cluster due to normal maintenance, but the "reserved for jobs in higher priority partitions" is the important part, and simply indicates that the scheduler has not yet found a time to schedule the job. This will update as the scheduler continues to function.

Why did I get an "Illegal Instruction" error?

This is usually because a program was compiled on a newer CPU architecture (e.g., intel18) but then run on an older one (e.g., intel14). Our system has a range of CPUs, and the newest versions support new instructions not available on the older CPUs. One short-term fix is to run programs on the same CPU that they were compiled on. Based on our experience, this error has occurred only on intel14 nodes, so you should avoid them. That is, for dev-node testing, pick one from dev-intel16, dev-intel16-k80 and dev-intel18. For job submission, add #SBATCH --constraint="[intel16|intel18]" in your SLURM script.
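For context, a job script using this constraint might look like the sketch below. Only the --constraint line comes from the answer above; the time and task values and the my_program executable are placeholders for illustration. The #SBATCH lines are ordinary comments to bash, so the script also runs outside SLURM.

```shell
#!/bin/bash
# Sketch of a job script; the resource values below are placeholders.
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --constraint="[intel16|intel18]"

# Report which node and CPU model the job landed on, which helps when
# checking the architecture a binary was built or run on.
node=$(uname -n)
echo "Running on $node"
grep -m 1 "model name" /proc/cpuinfo || echo "CPU model not reported"

# ./my_program   # hypothetical executable built on an intel16 or intel18 node
```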

Software and modules

I want to install software packages, what should I do?

The HPCC has a lot of software installed already. Search for the software you want to install using module spider <software name>, then follow the instructions provided by the output to use module load and use the software. See our documentation on this subject here. We have additional documentation on the module system here.

If the software is not present, you can submit a ticket. However, we encourage users to install software on their own, if possible. The HPCC has provided numerous versions of compilers and libraries which should accommodate the vast majority of software across different fields.

If you are thinking of requesting the system-wide installation of a piece of software, we strongly recommend you consider the following factors before submitting the request:

How popular is the software? If it is not a popular software, are there other users on HPCC who would also be using it? If you are the only one using it, we would recommend it be installed in your home directory.
What type of license agreement does the software have? Some software licenses may restrict use even when they are free. Examples include software with export control, specific end-user license agreement, etc. When software licenses restrict use, we typically recommend the user directly make an agreement with the software provider to obtain and install it in their home directory. If it will be used by a group of people, HPCC system administrators can help with setting up the group access in compliance with the license agreement.
Is the software well maintained and up-to-date? If the software you wish to install is legacy software or is not being well maintained, chances are its installation will require an older version of its dependencies as well. The effort to install this software may then be greater than the effort required to find an up-to-date software with the same, similar, or even better functionality. It may be time to consider transitioning to using a newer software.

Why did my "module load" command output errors?

There are many reasons that errors occur when you try loading a module. However, the most common cause is that you have forgotten to run module purge. Sometimes, module spider can also fail to find the module. Most likely it's because your personal module cache is out of date. To clear it, run rm -r ~/.lmod.d/.cache.

What should I do when I cannot load modules?

See How to find and load software modules.

What is powertools?

The powertools module is a collection of software tools and examples that allows researchers to better utilize HPC systems. Powertools was created to help advanced users use the HPCC more effectively. To learn more about powertools, run the command powertools.

Python and Conda

How do I use Python on the HPCC?

There are two methods: users can install their own version of Python with Anaconda or use the versions of Python installed on the HPCC system. See here.

I have a Python conflict. What should I do to resolve it?

Upon login to a dev-node, a default module list will load automatically. Since Python/3.6.4 is included in the list, it can interfere with a user's conda environment. As a consequence, your program may not be able to find packages installed in your conda environment even if it has been activated. In other words, the program still picks up Python/3.6.4 in the module system. The solution is to run module unload Python before activating the conda environment.

How do I deactivate Conda base environment?

Many users have reported that after a local installation of Anaconda on the HPCC, their login prompt changes to something starting with (base) -bash-4.2$. This is because conda activates the default environment, base, upon startup. To disable this behavior, which often results in conflicts with system defaults, users can run the following command:

conda config --set auto_activate_base False

I tried to start a Jupyter Notebook through OnDemand, but my job will not start or will not recognize my Conda environment

All Conda environments used with the Jupyter Notebook OnDemand app must have Jupyter installed. Without this, the OnDemand job status will stay stuck on

Your session is currently starting... Please be patient as this process can take a few minutes.

before moving to

For debugging purposes, this card will be retained for 6 more days

without giving the chance to start the notebook. Depending on the setup, the job may start, but the environment will not be properly recognized and the app will fall back to the default version installed on the HPCC.

To install Jupyter in your Conda environment on the command line, activate it first by running

module load Conda/3
conda activate <environment_name>

and then run

conda install jupyter

I tried to use python matplotlib to plot, but got an error of "No module named '_tkinter'"

If you use the default python module (/opt/software/Python/3.6.4-foss-2018a/bin/python) on a dev-node, you need to load the Tkinter module before using python in order to proceed without errors. Run: module load Tkinter/3.6.4-Python-3.6.4

Getting help

Can you keep me posted on the current status of the HPCC?

Yes. Users are encouraged to follow the HPCC Announcements blog to keep updated on the status of HPCC (such as scheduled downtimes and urgent notices).

We do not go to your directory to view files or test your code for that matter. Please send your files along with your reply to the ticket email.