Frequently Asked Questions (FAQ)
This page lists many of our frequently asked questions. Please search for keywords related to an issue by using Ctrl+F (on Windows/Linux) or Cmd+F (on Mac), or scroll through the list of questions in the table of contents to the right.
If you don't see an answer to your question, please contact us.
Table of contents
- Logging in and accessing the HPCC
- Limits and usage
- Storage and files
- Submitting jobs and running code
- Software and modules
- OnDemand
- Python and Conda
- R and RStudio Server
- Getting help
Logging in and accessing the HPCC
What is my HPCC user name/password?
If you are affiliated with MSU, your MSU NetID is your user name and your NetID password is your HPCC password. These are the same credentials you use for all other MSU online services. An HPCC account must be requested by an MSU faculty member at https://contact.icer.msu.edu/account
Can I reset my password on the HPCC because my login got denied after multiple failed attempts?
There are two ways you can be blocked after entering an incorrect password too many times. First, authentication on the HPCC is directly tied to MSU: if you enter an incorrect password too many times, you may need to request a password reset at https://netid.msu.edu/netid/password/index.html. Second, the HPCC also blocks hosts with too many failed SSH connection attempts. Users who can log into other MSU resources (Spartan365, D2L, EBS) but are unable to connect to the HPCC should submit a ticket through our contact forms. Be sure to include your external IP address, which you can look up with a Google search.
I used to be able to connect to the HPCC server, but now I can't. Why?
There can be multiple reasons for this, such as system downtime (so please check the ICER blog first). Another common reason is account expiry. The HPCC periodically disables users who are no longer affiliated with the university or registered with a class for which the instructor has created temporary student accounts. To re-activate your HPCC account, please have your PI submit a sponsoring form at https://contact.icer.msu.edu/sponsoredrenewal
I get a "Permission denied" error, but I put in the right password. What's wrong?
If you are attempting to connect to the rsync.hpcc.msu.edu server, this requires an SSH key pair. See our documentation for how to generate a key pair here. Otherwise, see the question above.
I get an error like "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!"
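The following errors can occur when the HPCC upgrades a development node or changes its identifying information (called "host keys"): the "REMOTE HOST IDENTIFICATION HAS CHANGED" warning quoted above, or a message that the authenticity of the host can't be established, followed by a prompt asking whether you want to continue connecting.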
To fix the first error, verify that the fingerprint matches one of the current host identification keys, then remove the outdated key by running ssh-keygen -R <hostname> on your local computer, not the HPCC, where <hostname> is the host named in the warning. Alternatively, you can delete the file ~/.ssh/known_hosts to reset all host keys. This will result in receiving the second message any time you SSH to any other computer (even ones outside of ICER, like GitHub) until you've accepted the connection again.
To accept the new connection and fix the second error, enter "yes" and hit enter.
Can I use the HPCC through a web browser?
Yes, we provide Open OnDemand, a web portal for easy access to the HPCC. Check out this tutorial.
Limits and usage
Are there any limits per user on using the HPCC resources?
Dev node limits
Each process on a dev-node is limited to 2 CPU hours. If you are running a multi-threaded program, the wall time limit would be (roughly) 2 hours divided by the number of threads.
Limits on storage
- Each user has up to 1 TB of storage for free and 1 million files, for each of the home and research directories. Beyond 1 TB, the cost is $89 per TB per year for MSU users.
- For scratch space (i.e., /mnt/scratch/<your_user_name>), 50 TB is the maximum; more may be requested via the contact forms with center director approval.
Limits on cluster usage
- the longest wall time you can request is 7 days;
- the maximum number of CPU cores you can use is 1040 at any one time (see SLURM variable QOSMaxCpuPerUserLimit), unless you have a larger buy-in and your PI has requested that your buy-in account only run on the buy-in nodes;
- at most 1000 jobs can be queued and 520 can be running at any one time (except in the scavenger queue);
- non-buyin users have a maximum of 500,000 CPU hours per year.
Can you tell me more about the dev-node limit?
When you connect to any of the HPCC's dev-nodes, you will see the following message:
processes on development nodes are limited to two hours of CPU time.
The two hour CPU time limit is for each process you run on that dev-node. If one process uses CPU time greater than 2 hours, then only that process will be killed. You can, however, still connect to that dev-node, and run another process. Additionally, if your process uses 100% CPU (1 core), it will be terminated in two hours. If your process uses 200% CPU (2 cores), it will be terminated in one hour, and so on.
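The arithmetic above can be sketched as a small shell function. This is only an illustration of the rule of thumb (2 CPU hours divided by the number of cores in use); the function name is made up for this example and is not an HPCC tool.

```shell
# Rough wall-time estimate for one dev-node process, based on the
# 2 CPU-hour per-process limit described above.
# Usage: dev_node_walltime_minutes <cpu_percent>   (100 = one full core)
dev_node_walltime_minutes() {
    cpu_percent=$1
    limit_minutes=120   # 2 CPU hours, expressed in minutes
    # wall time = CPU-time limit / (number of cores in use)
    echo $(( limit_minutes * 100 / cpu_percent ))
}

dev_node_walltime_minutes 100   # one core
dev_node_walltime_minutes 200   # two cores
dev_node_walltime_minutes 400   # four cores
```

The three calls print the approximate number of minutes before a process using one, two, or four cores would be terminated.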
How do I check my CPU or GPU time usage?
Run the command SLURMUsage for both CPU and GPU.
NOTE: This time usage does not include time that was submitted to a buyin node.
- If you would like to get full usage data, including all buyin usage, you can run sreport to get the information for specific date ranges: sreport job SizesByAccount Users=$USER start=2023-01-01 end=now -t hour. This report is broken down by job size, and all columns should be summed for the total usage in hours.
- A detailed accounting report can be generated with sacct -X --duplicates -u $USER -S 2023-01-01 -E 2023-04-03 -o jobid,ncpus,elapsedraw,CPUTimeRaw. This output should be saved to a file and the CPUTimeRaw column summed for the total usage. CPUTimeRaw is equal to ncpus * elapsedraw and is reported in seconds, so divide the sum by 3600 to get hours.
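One possible way to sum the CPUTimeRaw column of a saved sacct report is a short awk filter. The sketch below assumes the default whitespace-separated sacct layout with CPUTimeRaw as the fourth column; the function name, job IDs, and values in the sample are made up for illustration.

```shell
# Sum the CPUTimeRaw column (4th field, in seconds) of a saved sacct report
# and print the total in CPU hours. Rows whose 4th field is not a plain
# number (the header and the dashed separator) are skipped.
sum_cpu_hours() {
    awk '$4 ~ /^[0-9]+$/ { total += $4 } END { printf "%.2f\n", total / 3600 }' "$1"
}

# Made-up sample in the layout sacct prints for
# -o jobid,ncpus,elapsedraw,CPUTimeRaw (hypothetical job IDs and values).
report=$(mktemp)
cat > "$report" <<'EOF'
JobID             NCPUS ElapsedRaw CPUTimeRaw
------------ ---------- ---------- ----------
1001                  4       3600      14400
1002                  1       1800       1800
1003                  8        900       7200
EOF

sum_cpu_hours "$report"
```

To use it on real data, redirect the sacct command above into a file and pass that file name to sum_cpu_hours.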
How do I check the HPCC node usage?
Users can see this information by running the node_status command on any dev node. We also offer a web-based dashboard at https://icer.msu.edu/dashboard.
Storage and files
Quota
Quota issues writing to research spaces
Many users have reported problems copying or transferring files to their research space. Although the research space still has plenty of free space, they get the following error message:
failed to ... Disk quota exceeded
This problem may occur because your primary group is not set to match the research space, or because the folders you copy or transfer files to have incorrect group ownership or lack the set-group-ID bit. Please read the instructions for using a research space, in particular, point 5.
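The linked instructions are not reproduced here, but a generic Linux check of a folder's group owner and set-group-ID bit might look like the sketch below. The function name is made up for this example, and the demonstration runs on a temporary directory rather than a real research space.

```shell
# Report the group owner of a directory and whether its set-group-ID bit is set.
check_group_setup() {
    dir=$1
    # ls -ld prints the group owner in the fourth field of the long listing
    group=$(ls -ld "$dir" | awk '{ print $4 }')
    if [ -g "$dir" ]; then
        echo "group=$group setgid=yes"
    else
        echo "group=$group setgid=no"
    fi
}

# Demonstration on a throwaway directory
demo=$(mktemp -d)
chmod g-s "$demo"            # make sure the bit starts cleared
check_group_setup "$demo"
chmod g+s "$demo"            # with the bit set, new files inherit the directory's group
check_group_setup "$demo"
```

On a research space you would pass the path of the folder you are writing to instead of the temporary directory.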
Quota/file limit exceeded or general issues related to writing files (especially in home directories)
Begin by checking your storage usage with the quota command. Then compare with the results from running the file-count powertool. More detailed information, including a count of files in each subdirectory, is also available.
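As a generic alternative that does not depend on the powertool, per-subdirectory file counts can be approximated with standard find and wc. This is a sketch with a made-up function name, not the HPCC's own tool, and it may be slow on very large directories.

```shell
# Print "<count> <subdirectory>" for each immediate subdirectory
# (including hidden ones), largest count first.
count_files_per_subdir() {
    top=$1
    find "$top" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r sub; do
        # count regular files anywhere below this subdirectory
        n=$(find "$sub" -type f | wc -l | tr -d ' ')
        echo "$n $sub"
    done | sort -rn
}

# Demonstration on a throwaway directory tree
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b" "$demo/.hidden"
touch "$demo/a/1" "$demo/a/2" "$demo/b/1"
touch "$demo/.hidden/1" "$demo/.hidden/2" "$demo/.hidden/3"
count_files_per_subdir "$demo"
```

Running count_files_per_subdir "$HOME" would list which top-level folders of a home directory hold the most files.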
If you find that you are over quota, please delete files, move them to another location (like a research space or if they are temporary, scratch space), or move them off of the HPCC. If this resolves the issue, then you may consider keeping your files in a different location or asking for more space in your home directory.
If you don't see a change in your quota, the numbers reported by quota and file-count are extremely different, or your quota is showing unrealistic numbers like negative or extremely large file counts, please contact ICER, as this is likely the result of an unresolved issue with one of the HPCC's storage systems.
Sometimes exceeding your quota can stop you from being able to log in or access systems like OnDemand because they require writing a small file. If this is the case, please contact ICER.
Why are my files in the scratch space gone?
Files in scratch are automatically purged if their last changed time is older than 45 days. Note that the scratch spaces are not intended for long-term storage. Files saved in scratch are not backed up.
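To get a rough idea of which files may be close to being purged, you can list files that have not been modified recently. The sketch below uses the modification time as a stand-in; which timestamp the purge actually checks is an assumption here, and the function name is made up for illustration.

```shell
# List regular files under a directory whose modification time is more than
# a given number of days old (default 45).
list_stale_files() {
    dir=$1
    days=${2:-45}
    find "$dir" -type f -mtime +"$days"
}

# Demonstration on a throwaway directory (touch -d is GNU touch syntax)
demo=$(mktemp -d)
touch -d '60 days ago' "$demo/old.txt"
touch "$demo/new.txt"
list_stale_files "$demo"
```

On the HPCC you would point it at your own scratch directory, for example list_stale_files /mnt/scratch/$USER.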
How do I copy files from/to my Microsoft OneDrive/Google Drive?
Rclone is currently installed on the HPCC. This software helps HPCC users sync files and directories between MSU's HPCC and their cloud storage, including OneDrive and Google Drive. Please refer to the Rclone documentation.
What is HPCC's data protection policy?
All of the HPCC's shared storage systems are protected against individual drive and storage node failure (using RAID and highly available, active-active servers).
We maintain an offsite disaster recovery system for users' home and research directories. We do not archive users' scratch spaces nor the persistent 'nodr' space.
Our goal is to maintain hourly snapshots for the last 24 hours and 60 days of file history on the disaster recovery servers. However, when there is a significant amount of data written, there may be a delay in copying updated data to the disaster recovery servers. Users with hard requirements that these goals may not satisfy should consider using MSU's Data Storage Finder.
Users may request older versions of their files via the contact forms.
Does HPCC offer a cheaper long-term archiving plan?
We do not. However, MSU offers the Data Storage Finder (https://data-storage-finder.tech.msu.edu, on-campus only). There are several possible options for data archiving.
Submitting jobs and running code
I have a buyin account. Do I need to specify it when I submit jobs?
No, unless your PI has requested that it be opt-in instead of the default. When you submit a job without specifying an account, your default account is used. You can check your default account using the "buyin_status -l" command; a buyin user's default is their buyin account. We recommend you read this if you have purchased buyin nodes.
Do you support running GPU jobs?
Yes. There are three GPU dev-nodes and a series of compute nodes in the cluster; see Cluster resources.
What does the message "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions" mean after my job is submitted?
Once a job is submitted the scheduler adds it to the calculations and continues to update the status of the job as the system works. The status for a job will reflect the current state of the scheduler, so you will see this message update once the scheduler has found a place to put the job. There are always some nodes which are down or drained in the cluster due to normal maintenance, but the "reserved for jobs in higher priority partitions" is the important part, and simply indicates that the scheduler has not yet found a time to schedule the job. This will update as the scheduler continues to function.
Why did I get an "Illegal Instruction" error?
This is usually because a program was compiled on a newer CPU architecture (e.g., intel18) but then run on an older one (e.g., intel14). Our system has a range of CPUs, and the newest ones support instructions not available on the older CPUs. If you receive this error using one of ICER's software modules, make sure that your SLURM script starts with
#!/bin/bash --login
This will ensure that your script undergoes the proper setup to use the software that matches the CPUs where your jobs are running.
For more advice and troubleshooting steps, see our Lab Notebook on Architecture Specific Compilation.
Why do I get a module: not found error in my SLURM output?
Job scripts that start with #!/bin/sh will result in errors reporting module: not found.
/bin/sh is a symbolic link in modern Linux distributions and does not always link to the same shell. A better method on current Linux distributions is to explicitly call the bash shell with #!/bin/bash --login in your script if needed.
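A minimal job script sketch using the recommended first line is shown below. The resource requests and job name are hypothetical placeholders, and the explicit check for the module command is an illustrative addition rather than something the HPCC requires.

```shell
#!/bin/bash --login
# Hypothetical resource requests; adjust for your own job.
#SBATCH --job-name=example
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

# In a login shell the module command should already be defined.
# Check for it explicitly so a missing setup is reported clearly.
if type module >/dev/null 2>&1; then
    module purge
    status="module command available"
else
    status="module command not found"
fi
echo "$status"
```

If the script prints "module command not found" when run as a job, the first line is the first thing to check.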
Software and modules
I want to install software packages. What should I do?
The HPCC has a lot of software installed already. Search for the software you want using module spider <software name>, then follow the instructions provided by the output to use module load and use the software.
See our documentation on this subject here.
We have additional documentation on the module system here.
If the software is not present, you can submit a ticket. However, we encourage users to install software on their own, if possible. The HPCC has provided numerous versions of compilers and libraries which should accommodate the vast majority of software across different fields.
If you are thinking of requesting the system-wide installation of a piece of software, we strongly recommend you consider the following factors before submitting the request:
- How popular is the software? If it is not popular, are there other users on the HPCC who would also be using it? If you are the only one using it, we recommend that it be installed in your home directory.
- What type of license agreement does the software have? Some software licenses may restrict use even when the software is free. Examples include software with export control, specific end-user license agreements, etc. When software licenses restrict use, we typically recommend that the user make an agreement directly with the software provider to obtain the software and install it in their home directory. If it will be used by a group of people, HPCC system administrators can help with setting up the group access in compliance with the license agreement.
- Is the software well maintained and up-to-date? If the software you wish to install is legacy software or is not being well maintained, chances are its installation will require older versions of its dependencies as well. The effort to install this software may then be greater than the effort required to find up-to-date software with the same, similar, or even better functionality. It may be time to consider transitioning to newer software.
- Is the software available through EasyBuild with an ICER supported toolchain? ICER uses a tool called EasyBuild to install software that is provided via "recipe" files or EasyConfigs. See the entire list of available EasyConfigs. Note that software is installed with a "toolchain" (see our EasyBuild tutorial for more information). ICER only officially supports software installed under the following toolchains or subtoolchains (including gfbf, gompi, iimpi, iimkl):
  - foss/2022b
  - foss/2023a
  - foss/2023b
  - intel/2022b
  - intel/2023a
  - intel/2023b
Why did my "module load" command output errors?
There are many reasons that errors occur when you try loading a module. However, the most common cause is that you have forgotten to run module purge. Sometimes, module spider can also fail to find the module. Most likely it's because your personal module cache is out of date. To clear it, run rm ~/.cache/lmod/spider*.
What should I do when I cannot load modules?
See How to find and load software modules.
The module command is missing in VS Code
Try running source /etc/profile on login. See also this issue on VS Code's GitHub.
What is powertools?
The powertools module is a collection of software tools and examples that allows researchers to better utilize HPC systems. Powertools was created to help advanced users use the HPCC more effectively. To learn more about powertools, run the command powertools.
OnDemand
The OnDemand job composer doesn't work
The OnDemand server is in the process of being upgraded to match the new operating system. Until then, the job composer is not functional. Please contact us for help writing and submitting job scripts in the meantime.
When I use the OnDemand Interactive Desktop, I get the error "The panel encountered a problem while loading 'IndicatorAppletCompleteFactory::IndicatorAppletComplete'. Do you want to delete the applet from your configuration?"
Click the "Delete" option, and this error will not return in the future.
I can't open Firefox from an Interactive Desktop in OnDemand
Run mv ~/.mozilla ~/.mozilla_backup.
Python and Conda
How do I use Python on the HPCC?
There are two methods: users can install their own version of Python with Conda or use the versions of Python installed on the HPCC system. See here.
I have a Python conflict. What should I do to resolve it?
Upon login to a dev-node, a default module list will load automatically. Since Python/3.6.4 is included in the list, it can interfere with a user's conda environment. As a consequence, your program may not be able to find packages installed in your conda environment even if it has been activated. In other words, the program still picks up Python/3.6.4 in the module system. The solution is to run module unload Python before activating the conda environment.
How do I deactivate the Conda base environment?
Many users have reported that after a local installation of Conda on the HPCC, their login prompt changes to something starting with (base) -bash-4.2$. This is because conda activates the default environment, base, upon startup. To disable this behavior, which often results in conflicts with system defaults, users can run the following command:
conda config --set auto_activate_base false
I tried to start a Jupyter Notebook through OnDemand, but my job will not start or will not recognize my Conda environment
All Conda environments used with the Jupyter Notebook OnDemand app must have Jupyter installed. Without this, the OnDemand job status will stay stuck on
Your session is currently starting... Please be patient as this process can take a few minutes.
before moving to
For debugging purposes, this card will be retained for 6 more days
without giving the chance to start the notebook. Depending on the setup, the job may start, but the environment will not be properly recognized and the app will fall back to the default version installed on the HPCC.
To install Jupyter in your Conda environment on the command line, first activate the environment (conda activate <environment name>) and then install Jupyter into it (for example, with conda install jupyter).
I tried to use Python's matplotlib to plot, but got the error "No module named '_tkinter'"
If you use the default Python module (/opt/software/Python/3.6.4-foss-2018a/bin/python) on a dev-node, you need to load the Tkinter module before using Python in order to proceed without errors.
Run: module load Tkinter/3.6.4-Python-3.6.4
R and RStudio Server
When I start RStudio Server, all I see is a gray blank screen.
Open a command line and move your RStudio configuration files to a backup location. Note that this will likely reset your RStudio session, so you may need to reopen previous projects and files, and could lose any unsaved work. See this Lab Notebook for more details.
I can't install the R package "Matrix" or other packages that need it like "ggplot2"
The R package Matrix (which is a dependency for many other packages including ggplot2) is incompatible with versions of R earlier than 4.4.0 (i.e., all of the versions installed on the new operating system). We recommend using the R-bundle-CRAN module instead of R, as it includes a pre-installed version of Matrix. If you need to install it yourself using install.packages, install an older release of Matrix that supports your version of R.
Getting help
Can you keep me posted on the current status of the HPCC?
Yes. Users are encouraged to follow the HPCC Announcements blog to stay up to date on the status of the HPCC (such as scheduled downtimes and urgent notices).
I am looking for help to troubleshoot my problem. How do I share my code/files with you?
We do not access your directories to view files or test your code. Please send your files along with your reply to the ticket email.