Frequently Asked Questions (FAQ)
This page lists many of our frequently asked questions. Please search for keywords related to an issue by using Ctrl+F (on Windows/Linux) or Cmd+F (on Mac), or scroll through the list of questions in the table of contents below.
If you don't see an answer to your question, please contact us.
Table of contents
- Logging in and accessing the HPCC
- Limits and usage
- Storage and files
- Submitting jobs and running code
- Software and modules
- Python and Conda
- Getting help
Logging in and accessing the HPCC
What is my HPCC user name/password?
If you are affiliated with MSU, your MSU NetID is your user name and your NetID password is your HPCC password. These are the same credentials you use for all MSU online services. An HPCC account must be requested by an MSU faculty member at https://contact.icer.msu.edu/account
Can I reset my password on the HPCC because my login got denied after multiple failed attempts?
Entering an incorrect password too many times can get you blocked in two ways. First, authentication on the HPCC is tied directly to MSU, so after too many incorrect password attempts you may need to request a password reset at https://netid.msu.edu/netid/password/index.html. Second, the HPCC itself blocks hosts that make too many failed SSH connection attempts. Users who can log into other MSU resources (Spartan365, D2L, EBS) but are unable to connect to the HPCC should submit a ticket through our contact forms. Be sure to include your external IP address; you can check it with Google.
I used to be able to connect to the HPCC server, but now I can't. Why?
There can be multiple reasons for this, such as system downtime (so please check the ICER blog first). Another common reason is account expiry. The HPCC periodically disables users who are no longer affiliated with the university or registered with a class for which the instructor has created temporary student accounts. To re-activate your HPCC account, please have your PI submit a sponsoring form at https://contact.icer.msu.edu/sponsoredrenewal
I get a "Permission denied" error, but I put in the right password. What's wrong?
If you are attempting to connect to the rsync.hpcc.msu.edu server, this requires an SSH key pair. See our documentation on how to generate a key pair. Otherwise, see the question above.
Can I use the HPCC through a web browser?
Yes, we provide Open OnDemand, a web portal for easy web access to the HPCC. Check out this tutorial.
Limits and usage
Are there any limits per user on using the HPCC resources?
Dev node limits
Each process on a dev-node is limited to 2 CPU hours. If you are running a multi-threaded program, the wall time limit would be (roughly) 2 hours divided by the number of threads.
Limits on storage
- Each user has up to 1 TB of storage for free and 1 million files, for each of the home and research directories. Beyond 1 TB, the cost is $89 per TB per year for MSU users.
- For scratch space (i.e. /mnt/scratch/<your_user_name>), 50 TB is the maximum; more may be requested via the contact forms, subject to center director approval.
Limits on cluster usage
- The longest wall time you can request is 7 days.
- The maximum number of CPU cores you can use at any one time is 1040 (see the SLURM variable QOSMaxCpuPerUserLimit), unless you have a larger buy-in and your PI has requested that your buy-in account only run on the buy-in nodes.
- The maximum number of jobs that can be queued is 1000, with at most 520 running at any one time (except in the scavenger queue).
- Non-buy-in users have a maximum of 500,000 CPU hours per year.
Can you tell me more about the dev-node limit?
When you connect to any of the HPCC's dev-nodes, you will see the following message:
processes on development nodes are limited to two hours of CPU time.
The two hour CPU time limit is for each process you run on that dev-node. If one process uses CPU time greater than 2 hours, then only that process will be killed. You can, however, still connect to that dev-node, and run another process. Additionally, if your process uses 100% CPU (1 core), it will be terminated in two hours. If your process uses 200% CPU (2 cores), it will be terminated in one hour, and so on.
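The arithmetic above can be sketched as a small shell function. This is only an illustration: the function name dev_node_wall_minutes is made up for this example, and it assumes the process uses a steady share of CPU, which real programs rarely do.

```shell
# Approximate wall-clock minutes before a dev-node process reaches the
# 2 CPU-hour limit. The argument is the CPU usage as a percentage, as shown
# by tools such as top (100 = one full core, 200 = two cores, ...).
# dev_node_wall_minutes is a hypothetical helper, not an HPCC command.
dev_node_wall_minutes() {
    local cpu_percent=$1
    # 2 CPU hours = 120 CPU minutes; multiply by 100 because usage is a percentage
    echo $(( 120 * 100 / cpu_percent ))
}

dev_node_wall_minutes 100   # prints 120 (one core: two hours)
dev_node_wall_minutes 200   # prints 60  (two cores: one hour)
dev_node_wall_minutes 400   # prints 30  (four cores: half an hour)
```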
How do I check my CPU or GPU time usage?
Run the command SLURMUsage to see both CPU and GPU usage.
NOTE: This usage does not include time on buy-in nodes.
- If you would like full usage data, including all buy-in usage, you can run sreport for a specific date range: sreport job SizesByAccount Users=$USER start=2023-01-01 end=now -t hour. This report is broken down by job size, and all columns should be summed for the total usage in hours.
- A detailed accounting report can be generated with sacct -X --duplicates -u $USER -S 2023-01-01 -E 2023-04-03 -o jobid,ncpus,elapsedraw,CPUTimeRaw. Save this output to a file and sum the CPUTimeRaw column. CPUTimeRaw is equal to ncpus * elapsedraw and is reported in CPU-seconds, so divide the sum by 3600 to get total CPU hours.
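Summing the CPUTimeRaw column can be sketched with awk. The function name sum_cpu_hours and the sample rows below are made up for illustration; the sketch assumes pipe-delimited input without a header line, which is the layout sacct produces when the -n (--noheader) and -P (--parsable2) options are added.

```shell
# Sum the CPUTimeRaw column (4th field, in CPU-seconds) and print CPU-hours.
# Expects pipe-delimited rows without a header: jobid|ncpus|elapsedraw|CPUTimeRaw
# sum_cpu_hours is a hypothetical helper, not an HPCC or SLURM command.
sum_cpu_hours() {
    awk -F'|' '{ total += $4 } END { printf "%.2f\n", total / 3600 }'
}

# Made-up sample rows standing in for real sacct output
sample='1001|4|3600|14400
1002|1|1800|1800
1003|8|900|7200'

printf '%s\n' "$sample" | sum_cpu_hours   # prints 6.50
```

With real data, the sacct command shown above could be run with -n -P added and its output piped into sum_cpu_hours instead of the sample rows.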
How do I check HPCC node usage?
Users can see this information by running the node_status command on any dev node. We also offer a web-based dashboard at https://icer.msu.edu/dashboard.
Storage and files
Quota
Quota issues writing to research spaces
Many users have reported problems copying or transferring files to their research space. Although their research space still has plenty of space, they still get the following error message:
failed to ... Disk quota exceeded
This problem can occur when your primary group does not match the research space's group, or when the folders you are copying or transferring files into have incorrect group ownership or lack the set-group-ID bit. Please read the instructions for using a research space, in particular, point 5.
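A quick way to inspect these three things from the command line might look like the sketch below. The function name check_group_setup is hypothetical, and the demonstration runs on a temporary directory so that it works anywhere; on the HPCC you would pass a folder inside your research space instead.

```shell
# Print the current primary group, the directory's ownership, and whether
# the set-group-ID bit is set on the directory.
# check_group_setup is a hypothetical helper, not an HPCC command.
check_group_setup() {
    local dir=$1
    echo "primary group: $(id -gn)"
    # ls -ld shows the group owner in the fourth column
    echo "directory: $(ls -ld "$dir")"
    # test -g is true when the set-group-ID bit is set
    if [ -g "$dir" ]; then
        echo "setgid: yes"
    else
        echo "setgid: no"
    fi
}

# Demonstration on a temporary directory with the set-group-ID bit turned on
demo_dir=$(mktemp -d)
chmod g+s "$demo_dir"
check_group_setup "$demo_dir"
```

If the group shown for the directory differs from the research space's group, or setgid is reported as no, the instructions referenced above describe how to correct it.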
Quota/file limit exceeded or general issues related to writing files (especially in home directories)
Begin by checking your storage usage with the quota command.
If you find that you are over quota, please delete files, move them to another location (like a research space or if they are temporary, scratch space), or move them off of the HPCC. If this resolves the issue, then you may consider keeping your files in a different location or asking for more space in your home directory.
If you don't see a change in your quota or your quota is showing unrealistic numbers like negative or extremely large file counts, please contact ICER as this is likely the result of an unresolved issue with one of the HPCC's storage systems.
Sometimes exceeding your quota can stop you from being able to log in or access systems like OnDemand, because they need to write a small file. If this is the case, please contact ICER.
Why are my files in the scratch space gone?
Files in scratch are automatically purged if the last changed time is older than 45 days. Note that the scratch spaces are not intended for long-term storage. Files saved in scratch have no back-up.
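To see which files are approaching the purge age, something like the following sketch could be used. It assumes that "last changed time" corresponds to the ctime that find tests with -ctime; the function name list_old_files and the 38-day threshold (roughly one week of warning) are choices made for this example only.

```shell
# List regular files under a directory whose status-change time (ctime) is
# more than the given number of full days in the past.
# list_old_files is a hypothetical helper, not an HPCC command.
list_old_files() {
    local dir=$1 days=$2
    find "$dir" -type f -ctime +"$days"
}

# Example: files in your scratch space that have not changed for over 38 days.
# The check is skipped when the directory does not exist on this machine.
scratch_dir="/mnt/scratch/${USER:-your_user_name}"
if [ -d "$scratch_dir" ]; then
    list_old_files "$scratch_dir" 38
fi
```

Anything listed that you still need should be copied to your home or research space, since scratch has no back-up.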
How do I copy files from/to my Microsoft OneDrive/Google Drive?
Rclone is installed on the HPCC. It lets HPCC users sync files and directories between MSU's HPCC and their cloud storage, including OneDrive and Google Drive. Please refer to our Rclone documentation.
What is HPCC's data protection policy?
All of the HPCC's shared storage systems are protected against individual drive and storage node failure (using RAID and highly available, active-active servers).
We maintain an offsite disaster recovery system for users' home and research directories. We do not archive users' scratch spaces nor the persistent 'nodr' space.
Our goal is to maintain hourly snapshots for the last 24 hours and 60 days of file history on the disaster recovery servers. However, when a significant amount of data is written, there may be a delay in copying updated data to the disaster recovery servers. Users whose requirements are stricter than this should consider using MSU's Data Storage Finder.
Users may request older versions of their files via the contact forms.
Does HPCC offer a cheaper long-term archiving plan?
We do not. However, MSU offers the Data Storage Finder (https://data-storage-finder.tech.msu.edu, on-campus only). There are several possible options for data archiving.
Submitting jobs and running code
I have a buy-in account. Do I need to specify it when I submit jobs?
No, unless your PI has requested that it be opt-in instead of the default. When you submit a job without specifying an account, your default account is used. You can check your default account using the "buyin_status -l" command; a buy-in user's default is their buy-in account. We recommend reading our buy-in documentation if you have purchased buy-in nodes.
Do you support running GPU jobs?
Yes. There are three GPU dev-nodes and a series of compute nodes in the cluster; see Cluster resources.
What does the message "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions" mean after my job is submitted?
Once a job is submitted the scheduler adds it to the calculations and continues to update the status of the job as the system works. The status for a job will reflect the current state of the scheduler, so you will see this message update once the scheduler has found a place to put the job. There are always some nodes which are down or drained in the cluster due to normal maintenance, but the "reserved for jobs in higher priority partitions" is the important part, and simply indicates that the scheduler has not yet found a time to schedule the job. This will update as the scheduler continues to function.
Why did I get an "Illegal Instruction" error?
This is usually because a program was compiled on a newer CPU architecture (e.g., intel18) but then run on an older one (e.g., intel14). Our system has a range of CPUs, and the newest ones support instructions not available on the older CPUs. One short-term fix is to run programs on the same CPU type that they were compiled on. In our experience, this error has occurred only on intel14 nodes, so you should avoid them. That is, for dev-node testing, pick one of dev-intel16, dev-intel16-k80 and dev-intel18. For job submission, add #SBATCH --constraint="[intel16|intel18]" to your SLURM script.
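A minimal job script using this constraint might look like the sketch below. Only the --constraint line comes from the text above; the time and task values are placeholders chosen for illustration and should be replaced with your own.

```shell
#!/bin/bash
# Placeholder resource requests; adjust these to your job
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
# Only run on intel16 or intel18 nodes, which avoids intel14
#SBATCH --constraint="[intel16|intel18]"

# Record which node the job landed on, so the CPU type can be checked later
node_name=$(uname -n)
echo "Running on ${node_name}"
```

The node name printed in the job output makes it easy to confirm afterwards that the job did not run on an intel14 node.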
Software and modules
I want to install software packages. What should I do?
The HPCC has a lot of software installed already. Search for the software you want using module spider <software name>, then follow the instructions in the output to load it with module load and use the software. See our documentation on this subject, as well as our additional documentation on the module system.
If the software is not present, you can submit a ticket. However, we encourage users to install software on their own, if possible. The HPCC has provided numerous versions of compilers and libraries which should accommodate the vast majority of software across different fields.
If you are thinking of requesting a system-wide installation of a piece of software, we strongly recommend considering the following factors before submitting the request:
- How popular is the software? If it is not popular, are there other users on the HPCC who would also be using it? If you are the only one using it, we recommend installing it in your home directory.
- What type of license agreement does the software have? Some software licenses may restrict use even when the software is free. Examples include software with export control, specific end-user license agreements, etc. When software licenses restrict use, we typically recommend that the user make an agreement directly with the software provider to obtain the software and install it in their home directory. If it will be used by a group of people, HPCC system administrators can help set up group access in compliance with the license agreement.
- Is the software well maintained and up-to-date? If the software you wish to install is legacy software or is not being well maintained, chances are its installation will require older versions of its dependencies as well. The effort to install it may then be greater than the effort required to find up-to-date software with the same, similar, or even better functionality. It may be time to consider transitioning to newer software.
Why did my "module load" command output errors?
There are many reasons that errors occur when you try loading a module. However, the most common cause is that you have forgotten to run module purge. Sometimes, module spider can also fail to find the module. Most likely it's because your personal module cache is out of date. To clear it, run rm -r ~/.lmod.d/.cache.
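The two fixes above can be combined into one short sequence, sketched here. The existence checks are additions for this example so that the snippet does nothing harmful on a machine without a module cache or without the module command; Python is used as the search term only because it is a module mentioned elsewhere on this page.

```shell
# Remove the personal module cache, if there is one
cache_dir="$HOME/.lmod.d/.cache"
if [ -d "$cache_dir" ]; then
    rm -r "$cache_dir"
fi

# Start from a clean module environment and search again.
# This part is skipped where the module command is not available.
if command -v module >/dev/null 2>&1; then
    module purge
    module spider Python
fi
```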
What should I do when I cannot load modules?
See How to find and load software modules.
What is powertools?
The powertools module is a collection of software tools and examples that allows researchers to better utilize HPC systems. Powertools was created to help advanced users use the HPCC more effectively. To learn more about powertools, run the command powertools.
Python and Conda
How do I use Python on the HPCC?
There are two methods: users can install their own version of Python with Anaconda or use the versions of Python installed on the HPCC system. See here.
I have a Python conflict. What should I do to resolve it?
Upon login to a dev-node, a default module list will load automatically. Since Python/3.6.4 is included in the list, it can interfere with a user's conda environment. As a consequence, your program may not be able to find packages installed in your conda environment even if it has been activated. In other words, the program still picks up Python/3.6.4 in the module system. The solution is to run module unload Python before activating the conda environment.
How do I deactivate the Conda base environment?
Many users have reported that after a local installation of Anaconda on the HPCC, their login prompt changes to something starting with (base) -bash-4.2$. This is because conda activates the default environment, base, upon startup. To disable this behavior, which often results in conflicts with system defaults, users can run the following command:
conda config --set auto_activate_base false
I tried to start a Jupyter Notebook through OnDemand, but my job will not start or will not recognize my Conda environment
All Conda environments used with the Jupyter Notebook OnDemand app must have Jupyter installed. Without this, the OnDemand job status will stay stuck on
Your session is currently starting... Please be patient as this process can take a few minutes.
before moving to
For debugging purposes, this card will be retained for 6 more days
without giving you the chance to start the notebook. Depending on the setup, the job may start, but the environment will not be properly recognized and the app will fall back to the default version installed on the HPCC.
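To install Jupyter in your Conda environment on the command line, first activate the environment (for example, with conda activate <your_environment>) and then install Jupyter into it (for example, with conda install jupyter).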
I tried to use python matplotlib to plot, but got an error of "No module named '_tkinter'"
If you use the default python module (/opt/software/Python/3.6.4-foss-2018a/bin/python) on a dev-node, you need to load the Tkinter module before using python in order to proceed without errors.
Run: module load Tkinter/3.6.4-Python-3.6.4
Getting help
Can you keep me posted on the current status of the HPCC?
Yes. Users are encouraged to follow the HPCC Announcements blog to keep updated on the status of HPCC (such as scheduled downtimes and urgent notices).
I am looking for help to troubleshoot my problem. How do I share my code/files with you?
We do not access your directories to view files or test your code. Please attach your files to your reply to the ticket email.