(2023-08-30) Lab Notebooks: Incident Report on System Slowdown due to I/O

Warning

This is a Lab Notebook which describes how to solve a specific problem at a specific time. Please keep this in mind as you read and use the content. Please pay close attention to the date, version information and other details.

Incident report on system slowdown caused by excessive file I/O

On the afternoon of August 28th, 2023, ICER staff received reports of slowdowns across multiple development nodes on HPCC. After investigating the issue, we identified a series of jobs that were saturating the file system's transaction capacity. Cancelling these jobs resolved the slowdown, and further investigation found that previous runs of these jobs correlate with reported slowdown incidents since late July.

Based on our own investigation and on working with the user to correct the behavior of the causative jobs, we have identified a number of factors which we believe contributed to the slowdown. We discuss these below both as an explanation of the incident and as a reference for HPCC users who want to avoid creating jobs with similar issues in the future.

Factors contributing to slowdown on HPCC

1. Multiple I/O operations on small files in the same folder from many nodes

The cause of the slowdown was an excessive number of file input/output (I/O) operations that saturated the file system's transaction capacity, due to jobs frequently recording their state during the run. However, the impact on the file system was magnified by a number of other issues:

Reading and writing many small files simultaneously generally requires more transactions than reading and writing a single large file
Reading and writing to many files in the same directory, and especially to the same file, creates additional overhead on the file system to lock and check files
Reading and writing to the same folder from multiple nodes further increases the overhead
Reading and writing to a folder where disaster recovery is enabled (home directories) increases the overhead yet again

We emphasize the above points to illustrate how different facets of the HPCC can magnify the effect of I/O operations and would make the following recommendations to users developing and using software on our system:

Organizing input and output files for jobs in multiple folders can help reduce impact on the file system
Non-essential/intermediate files can be stored in scratch space which is not backed up in disaster recovery
Writing output in larger chunks to fewer files less frequently reduces the overall number of file transactions
When checkpointing the state of a job, please consider how frequently the job state must be sampled to track significant changes. For example, for systems that converge to a solution, consider dynamically lengthening the interval between saves as the system converges
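
The last two recommendations can be sketched in Python. This is a minimal illustration, not the code used in the jobs involved in the incident: the names BatchedWriter and next_checkpoint_interval, the JSON-lines output format, and all default values (flush_every, tolerance, growth, max_interval) are hypothetical and should be tuned to your own workload.

```python
import json


class BatchedWriter:
    """Buffer records in memory and append them to ONE file in large chunks.

    Hypothetical helper: each flush is a single open/write/close cycle,
    instead of one file transaction per record.
    """

    def __init__(self, path, flush_every=1000):
        self.path = path
        self.flush_every = flush_every  # records to collect before writing
        self.buffer = []
        self.flush_count = 0  # number of times the file was actually written

    def add(self, record):
        self.buffer.append(json.dumps(record))
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with open(self.path, "a") as fh:
            fh.write("\n".join(self.buffer) + "\n")
        self.buffer.clear()
        self.flush_count += 1


def next_checkpoint_interval(interval, change, tolerance=1e-3,
                             growth=2, max_interval=1000):
    """Return the number of steps to wait before the next checkpoint.

    If the state changed by less than `tolerance` since the last checkpoint,
    the interval is multiplied by `growth` (capped at `max_interval`);
    otherwise it is left unchanged.
    """
    if abs(change) < tolerance:
        return min(interval * growth, max_interval)
    return interval
```

With these assumed defaults, 2500 records result in three writes to a single file rather than 2500 separate writes. Remember to call flush() once more before the job exits so that buffered records are not lost.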

2. Running the maximum number of jobs for long periods of time

Users can queue up to 1000 jobs and run up to 520 jobs at one time (except in the scavenger queue). However, running 500+ jobs constantly for a long period of time increases the risk of compounding problems on HPCC. In this case, running many simultaneous jobs contributed both to the overall number of I/O operations and to the overhead of operating out of the same folder for all jobs. Additionally, past cases where we have had to put holds on a user's account for overuse or unwanted behavior have often involved running a large number of jobs with a looping process in each job.

We do not want to discourage users from making use of HPCC resources to their fullest, but we have the following recommendations if you plan to queue a large number of jobs such that you will have the maximum number of jobs running for a long period:

Test and benchmark a single run of your jobs before queuing hundreds of instances. Proper testing and benchmarking will help you accurately estimate the resources you need per instance and avoid submitting hundreds of jobs to the queue only to have them fail.
Do some arithmetic for all your jobs: How many files will 1000 instances of this job take? How much disk space will the output occupy? How much of the yearly allotment of your CPU/GPU hours will these jobs use?
Consider setting the working directory independently for each job to reduce the overhead on the file system. Alternatively, you can copy your input data to /mnt/local within a job, which will store it in a temporary folder on the compute node (NOTE: since this is a temporary folder, be careful to COPY, not MOVE, as files here will not be preserved after the job completes)
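
The arithmetic and the per-job working directory suggested above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the per-job figures in the usage note are made up, and reading the job ID from the SLURM_JOB_ID environment variable assumes the job runs under a Slurm scheduler.

```python
import os
import shutil
from pathlib import Path


def estimate_footprint(n_jobs, files_per_job, mb_per_job, cpu_hours_per_job):
    """Scale the measurements from ONE benchmarked run up to n_jobs runs."""
    return {
        "total_files": n_jobs * files_per_job,
        "total_disk_gb": n_jobs * mb_per_job / 1024,
        "total_cpu_hours": n_jobs * cpu_hours_per_job,
    }


def job_workdir(base, job_id=None):
    """Create and return a working directory used by this job only.

    Assumption: under Slurm the job ID is available as SLURM_JOB_ID;
    outside a job the placeholder "local" is used instead.
    """
    if job_id is None:
        job_id = os.environ.get("SLURM_JOB_ID", "local")
    workdir = Path(base) / f"job_{job_id}"
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


def stage_input(src, local_dir):
    """COPY (never move) an input file into node-local temporary storage."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 leaves the original file in place.
    return Path(shutil.copy2(src, local_dir))
```

For example, if a single benchmarked run writes 20 files totalling 512 MB and uses 4 CPU hours, estimate_footprint(1000, 20, 512, 4) reports 20000 files, 500 GB of disk and 4000 CPU hours for 1000 instances. Inside a job, stage_input could be pointed at a directory under /mnt/local; results that must survive the job still need to be copied back before the job ends.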