
Why Slurm’s Networking Needs Improvement

Historically, Slurm was deployed on clusters run by experts at every level, where a misbehaving component was a single strike of the admin banhammer away from being on its best behavior. One of the most common problems for Slurm was an unbounded `for()` loop calling `squeue`, which could slow down Slurm for the entire cluster. Slurm no longer runs only on clusters small enough for admins to chase down every poorly behaved script or workload manager, especially on transient clusters in the cloud, or in a federation of clusters where the offending node may be long gone before any admin ever sees an alert. As the numbers of nodes, users, and requests scale up, the Slurm controllers start to hit performance limits.

Changing Slurm’s Threading Design

Slurm’s prior communications model handled each incoming request by creating a new thread, which then serviced that request for the duration of its life. With one thread per connection, a thread blocks on every I/O operation while holding all of its resources, so the kernel must context-switch another thread onto the hardware before any other work can proceed. In Linux, creating a new thread is an expensive operation that only becomes more expensive as a process uses more memory, adding latency and leading to massive numbers of threads relative to the available number of hardware threads. For slurmctld on a larger cluster, the kernel spent a substantial amount of time locking the process and copying memory, and that cost scaled with the number of simultaneous requests.
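
As a rough illustration of that model (not Slurm source code), the sketch below shows a thread-per-connection accept loop in C; the port number and the `handle_rpc()` handler are placeholders:

```c
/* Sketch of a thread-per-connection server; illustrative only. */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void handle_rpc(int fd)
{
	char buf[4096];

	/* Blocking read: this thread sleeps holding its stack and kernel
	 * resources until data arrives, forcing a context switch before
	 * any other work can run on this hardware thread. */
	ssize_t n = read(fd, buf, sizeof(buf));
	(void) n;	/* request parsing and reply omitted */
}

static void *conn_thread(void *arg)
{
	int fd = (int) (intptr_t) arg;

	handle_rpc(fd);
	close(fd);
	return NULL;
}

int main(void)
{
	int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(6817),	/* placeholder port */
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};

	bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr));
	listen(listen_fd, 128);

	while (1) {
		int fd = accept(listen_fd, NULL, NULL);
		pthread_t tid;

		/* One pthread_create() per connection: the kernel must set
		 * up a new thread every time, and that only gets slower as
		 * the process grows and the connection rate climbs. */
		pthread_create(&tid, NULL, conn_thread,
			       (void *) (intptr_t) fd);
		pthread_detach(tid);
	}
}
```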

One part of the conversion was the switch to a pre-allocated thread pool: all threads are created once at startup and never exit for the life of the process. When a new connection is accepted, its handler is run by one of these existing threads, skipping the process-wide kernel freeze of allocating a new thread. Under the old model, with poorly behaved clients, the overhead of handling each new connection (and thread), which might not even result in a completed RPC transaction, could easily overwhelm the slurmctld daemon.
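
A minimal sketch of such a pool follows; the names (`WORKER_COUNT`, `handle_rpc()`, the fixed-size queue) are illustrative and not taken from the Slurm source. The accept loop only enqueues file descriptors, and the workers created at startup do the rest:

```c
/* Sketch of a pre-allocated thread pool fed by a queue of accepted
 * sockets; illustrative only. */
#include <pthread.h>
#include <unistd.h>

#define WORKER_COUNT 8
#define QUEUE_DEPTH 1024

static int queue[QUEUE_DEPTH];
static int q_head, q_tail, q_len;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;

static void handle_rpc(int fd)
{
	(void) fd;	/* request parsing and reply omitted */
}

/* Producer side: the accept() loop only enqueues the new file descriptor,
 * so no thread is created per connection. */
void enqueue_connection(int fd)
{
	pthread_mutex_lock(&q_lock);
	if (q_len < QUEUE_DEPTH) {
		queue[q_tail] = fd;
		q_tail = (q_tail + 1) % QUEUE_DEPTH;
		q_len++;
		pthread_cond_signal(&q_nonempty);
	} else {
		close(fd);	/* shed load instead of growing without bound */
	}
	pthread_mutex_unlock(&q_lock);
}

/* Worker threads are created once at startup and never exit. */
static void *worker(void *arg)
{
	(void) arg;
	while (1) {
		pthread_mutex_lock(&q_lock);
		while (!q_len)
			pthread_cond_wait(&q_nonempty, &q_lock);
		int fd = queue[q_head];
		q_head = (q_head + 1) % QUEUE_DEPTH;
		q_len--;
		pthread_mutex_unlock(&q_lock);

		handle_rpc(fd);
		close(fd);
	}
	return NULL;
}

void start_workers(void)
{
	for (int i = 0; i < WORKER_COUNT; i++) {
		pthread_t tid;

		pthread_create(&tid, NULL, worker, NULL);
		pthread_detach(tid);
	}
}
```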

Moving to Asynchronous Input/Output (I/O)

Several years ago, slurmrestd was added to Slurm to allow users to control Slurm without using a command-line interface (CLI). It was designed for the web world, where every client is poorly behaved, sitting on a distant connection with non-trivial packet loss. slurmrestd was built to follow a newer model: a constant thread pool that services connections entirely asynchronously. Since any thread can process any connection, the handling remains fully asynchronous. This model trades some code complexity, latency while waiting for a thread, and memory for buffering in exchange for processing far more connections than the process’s maximum thread count. One of the lessons learned as Slurm has scaled is that thread counts have a sweet spot: up to it, more processing gets done; beyond it, lock contention erodes and eventually obliterates all of the benefits.
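
One common way to get that shape on Linux, sketched here purely for illustration (this is not the conmgr implementation), is a shared epoll instance serviced by a fixed pool of workers using one-shot events, so whichever worker wakes up handles whichever connection became ready:

```c
/* Sketch of a shared epoll set serviced by a fixed worker pool, so any
 * thread may handle any connection; illustrative only. */
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/epoll.h>

#define WORKER_COUNT 4

static int epoll_fd;

static void process_ready(int fd)
{
	char buf[4096];
	ssize_t n;

	/* Non-blocking reads: consume what is available, then return to
	 * epoll_wait() instead of sleeping on the socket. */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;	/* feed bytes to the request parser (omitted) */

	if (n == 0 || (errno != EAGAIN && errno != EWOULDBLOCK)) {
		close(fd);	/* peer closed, or a real error */
		return;
	}

	/* Re-arm the one-shot watch so any worker can service the next
	 * readiness event for this connection. */
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLONESHOT,
		.data.fd = fd,
	};
	epoll_ctl(epoll_fd, EPOLL_CTL_MOD, fd, &ev);
}

static void *worker(void *arg)
{
	(void) arg;
	while (1) {
		struct epoll_event ev;

		if (epoll_wait(epoll_fd, &ev, 1, -1) == 1)
			process_ready(ev.data.fd);
	}
	return NULL;
}

/* Called after accept(): make the socket non-blocking and watch it. */
void watch_connection(int fd)
{
	struct epoll_event ev = {
		.events = EPOLLIN | EPOLLONESHOT,
		.data.fd = fd,
	};

	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
	epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);
}

void start_event_workers(void)
{
	epoll_fd = epoll_create1(0);
	for (int i = 0; i < WORKER_COUNT; i++) {
		pthread_t tid;

		pthread_create(&tid, NULL, worker, NULL);
		pthread_detach(tid);
	}
}
```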

The fully asynchronous change means that any part of Slurm converted to conmgr never waits for I/O. This reduces latency and thread count, and prevents unnecessary blocking of other operations.
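
As an example of what “never waits for I/O” means in practice, consider the write path: instead of blocking in write() until the peer drains the socket, leftover bytes can be queued in a per-connection buffer and flushed later, when the event loop reports the socket writable. The sketch below uses hypothetical names (`struct out_buf`, `send_reply()`) to show that pattern:

```c
/* Sketch of a non-blocking write with an output buffer; illustrative only. */
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct out_buf {
	char *data;	/* bytes still waiting to be sent */
	size_t len;
};

void send_reply(int fd, struct out_buf *ob, const char *msg, size_t len)
{
	ssize_t n = 0;

	/* Only try an immediate write if nothing is already queued, or the
	 * reply bytes would go out of order. */
	if (!ob->len) {
		n = write(fd, msg, len);
		if (n < 0) {
			if (errno != EAGAIN && errno != EWOULDBLOCK)
				return;	/* real error: caller closes the fd */
			n = 0;
		}
	}

	if ((size_t) n < len) {
		/* Short write: stash the remainder and let the event loop
		 * flush it when the socket becomes writable (EPOLLOUT),
		 * rather than blocking this thread here. */
		size_t rem = len - (size_t) n;
		char *p = realloc(ob->data, ob->len + rem);

		if (!p)
			return;
		memcpy(p + ob->len, msg + n, rem);
		ob->data = p;
		ob->len += rem;
	}
}
```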

Changes in Slurm-24.11 for conmgr

In Slurm-24.11, considerable work was done to prepare slurmctld for more intensive workloads and larger clusters driven by increased user and node counts. One of the projects in this effort was to begin converting slurmctld to conmgr, with more conversions in development for future versions. The conmgr subsystem, originally written for slurmrestd, allows Slurm to handle asynchronous communications efficiently, providing exceptional performance for the most demanding HPC workloads.