Slurm 24.11 is now available!
This version builds on SchedMD’s new six-month release cycle, intended to deliver updates and features more quickly, making it easier for users to align upgrades with scheduled maintenance windows.
24.11 builds on Slurm's best features, adding enhancements to scalability, security, and resource management. Some key features include:
Improved I/O and RPC Handling
A standout upgrade in Slurm 24.11 is the revamped handling of remote procedure calls (RPCs). The slurmctld daemon's I/O processing now uses the new "conmgr" connection manager, which replaces the traditional thread-per-RPC model with a pool of worker threads. The result is better performance, improved scalability, and greater reliability under heavy workloads.
Simplified Resource and Accounting Policies
Slurm 24.11 introduces new Quality of Service (QOS) options to allow administrators to set relative limits as a percentage of total cluster capacity rather than absolute numbers, offering a more dynamic and flexible way to manage workloads. Coupled with updates to Slurm accounting, users gain clearer insights into job performance and resource utilization.
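As a rough sketch of how a relative QOS might be defined with sacctmgr: the flag name and percentage interpretation below are assumptions based on the description above, so consult the sacctmgr man page for your version before using them.

```shell
# Hypothetical example: create a QOS whose group limits are interpreted
# as a percentage of total cluster capacity rather than absolute counts.
# "Flags=Relative" and the values shown are illustrative assumptions.
sacctmgr add qos interactive Flags=Relative GrpTRES=cpu=25,mem=25

# Attach the QOS to an account in the usual way.
sacctmgr modify account research set qos+=interactive
```

Because the limits scale with the cluster, they do not need to be recalculated when nodes are added or removed.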
Slurm 24.11 isn't just an upgrade; it's another step in SchedMD's ongoing work to help users orchestrate complex workloads with efficiency and security. Whether through improved concurrency, secure communication protocols, or refined GPU handling, this release ensures that clusters – big and small – can scale with confidence!
New GPU Plugin
With GPU powered workloads continuing to dominate HPC environments, Slurm 24.11 introduces the gpu/nvidia plugin. This plugin does not rely on any NVIDIA libraries and will build, by default, on all systems. It supports basic GPU detection and management, but cannot currently identify GPU-to-GPU links or provide usage data, as these are not exposed by the kernel driver.
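A minimal configuration sketch for the new plugin might look like the fragment below; the node names and GPU counts are illustrative, and the AutoDetect value is an assumption based on the plugin name.

```shell
# slurm.conf -- illustrative fragment
GresTypes=gpu
NodeName=node[01-04] Gres=gpu:4

# gres.conf -- ask Slurm to detect GPUs through the kernel driver
# rather than through NVIDIA's libraries (value assumed from the
# plugin name; check the gres.conf man page for your release).
AutoDetect=nvidia
```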
Enhanced Job Scheduling
Slurm 24.11 introduces the ability to submit jobs against multiple QOSes, offering users greater flexibility for managing diverse workloads. Allowing jobs to qualify for different QOS levels means administrators can optimize resource allocation and maximize cluster efficiency. Additionally, a new experimental "oracle" backfill scheduling algorithm delays jobs when the oracle function determines that reduced fragmentation of the network topology is advantageous. Together these features provide users with smarter and more adaptable scheduling options.
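Multi-QOS submission can be sketched with the familiar submission commands; the comma-separated form below is an assumption about how the feature is exposed, so verify against the sbatch man page for 24.11.

```shell
# Submit a job that is eligible to run under any of the listed QOSes;
# the scheduler picks whichever one lets the job start.
sbatch --qos=normal,standby --wrap="hostname"

# srun and salloc would accept the same list syntax.
srun --qos=high,normal hostname
```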
Improved Command Syntax and Tools
The introduction of a "hostlist function" syntax simplifies both management commands and configuration file edits, making routine tasks more efficient for administrators. Slurm 24.11 also adds the "scontrol listjobs" and "liststeps" commands. These complement the existing "listpids" tool, providing detailed job and step data in user-friendly JSON and YAML formats. Together, these improvements streamline cluster operations and enhance transparency for both administrators and users.
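The new listing commands might be used along these lines; the --json and --yaml flags follow scontrol's existing output conventions, and the job ID is a placeholder.

```shell
# List current jobs with detailed output rendered as JSON.
scontrol listjobs --json

# List the steps of one job (ID is a placeholder) as YAML.
scontrol liststeps 12345 --yaml
```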
Get Started with 24.11
Slurm 24.11 balances performance with flexibility. For the full list of features included in this release, view the Release Announcement. Slurm documentation has been updated to 24.11 and older versions can now be found in the archive. Slurm downloads are available here.