Efficient Management of AI Workloads on Supercomputers

The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, based on the NVIDIA Blackwell architecture, are rack-scale supercomputers. They are designed with 18 tightly coupled compute trays, massive GPU fabrics, and high-bandwidth networking packaged as a unit. For AI architects and HPC platform operators, the challenge is not just racking and stacking hardware but transforming infrastructure into safe, high-performance, and user-friendly resources for end users. Much of the operational complexity comes from the mismatch between rack-scale hardware topology and scheduler abstractions. Left unaddressed, this mismatch leaves schedulers operating on a flat pool of GPUs and nodes, blind to the system's hierarchical, topology-sensitive design.

This gap is where a validated software stack like NVIDIA Mission Control comes into play. Mission Control provides rack-scale control planes for NVIDIA Grace Blackwell NVL72 systems. With a native understanding of NVIDIA NVLink and NVIDIA IMEX domains, it integrates with workload management platforms like Slurm and NVIDIA Run:ai. These capabilities will also be supported for the NVIDIA Vera Rubin platform, including NVIDIA Rubin NVL8. This article demonstrates how Mission Control, Slurm, and NVIDIA Run:ai turn advanced GPU architecture concepts, such as NVLink and IMEX domains, into an operational AI factory that is scalable, schedulable, and easy to manage.

The core challenge lies in how rack-scale topology meets AI workload scheduling. At a physical level, the GB300 NVL72 and GB200 NVL72 are powerful, sophisticated machines. Each delivers a dense GPU fabric connected by NVLink switches, supports NVIDIA Multi-Node NVLink (MNNVL) within the rack, and includes IMEX-capable compute trays that enable shared GPU memory across nodes. However, schedulers do not operate at the level of switches and fabrics. They require predictable allocation of discrete GPU resource pools, clear isolation boundaries to protect workloads from one another, and consistent performance characteristics that match user expectations.

Under the hood, the NVLink topology of a Grace Blackwell NVL72 rack is reflected up the software stack through a pair of system-level identifiers: cluster UUID and clique ID. These identifiers encode a GPU's position in the NVLink fabric—across domains or racks—in a way that system software, schedulers, and higher-level tools can reason about. The mapping is straightforward: Cluster UUID corresponds to the NVLink domain, while Clique ID corresponds to the NVLink partition. A shared cluster UUID means that systems—and their GPUs—belong to the same NVLink domain and are connected by a common NVLink fabric. On Grace Blackwell NVL72, this UUID is consistent across the entire rack: all GPUs in the same NVL72 rack report the same cluster UUID.

The clique ID provides a finer-grained distinction. GPUs that share a clique ID belong to the same NVLink partition within that domain. When a rack is carved into multiple NVLink partitions, the cluster UUID remains the same—because the GPUs live in the same physical NVLink domain—but the clique IDs differ to reflect the logical partitioning of the fabric. From an operational perspective, this distinction matters: the cluster UUID answers which GPUs physically share a rack and are capable of NVLink communication, while the clique ID answers which GPUs share an NVLink partition and are intended to communicate together for a given workload or service tier.
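The pairing rule above—same cluster UUID plus same clique ID means same NVLink partition—can be sketched as a small grouping function. This is an illustrative sketch only: the sample text and the exact field names (`ClusterUUID`, `CliqueId`, as reported in `nvidia-smi -q` fabric output) are assumptions and may vary by driver version.

```python
from collections import defaultdict

# Hypothetical excerpt of per-GPU fabric output; field names and values
# are illustrative assumptions, not captured from a real system.
SAMPLE = """\
GPU 00000000:01:00.0
    ClusterUUID : 7b0d0e9c-aaaa-bbbb-cccc-000000000001
    CliqueId    : 1
GPU 00000000:02:00.0
    ClusterUUID : 7b0d0e9c-aaaa-bbbb-cccc-000000000001
    CliqueId    : 2
GPU 00000000:03:00.0
    ClusterUUID : 7b0d0e9c-aaaa-bbbb-cccc-000000000001
    CliqueId    : 1
"""

def group_by_partition(text):
    """Group GPU bus IDs by (cluster UUID, clique ID).

    GPUs that share both identifiers sit in the same NVLink partition;
    a shared cluster UUID alone only means they share the rack's fabric.
    """
    groups = defaultdict(list)
    gpu = uuid = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("GPU "):
            gpu = line.split()[1]
        elif line.startswith("ClusterUUID"):
            uuid = line.split(":", 1)[1].strip()
        elif line.startswith("CliqueId"):
            clique = line.split(":", 1)[1].strip()
            groups[(uuid, clique)].append(gpu)
    return dict(groups)
```

Running `group_by_partition(SAMPLE)` on the sample yields two groups: all three GPUs share one cluster UUID (one rack), but the GPU with clique ID 2 lands in a different NVLink partition from the other two.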

These identifiers form the connective tissue between hardware topology and scheduling logic. They enable platforms like Slurm, Kubernetes, and NVIDIA Run:ai to align job placement, isolation, and performance guarantees with the actual structure of the NVLink fabric without exposing that complexity directly to end users. As you start running multi-node workloads on Blackwell-based NVL72 systems, placement becomes as important as GPU count. A 16-GPU job spread across the wrong nodes can behave very differently from the same job confined to a single NVLink fabric. This is where Slurm's topology/block plugin becomes essential, enabling Slurm to recognize that not all nodes are equal.
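As a minimal sketch of that idea, the topology/block plugin can be configured so that each NVL72 rack maps to one block, keeping multi-node jobs inside a single NVLink fabric where possible. The node names below are hypothetical; consult the Slurm topology documentation for the exact syntax supported by your Slurm version.

```
# slurm.conf -- enable block-aware placement
TopologyPlugin=topology/block

# topology.conf -- one block per NVL72 rack (18 compute trays each)
BlockName=rack1 Nodes=gb-rack1-[01-18]
BlockName=rack2 Nodes=gb-rack2-[01-18]
BlockSizes=18,36
```

With this layout, Slurm prefers to pack a job into one block (one rack) before spanning blocks, so a 16-GPU job lands on nodes that share the same NVLink fabric rather than being scattered across racks.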
