NVIDIA H100 instance profiles are now available on IBM Cloud, according to a recent announcement from IBM. These instances are built and tuned to handle a variety of AI tasks, including training, fine-tuning, and large model inferencing. However, as adoption of artificial intelligence (AI) grows, so does the need to scale that infrastructure across multiple nodes.
To facilitate multi-node scaling, IBM is launching the Cluster Network service, a new top-level service. IBM’s Vela infrastructure serves as its foundation: after optimising the network for internal workloads, IBM took the steps needed to externalise this capability and make it accessible to anyone wishing to scale out their AI workloads.
What is the AI Cluster Network?

The first network is the “Cloud Network”. This is the typical, full-featured network that offers everything IBM Cloud provides, including Transit Gateways, Security Groups, and Network Access Control Lists. It also gives you access to all of the infrastructure required to serve your workloads, including Cloud Object Storage, File Storage, Block Storage, and more. You can access the IBM Cloud infrastructure efficiently over the Cloud Network.
The second network sits inside the NVIDIA H100-based server itself, where the GPUs connect to one another extremely quickly. The NVIDIA NVLink fabric offers a fast, point-to-point connection between the GPUs and extends straight into the virtual machine. This fabric is what IBM calls the “native accelerator fabric.”
In addition, every NVIDIA H100 Tensor Core GPU is paired with an NVMe SSD and a high-speed NIC on its own PCIe bus within the server.
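A quick way to see this pairing from inside an instance is the GPU topology matrix. The minimal sketch below assumes the NVIDIA driver and the nvidia-smi utility are installed on the guest:

```python
import subprocess

# Print the PCIe/NVLink topology matrix: GPUs linked over NVLink (NV*) and the
# NICs/NVMe devices that share a PCIe switch with each GPU (PIX/PXB in the legend).
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)
```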
This leads us to the last network: the dedicated Cluster Network, which provides high-speed channels for GPU-to-GPU communication across servers. Scaling out over this network is essential for AI training and fine-tuning workloads, as the nodes can communicate directly with one another over this dedicated set of channels.
One of the main tenets of IBM’s cluster networking architecture is the ability to integrate multiple backend cluster networks for different solutions. The requirements of other cluster networks may differ significantly from those of an NVIDIA H100 system. By keeping the cluster network as a separate abstraction, IBM Cloud preserves the ability to implement high-speed, fit-for-purpose networks for these workloads while retaining the comprehensive feature set available through the cloud NIC.
NVIDIA H100 Cluster Network design principles
AI and cloud networks have fundamentally different traffic patterns. AI traffic is often point-to-point, high-bandwidth, and low-entropy, and AI training workloads follow micro-stampede traffic patterns. Ensuring that performance is achieved is the top objective.
IBM found that, while performance is crucial, redundancy and resilience in the AI network are just as important. In the event of a cluster NIC link failure, it wanted the workload to slow down rather than return to its previous checkpoint and lose the work.
Delivering on performance
Cluster-enabled NVIDIA H100 servers are equipped with eight dedicated 400 Gbps NVIDIA ConnectX-7 NICs, for a total network throughput of 3.2 Tbps. That network supports NVIDIA NCCL-based workloads and is fine-tuned for RoCE v2 and the NVIDIA MLNX_OFED drivers. This is eight times the throughput of the initial internal Vela supercomputer. With RoCE GDR and four or more queue pairs per NIC, eight NICs between two servers can reliably achieve 3.1 Tbps in aggregate throughput, 97% of the theoretical maximum of 3.2 Tbps, as determined by NVIDIA’s perftest bandwidth test.
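That aggregate figure can be reproduced in spirit with NVIDIA’s perftest tools. The sketch below assumes perftest is installed, that matching ib_write_bw server processes are already listening on the peer, and that the cluster NICs appear as mlx5_* RDMA devices; the device names, peer address, and ports are illustrative assumptions:

```python
import subprocess

PEER = "10.0.0.2"                          # assumed address of the peer server
DEVICES = [f"mlx5_{i}" for i in range(8)]  # assumed RDMA device names for the 8 cluster NICs

procs = []
for i, dev in enumerate(DEVICES):
    # -q 4: four queue pairs per NIC, --report_gbits: report in Gb/s,
    # -p: a distinct TCP port per pair so all eight tests can run concurrently.
    procs.append(subprocess.Popen(
        ["ib_write_bw", "-d", dev, "-q", "4", "--report_gbits",
         "-p", str(18515 + i), PEER]))

for p in procs:
    p.wait()
# Summing the per-NIC bandwidth reports should approach the ~3.1 Tbps aggregate figure.
```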
Combining the accelerator, NVMe, and NIC onto a single PCIe link makes NVIDIA GPUDirect and related technologies available to users. With this technology, the GPU can talk to the cluster NIC or NVMe directly, bypassing the CPU, and the GPU and the NIC have full bi-directional bandwidth between them.
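In practice, GPUDirect RDMA is usually exercised through NCCL. The following minimal sketch shows how a user might steer NCCL toward the cluster NICs and GPUDirect, assuming a PyTorch + NCCL stack; the environment variable values are illustrative, not mandated by IBM:

```python
import os

# Restrict NCCL to the RDMA-capable ConnectX devices (device name prefix is an assumption).
os.environ["NCCL_IB_HCA"] = "mlx5"
# Allow GPUDirect RDMA between a GPU and a NIC that sit on the same PCIe switch.
os.environ["NCCL_NET_GDR_LEVEL"] = "PIX"
# Print NCCL's chosen transports so GPUDirect usage can be verified in the logs.
os.environ["NCCL_DEBUG"] = "INFO"
```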
IBM has achieved near line rate on many workloads on the physical backend. IBM Cloud, IBM Research, and partners collaborated closely to fine-tune and optimise the isolation models, switch buffers, and other components.
They have also implemented a scheme that logically isolates traffic flows to greatly reduce ECMP hash collisions. Each link is tuned to send traffic to a particular peer at its destination, giving the cloud backend a “rail-like” design.
Delivering on resilience
AI workloads are quite sensitive to failure. If a component in the cluster fails, the workload tends to roll back to its last checkpoint. Even as checkpoint restarts become more automated and less frequent, work is still lost, and how much depends on how aggressively users define their checkpoint intervals.
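A rough way to reason about that trade-off is to weigh checkpoint overhead against expected rework after a failure. The sketch below is purely illustrative, with made-up failure rates and checkpoint costs, not figures from IBM:

```python
def expected_cost_per_hour(checkpoint_interval_min: float,
                           checkpoint_write_min: float,
                           failures_per_hour: float) -> float:
    """Rough expected minutes lost per hour: checkpoint write overhead plus,
    on failure, an average of half an interval of recomputed work."""
    overhead = (60 / checkpoint_interval_min) * checkpoint_write_min
    rework = failures_per_hour * (checkpoint_interval_min / 2)
    return overhead + rework

# Illustrative comparison of an aggressive vs. a relaxed checkpoint interval.
for interval in (15, 60):
    print(interval, "min interval ->",
          round(expected_cost_per_hour(interval, 2.0, 0.05), 2), "min/hour lost")
```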

IBM’s objective is to lessen the impact of operational problems such as link failures, and a Software-Defined Network (SDN) solution lets it provide a more robust cluster network. Each of the server’s dual-port NVIDIA ConnectX-7 NICs exposes two 200 Gbps ports.
In the NVIDIA H100 instance, the SDN layer combines those two ports into a single 400 Gbps virtual function (VF). An NVIDIA H100 cluster network can be configured with 1×400 Gbps, 2×200 Gbps, or 4×100 Gbps cluster NICs per GPU. Regardless of the customer’s arrangement, the underlying traffic is spread across the two physical links, so in the event of a link problem, traffic within the cluster slows down rather than failing.
IBM applied this idea to the backend network as well. If a link between two switches breaks, the logical rail architecture reroutes traffic appropriately. When a link problem arises, traffic may slow down because of the reduced bandwidth capacity rather than failing, and no slowdown is expected if the NIC’s traffic does not exceed 200 Gbps.
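That slowdown behaviour follows directly from the dual-port layout. The small calculation below is an illustration of the arithmetic, not an IBM tool:

```python
def degraded_capacity_gbps(ports_up: int, port_gbps: int = 200) -> int:
    """Capacity of one 400 Gbps VF built from two 200 Gbps physical ports."""
    return ports_up * port_gbps

for offered in (150, 300):                     # Gbps of traffic pushed through this VF
    cap = degraded_capacity_gbps(ports_up=1)   # one of the two physical links has failed
    print(f"offered {offered} Gbps, capacity {cap} Gbps ->",
          "no slowdown" if offered <= cap else "throttled to remaining capacity")
```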
The spine-leaf architecture used by the NVIDIA H100 network is designed for resilience at both layers: network path failover happens whether the problem is at the spine or the leaf.
To achieve this redundancy while maintaining performance standards, IBM had to take a particular approach in its aggregation layer.
Each dual-port NVIDIA ConnectX-7 NIC (eight per server) connects to a pair of leaf switches. Each leaf switch connects to a group of aggregation switches, and a Virtual Rail is established within every aggregation switch. This keeps the send and receive sides of the queue pairs balanced, which in testing significantly improves performance compared with a conventional ECMP model.
IBM also uses a technique called Virtual Rail Redundancy: every rail is configured with an optimised failover path to another rail in the event of a link loss.
The leaf switches also employ custom algorithms to balance traffic up to the aggregation switches, improving the distribution of flows across the aggregation links. When a particular leaf-to-aggregation link is determined to be congested, traffic is dynamically redistributed and the affected flow rebalances onto an open link.
Keeping it simple
IBM wanted the experience to be as easy as possible, so it developed a new Cluster Network service to support this specialised network. The service is purposefully minimal: it can create cluster network subnets, assign IPs, and attach to the instances it supports.
Its three main areas of focus are isolation, performance, and resilience. The cloud NIC provides access to the wider capabilities, while the cluster NICs provided by this new service deliver the performance.
To use cluster networking on their NVIDIA H100 instances, users must provision a minimum of eight cluster NICs. This helps ensure that traffic is properly distributed across the underlying physical infrastructure. Users can deploy 8, 16, or 32 cluster NICs if they wish to raise the entropy on the backend, as the table and sketch below show.
Cluster NIC Count | Cluster NICs per GPU | Effective Bandwidth per Cluster NIC | Effective Bandwidth per GPU |
8 | 1 | 400 Gbps | 400 Gbps |
16 | 2 | 200 Gbps | 400 Gbps |
32 | 4 | 100 Gbps | 400 Gbps |
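The relationship in the table is simply a fixed 400 Gbps per GPU split across however many cluster NICs are requested. A small sketch of that arithmetic:

```python
GPUS = 8
PER_GPU_GBPS = 400
ALLOWED_COUNTS = (8, 16, 32)

def per_nic_bandwidth(total_cluster_nics: int) -> tuple[int, int]:
    """Return (cluster NICs per GPU, effective Gbps per cluster NIC)."""
    if total_cluster_nics not in ALLOWED_COUNTS:
        raise ValueError("NVIDIA H100 instances accept 8, 16, or 32 cluster NICs")
    per_gpu = total_cluster_nics // GPUS
    return per_gpu, PER_GPU_GBPS // per_gpu

for count in ALLOWED_COUNTS:
    print(count, "cluster NICs ->", per_nic_bandwidth(count))
```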
Constructing eight cluster NICs by hand is tedious enough, and it would be quite daunting for power users who want to create even more. IBM designed its user interface to make this process easier.
The first step is to create the user’s cluster network. The user must then create a set of Cluster Network Subnets within the Cluster Network. Because these subnets are usually nearly identical, the UI generates them on the user’s behalf; those who want more control can also configure the subnets manually.
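The same flow can in principle be scripted against the IBM Cloud VPC API. The sketch below is hypothetical: the endpoint paths, API version parameters, and payload fields are assumptions inferred from the resource names above (cluster network, cluster network subnet), so the IBM Cloud VPC API reference should be consulted for the authoritative shapes:

```python
import requests

BASE = "https://us-south.iaas.cloud.ibm.com/v1"        # assumed regional VPC endpoint
PARAMS = {"version": "2024-11-12", "generation": "2"}   # assumed API version parameters
HEADERS = {"Authorization": "Bearer <IAM_TOKEN>"}       # IAM token left as a placeholder

# 1. Create the cluster network itself (fields are assumptions, not documented values).
cn = requests.post(f"{BASE}/cluster_networks", params=PARAMS, headers=HEADERS, json={
    "name": "h100-cluster-network",
    "vpc": {"id": "<vpc-id>"},
    "zone": {"name": "us-south-1"},
    "profile": {"name": "h100"},
}).json()

# 2. Create eight cluster network subnets, one per GPU / cluster NIC.
for i in range(8):
    requests.post(f"{BASE}/cluster_networks/{cn['id']}/subnets",
                  params=PARAMS, headers=HEADERS,
                  json={"name": f"cluster-subnet-{i}", "total_ipv4_address_count": 64})
```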
Once the subnets are created, NVIDIA H100 server instances that attach to them must be provisioned. The NVIDIA H100 instance profiles can be added to a cluster network from the provisioning page, which gives the VSI the appropriate set of Cluster Network Attachments.
Cluster Networks on the NVIDIA H100 instance use SR-IOV to deliver performance, so Cluster Networks can only be added to an instance while it is being provisioned or while it is stopped.
In the scenario above, the cluster network is made up of eight subnets, so once the instance is provisioned, each GPU has a 400 Gbps VF attached to it.
Using the Cluster Network
From within the virtual machine, the cluster network appears on the PCI bus as Virtual Functions (VFs), backed by the underlying NVIDIA ConnectX-7 NICs.
Several distinct blocks are visible. A primary block holds the boot disk, cloud NICs, and data volumes. After that, there is a block per GPU containing the NVIDIA H100 GPU, its NVIDIA ConnectX-7 VF, and the associated Instance Storage NVMe disk. If the user selects extra VFs (16 or 32), each GPU block includes two or four VFs respectively.
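A quick way to confirm this layout from inside the instance is to list the devices the guest can see. The minimal sketch below assumes lspci and the rdma-core ibv_devinfo utility are available on the guest image:

```python
import subprocess

# List the ConnectX-7 virtual functions on the PCI bus (they appear as Mellanox devices).
subprocess.run("lspci | grep -i mellanox", shell=True)

# Show the RDMA devices and their link state as seen by the verbs stack.
subprocess.run(["ibv_devinfo"], check=True)
```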
To use these VFs with RDMA, users must make sure the NVIDIA MLNX_OFED drivers are installed. NVIDIA recommends them in order to make full use of its RDMA network and to integrate closely with its NCCL backend.
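With the drivers in place, NCCL picks the VFs up through the verbs interface. A minimal multi-node smoke test might look like the sketch below, assuming PyTorch built with NCCL support and a launcher such as torchrun setting the usual rank and world-size environment variables:

```python
import os
import torch
import torch.distributed as dist

# torchrun (or an equivalent launcher) is assumed to set RANK, WORLD_SIZE and MASTER_ADDR.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# All-reduce a 1 GiB tensor across every GPU in the job; with RoCE GDR configured,
# the inter-node traffic flows over the ConnectX-7 cluster NICs.
payload = torch.ones(256 * 1024 * 1024, device="cuda")  # 1 GiB of float32
dist.all_reduce(payload)
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: all_reduce complete, sum element = {payload[0].item()}")

dist.destroy_process_group()
```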