AI Hypercomputer is a fully integrated supercomputing architecture for AI workloads that is remarkably simple to use. This blog outlines four typical AI Hypercomputer use cases, with tutorials and reference architectures, which are just a handful of the many applications available today.
AI Hypercomputer use cases
Let’s dive deeper into each AI Hypercomputer use case:

Reliable AI inference
According to Futurum, in 2023 Google experienced roughly three times fewer outage hours than Azure and three times fewer than AWS. Although the figures change over time, maintaining high availability is hard for everyone. For high-reliability inference, the AI Hypercomputer architecture provides fully integrated capabilities.
With its 99.95% pod-level uptime SLA, GKE Autopilot is the first choice for many customers. By following security best practices and automatically managing nodes (provisioning, scaling, upgrades, and repairs), Autopilot improves reliability while relieving you of manual infrastructure tasks. Together with resource optimisation and integrated monitoring, this automation reduces downtime and keeps your applications running safely and efficiently.
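As a rough illustration, here is a minimal sketch of creating an Autopilot cluster with the google-cloud-container Python client; the project, region, and cluster name are placeholders, and the same result can be achieved with gcloud or Terraform.

```python
# Minimal sketch (assumed placeholders): create a GKE Autopilot cluster
# with the google-cloud-container client library.
from google.cloud import container_v1

def create_autopilot_cluster(project_id: str, region: str, name: str):
    client = container_v1.ClusterManagerClient()
    cluster = container_v1.Cluster(
        name=name,
        # Autopilot mode: Google manages node provisioning, scaling,
        # upgrades, and repairs, which is what backs the pod-level SLA.
        autopilot=container_v1.Autopilot(enabled=True),
    )
    operation = client.create_cluster(
        parent=f"projects/{project_id}/locations/{region}",
        cluster=cluster,
    )
    return operation  # long-running operation; poll it to wait for completion

# Example (placeholder values):
# create_autopilot_cluster("my-project", "us-central1", "inference-cluster")
```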
Although there are a number of possible configurations, the reference architecture uses SSDs (such as Hyperdisk ML) to speed up loading of model weights, together with JAX, GCS FUSE, and TPUs with the JetStream engine to accelerate inference. Two notable additions to the stack help reach high reliability: service extensions and custom metrics.
- Service extensions let you customise the behaviour of the Cloud Load Balancer by adding your own code (written as plugins) to the data path, enabling more sophisticated traffic control and manipulation.
- Custom metrics let applications report workload-specific performance data (such as model serving latency) to the Cloud Load Balancer using the Open Request Cost Aggregation (ORCA) protocol; the load balancer then uses this data to make smarter routing and scaling decisions, as sketched below.
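To make the custom-metrics idea concrete, here is a minimal sketch of an inference endpoint that measures its own serving latency and attaches it to the response as an ORCA-style named metric. The header name, metric name, and text encoding are assumptions for illustration only; check the Cloud Load Balancing custom-metrics documentation for the exact wire format, and treat run_model as a stand-in for the real JetStream call.

```python
# Sketch: report a workload-specific metric (model serving latency) back to
# the load balancer as an ORCA-style response header. Header name and
# encoding below are assumptions, not the documented wire format.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_model(prompt: bytes) -> bytes:
    # Placeholder for the actual JetStream/JAX inference call.
    return b"generated text"

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        start = time.monotonic()
        output = run_model(body)
        latency_s = time.monotonic() - start

        self.send_response(200)
        # ORCA-style named metric the load balancer can use for routing
        # and autoscaling decisions (assumed header name and format).
        self.send_header(
            "endpoint-load-metrics",
            f"TEXT named_metrics.model_serving_latency={latency_s:.4f}",
        )
        self.send_header("Content-Length", str(len(output)))
        self.end_headers()
        self.wfile.write(output)

# HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```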

Large-scale AI training
Training AI models requires computing at very large, efficient scale. Hypercompute Cluster, a supercomputing solution built on AI Hypercomputer, lets you deploy and manage a large group of accelerators as a single unit with a single API request. Hypercompute Cluster stands out for the following reasons:
- Clusters are densely physically co-located for ultra-low-latency networking. They include cluster-level observability, health monitoring, and diagnostic tools, as well as pre-configured, validated templates for reliable and repeatable deployments.
- Hypercompute Clusters are deployed with the Cluster Toolkit and are designed to integrate with orchestrators such as GKE and Slurm, which simplifies ongoing maintenance. GKE supports more than 50,000 TPU chips for training a single machine learning model; a sketch of how a JAX job sees all of those chips follows this list.
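The sketch below shows, under simplifying assumptions, how a single JAX training job spread across many hosts in such a cluster discovers all of its chips; every host runs the same script, and JAX picks up the coordination details from the environment the orchestrator provides.

```python
# Sketch: how one JAX job sees every accelerator in a large multi-host
# deployment. Each host runs this same script.
import jax

jax.distributed.initialize()  # required for multi-host jobs; JAX discovers its peers

print(f"process {jax.process_index()} of {jax.process_count()}")
print(f"local devices:  {jax.local_device_count()}")   # chips attached to this host
print(f"global devices: {jax.device_count()}")          # all chips in the job
```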
The reference architecture uses A3 Ultra VMs and GKE Autopilot.
- GKE’s support for up to 65,000 nodes is believed to be more than ten times larger than that of the other two largest public cloud providers.
- Compared to A3 Mega, A3 Ultra VMs use NVIDIA H200 GPUs, which offer twice the high-bandwidth memory (HBM) and twice the GPU-to-GPU networking bandwidth. They are built with the new Titanium ML network adapter and NVIDIA ConnectX-7 network interface cards (NICs) to deliver a high-performance, secure cloud experience for large multi-node GPU workloads, as sketched below.
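As a rough sketch of what data-parallel training looks like on such machines, the following JAX snippet shards a batch across every GPU visible to the job; the shapes and the toy forward pass are placeholders, not the reference architecture's actual model.

```python
# Sketch: data-parallel sharding across all GPUs (or TPUs) visible to the job.
# Shapes and the toy model below are placeholders.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())                 # e.g. 8 H200 GPUs per A3 Ultra VM
mesh = Mesh(devices, axis_names=("data",))
batch_sharding = NamedSharding(mesh, P("data"))   # split the batch across devices
replicated = NamedSharding(mesh, P())             # replicate the parameters

params = jax.device_put(jnp.ones((4096, 4096)), replicated)
batch = jax.device_put(jnp.ones((len(devices) * 8, 4096)), batch_sharding)

@jax.jit
def forward(params, batch):
    return jnp.tanh(batch @ params)   # stand-in for a real model step

out = forward(params, batch)          # runs data-parallel across the mesh
```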

Affordable AI inference
Serving large language models (LLMs) in particular can become prohibitively expensive. To reduce costs, AI Hypercomputer combines a variety of specialised hardware, open software, and flexible consumption models.
- Cost savings can be found everywhere if you know where to look. Beyond the tutorials, there are two cost-effective deployment models to be aware of: Spot VMs can save up to 90% on batch or fault-tolerant workloads, while GKE Autopilot can cut container running costs by up to 40% compared to standard GKE by automatically scaling resources to actual demand. GKE Autopilot’s “Spot Pods” combine the two for even greater savings, as sketched below.
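As a concrete illustration, here is a minimal sketch of requesting Spot capacity for a fault-tolerant Deployment on GKE Autopilot with the Kubernetes Python client; the image, resource requests, and names are placeholders.

```python
# Sketch: request Spot Pods for a fault-tolerant workload on GKE Autopilot.
# Image, names, and resource requests are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

container = client.V1Container(
    name="batch-inference",
    image="us-docker.pkg.dev/my-project/my-repo/inference:latest",  # placeholder
    resources=client.V1ResourceRequirements(requests={"cpu": "2", "memory": "8Gi"}),
)
pod_spec = client.V1PodSpec(
    containers=[container],
    # Autopilot schedules these Pods onto Spot capacity (up to ~90% cheaper);
    # the workload must tolerate preemption.
    node_selector={"cloud.google.com/gke-spot": "true"},
)
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="spot-inference"),
    spec=client.V1DeploymentSpec(
        replicas=4,
        selector=client.V1LabelSelector(match_labels={"app": "spot-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "spot-inference"}),
            spec=pod_spec,
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```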
After training in JAX, this reference architecture converts the model to NVIDIA’s FasterTransformer format for inference. NVIDIA Triton on GKE Autopilot serves the optimised models; a pre-built NeMo container simplifies setup, and Triton’s multi-model capability makes it easy to adapt to changing model topologies.
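For a sense of what the serving path looks like from a client’s perspective, here is a minimal sketch of querying Triton over HTTP with the tritonclient library; the server URL, model name, and tensor names are assumptions and must match your Triton model configuration.

```python
# Sketch: call a Triton-served model over HTTP. The URL, model name, and
# tensor names below are assumed and must match your model config.
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="triton.example.internal:8000")

token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)
infer_input = httpclient.InferInput("input_ids", list(token_ids.shape), "INT32")
infer_input.set_data_from_numpy(token_ids)

result = triton.infer(model_name="llm_fastertransformer", inputs=[infer_input])
print(result.as_numpy("output_ids"))  # assumed output tensor name
```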

Easy cluster setup and deployment
You need technologies that make setting up infrastructure easier, not harder. The open-source Cluster Toolkit provides pre-built components and blueprints for quick, repeatable cluster deployments, with straightforward integration for PyTorch, Keras, and JAX. Platform teams benefit from a wide choice of hardware, flexible consumption models such as Dynamic Workload Scheduler, and simpler management through Slurm, GKE, and Batch. This reference design installs Slurm on an A3 Ultra cluster; a sketch of submitting a job to that cluster follows.
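As an illustration of day-to-day use once the cluster is up, the sketch below submits a multi-node job to Slurm from Python; the partition name, node count, and script path are placeholders for your deployment.

```python
# Sketch: submit a multi-node training job to the Slurm cluster deployed by
# the Cluster Toolkit blueprint. Partition, node count, and script path are
# placeholders.
import subprocess

sbatch_script = "train_llm.sbatch"   # your job script (placeholder)

result = subprocess.run(
    [
        "sbatch",
        "--nodes=4",                 # 4 x A3 Ultra VMs = 32 GPUs (placeholder)
        "--gpus-per-node=8",
        "--partition=a3ultra",       # placeholder partition name
        sbatch_script,
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())          # e.g. "Submitted batch job 42"
```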
