Wednesday, April 2, 2025

Google Cloud’s A4X VMs: AI at Scale with NVIDIA GB200!

Reasoning models, which apply step-by-step thinking at inference time to tackle complex problems, represent the next frontier of artificial intelligence. Training and serving this new class of models requires infrastructure that can manage massive datasets and long context windows while delivering fast, dependable responses. To keep pushing the envelope, you need a system designed to handle requirements that are not yet known.

Google Cloud is thrilled to announce the preview of A4X VMs, powered by the NVIDIA GB200 NVL72: a system of 36 Arm-based NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs connected by fifth-generation NVIDIA NVLink. This tightly integrated system directly addresses the high compute and memory demands of reasoning models that employ chain-of-thought, enabling new levels of AI performance and accuracy.

Google Cloud is currently the only cloud provider to offer both A4 virtual machines (VMs) with NVIDIA B200 GPUs and A4X VMs with NVIDIA GB200 NVL72.

Key A4X VM features and capabilities

A4X VMs build on several significant advancements that pave the way for the future of AI:

NVIDIA GB200 NVL72

Shared memory and high-bandwidth connectivity let all 72 Blackwell GPUs operate as a single, cohesive compute unit. This unified architecture enables, for example, low-latency responses for multimodal reasoning across concurrent inference requests.
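As a rough, hedged illustration of what "one compute unit" means in practice: within a single NVL72 domain, a framework such as PyTorch can treat all the GPUs as one collective group. In this minimal sketch, the NCCL backend and launcher setup are standard PyTorch distributed mechanics, not an A4X-specific API:

    import torch
    import torch.distributed as dist

    # Launched once per GPU by a launcher such as torchrun; inside an
    # NVL72 domain, all 72 GPUs can join a single process group.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # A collective like all_reduce stays on NVLink when every rank
    # sits inside the same NVL72 rack, rather than crossing the network.
    t = torch.ones(1024, device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)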

NVIDIA Grace CPUs

These custom Arm-based chips connect to the Blackwell GPUs over NVLink chip-to-chip (C2C) interfaces, making it efficient to checkpoint, offload, and rematerialize the model and optimizer state required to train and serve the largest models.
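A hedged sketch of what checkpoint-style offload can look like from PyTorch (the tensor size is arbitrary; pinned host memory and non-blocking copies are standard PyTorch features, and the Grace-to-Blackwell C2C link is what makes this round trip cheap on GB200):

    import torch

    # Optimizer state resident on the GPU during a training step.
    state = torch.randn(1_000_000, device="cuda")

    # Offload to host (Grace) memory; pinned memory enables async DMA.
    host_buf = torch.empty_like(state, device="cpu").pin_memory()
    host_buf.copy_(state, non_blocking=True)

    # ...GPU memory can now be reused for activations...

    # Rematerialize on the GPU when the state is needed again.
    state = host_buf.to("cuda", non_blocking=True)
    torch.cuda.synchronize()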

Enhanced training performance

A4X delivers a 4x boost in LLM training performance over A3 VMs with NVIDIA H100 GPUs, with more than one exaflop of compute per GB200 NVL72 system.
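As a quick back-of-the-envelope check of what that headline figure implies per GPU (real throughput depends on numeric precision and workload):

    # Rough per-GPU share of the quoted rack-level compute.
    rack_exaflops = 1.0    # >1 exaflop per GB200 NVL72 system (quoted above)
    gpus_per_rack = 72

    per_gpu_pflops = rack_exaflops * 1_000 / gpus_per_rack
    print(f"~{per_gpu_pflops:.1f} PFLOPS per GPU")  # ~13.9 PFLOPS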

Scalability and parallelization

A4X VMs let you deploy models across tens of thousands of Blackwell GPUs, using the latest sharding and pipelining techniques to maximize GPU utilization. Google Cloud’s high-performance networking, built on RDMA over Converged Ethernet (RoCE), combines NVL72 racks into single, rail-aligned, non-blocking clusters of tens of thousands of GPUs. This is about scaling your most intricate models effectively, not simply about size.
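As one hedged illustration of the sharding techniques this refers to, here is a minimal JAX sketch that splits a matrix across whatever accelerators are visible; the (data, model) mesh shape and array sizes are arbitrary examples, not an A4X-specific recipe:

    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec

    # Arrange every visible accelerator into a (data, model) mesh;
    # the model axis is size 1 here, so this is pure data sharding.
    devices = np.array(jax.devices()).reshape(-1, 1)
    mesh = Mesh(devices, axis_names=("data", "model"))
    spec = NamedSharding(mesh, PartitionSpec("data", "model"))

    # Rows are split across the data axis; sized to divide evenly.
    n_dev = len(jax.devices())
    x = jax.device_put(jnp.ones((n_dev * 1024, 4096)), spec)

    # jit-compiled ops run on the sharded array; XLA inserts the needed
    # communication (NVLink within a rack, RoCE across racks).
    y = jax.jit(lambda a: a @ a.T)(x)
    print(y.sharding)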

Optimized for reasoning and inference

With its 72-GPU NVLink domain, the A4X architecture is purpose-built for low-latency inference, particularly for reasoning models that use chain-of-thought techniques. Because all 72 GPUs can share memory and workload, including the KV cache for long-context models, latency stays low; the large NVLink domain also improves batch-size scalability and lowers TCO, so you can serve more concurrent user requests.
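A back-of-the-envelope KV-cache estimate shows why a shared-memory domain matters here; the model dimensions below are illustrative placeholders, not any particular model:

    # Hypothetical long-context model, chosen only to illustrate the math.
    layers, kv_heads, head_dim = 80, 8, 128
    bytes_per_value = 2        # fp16/bf16
    context_len = 1_000_000    # long-context request
    batch = 64                 # concurrent requests

    # Each layer stores one key and one value vector per KV head per token.
    kv_bytes_per_token = layers * kv_heads * head_dim * 2 * bytes_per_value
    total_gib = kv_bytes_per_token * context_len * batch / 2**30
    print(f"KV cache: ~{total_gib:,.0f} GiB")  # far more than one GPU's HBM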

The Google Cloud advantage

A4X VMs are part of AI Hypercomputer, Google Cloud’s supercomputing architecture, and draw on Google Cloud’s expertise in data centers, infrastructure, and software. Through AI Hypercomputer, A4X customers benefit from:

Hypercompute Cluster

Hypercompute Cluster lets you deploy and manage large clusters of A4X VMs, combining compute, networking, and storage into one cohesive unit. It delivers high performance and resilience for massive distributed workloads while keeping complexity manageable. For A4X in particular, Hypercompute Cluster’s topology-aware scheduling understands the NVL72 domains, so applications can take full advantage of the high-bandwidth NVLink fabric. It also provides observability across the GPUs, the NVLink network, and the data-center network fabric, along with NCCL profiling to help infrastructure teams identify and resolve problems quickly.
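As a small hedged example of NCCL-level visibility, standard NCCL environment variables surface collective and transport details in logs; these are generic NCCL knobs, not a Hypercompute Cluster API:

    import os

    # Generic NCCL debugging knobs; set before the process group exists.
    os.environ["NCCL_DEBUG"] = "INFO"             # log topology/transport setup
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on init and network paths

    import torch.distributed as dist

    # Under a launcher such as torchrun, the logs now show which paths
    # (NVLink vs. RoCE) each communicator uses.
    dist.init_process_group(backend="nccl")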

High-performance networking fabric

A4X VMs include the Titanium ML network adapter, built on NVIDIA ConnectX-7 network interface cards (NICs), which delivers the security and agility of Google Cloud without sacrificing the speed that machine learning workloads demand. With RoCE, the A4X system provides non-blocking GPU-to-GPU bandwidth of 28.8 Tbps (72 × 400 Gbps). A4X’s network architecture is rail-optimized, which lowers latency for GPU collectives and boosts efficiency. By combining NVL72 domains with its Jupiter network architecture, Google Cloud scales to tens of thousands of GPUs in a single non-blocking cluster.
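The aggregate figure is simple to sanity-check from the per-GPU NIC rate quoted above:

    # 72 GPUs, each with 400 Gbps of RoCE bandwidth.
    gpus = 72
    gbps_per_gpu = 400

    total_tbps = gpus * gbps_per_gpu / 1_000
    print(f"{total_tbps} Tbps")  # 28.8 Tbps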

Advanced liquid cooling

A4X VMs are cooled by Google’s third-generation liquid cooling system. Steady, efficient cooling is essential to avoid thermal throttling and maintain peak compute performance. Google’s liquid-cooling approach builds on lessons learned from years of global operations, and because Google has already solved the challenges of deploying and maintaining liquid-cooled infrastructure at scale, A4X will become available in more Google Cloud regions, giving customers around the world faster access to this powerful technology.

Software ecosystem optimization

Software choices are crucial, particularly for the A4X system with its Arm-based hosts. Google Cloud has partnered with NVIDIA to provide performance-optimized software, including libraries and drivers compatible with popular frameworks such as PyTorch and JAX. Keep an eye out for GPU recipes to get started with your inference and training workloads.
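Because the Grace hosts are aarch64, a quick sanity check that an Arm-compatible PyTorch build sees the GPUs can save debugging time; nothing in this sketch is A4X-specific:

    import platform
    import torch

    # On A4X the host CPU is Arm-based, so expect "aarch64" here.
    print("host arch:", platform.machine())
    print("cuda available:", torch.cuda.is_available())
    print("gpus visible:", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))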

Native integration across Google Cloud

A4X integrates simply with other Google Cloud services and applications.

Storage

Hyperdisk ML speeds up model load times by up to 11.9x compared with common alternatives, and A4X VMs integrate directly with Cloud Storage FUSE for 2.9x higher training throughput than native ML framework data loaders.
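With a Cloud Storage FUSE mount, checkpoints and training data read like local files. A minimal hedged sketch, where the mount point and object path are hypothetical placeholders:

    import torch

    # Hypothetical GCS FUSE mount: bucket "my-bucket" mounted at /gcs.
    ckpt_path = "/gcs/my-bucket/checkpoints/step_1000.pt"

    # Plain file I/O works because FUSE presents objects as files.
    state_dict = torch.load(ckpt_path, map_location="cpu")
    print(f"loaded {len(state_dict)} tensors")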

Google Kubernetes Engine (GKE)

A4X VMs work with GKE, Google Cloud’s industry-leading container management platform, to scale AI/ML training and serving workloads while maximizing resource utilization. With each cluster supporting up to 65,000 nodes, the combination unlocks new levels of AI performance, running extra-large-scale AI workloads with low-latency inference and workload sharing across 72 GPUs.
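A hedged sketch of requesting GPUs on a GKE node via the Kubernetes Python client; the container image is a placeholder, and the four-GPUs-per-node figure is an assumption about the A4X machine shape that should be adjusted to the real configuration:

    from kubernetes import client, config

    config.load_kube_config()

    # One pod requesting all GPUs on a single node.
    # Assumption: 4 GPUs per A4X VM; adjust to the real machine shape.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="a4x-train"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(
                name="trainer",
                image="us-docker.pkg.dev/my-project/repo/train:latest",  # placeholder
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "4"},
                ),
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)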

Vertex AI Platform

Vertex AI is a managed, open, and integrated AI development platform that accelerates AI projects. With access to Google’s latest Gemini models as well as a broad selection of first-party and open models, you can quickly train, fine-tune, and deploy machine learning models.
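A minimal hedged sketch of deploying an uploaded model with the Vertex AI Python SDK; the project, region, model resource name, and machine/accelerator choices are all placeholders, not A4X-specific values:

    from google.cloud import aiplatform

    # Placeholders: substitute your own project, region, and model.
    aiplatform.init(project="my-project", location="us-central1")

    model = aiplatform.Model(
        "projects/my-project/locations/us-central1/models/1234567890"
    )
    endpoint = model.deploy(
        machine_type="n1-standard-8",        # illustrative machine shape
        accelerator_type="NVIDIA_TESLA_T4",  # illustrative accelerator
        accelerator_count=1,
        min_replica_count=1,
    )
    print(endpoint.resource_name)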

A strategic partnership

NVIDIA DGX Cloud, a fully managed AI platform, will also soon be available on A4X VMs to help accelerate customers’ AI projects.

Developers and researchers need access to the latest technologies to train and deploy AI models for specific applications and industries. The NVIDIA partnership with Google gives companies improved scalability and performance for the most complex workloads in scientific computing, generative AI, and LLMs, while benefiting from Google Cloud’s ease of use and global presence.

Customers like Magic have chosen Google Cloud’s A4X VMs to develop their state-of-the-art models.

Magic is excited to build its next-generation AI supercomputer on Google Cloud in collaboration with Google and NVIDIA. Google Cloud’s A4X VMs with NVIDIA’s GB200 NVL72 system will significantly increase the inference and training efficiency of Magic’s models, while offering the fastest time to scale and a robust suite of cloud services.

Choosing the right VM: A4 vs. A4X

Google Cloud offers both A4 virtual machines (VMs) with NVIDIA B200 GPUs and, more recently, A4X VMs with NVIDIA GB200 NVL72.

A4X VMs (powered by NVIDIA GB200 NVL72 GPUs)

Purpose-built for training and serving the most complex, extra-large-scale AI workloads: reasoning models, large language models with long context windows, and scenarios that demand massive parallelism. Unified memory across a large GPU domain makes this possible.

A4 VMs (powered by NVIDIA B200 GPUs)

A4 offers excellent performance and versatility for a wide range of AI model architectures and workloads, including training, fine-tuning, and serving. It delivers optimized performance for a range of scaled training tasks and a simple migration path from earlier generations of Cloud GPUs.
