Google Cloud A4 VMs
Modern AI workloads demand powerful accelerators and high-speed interconnects to run complex model architectures across an ever-expanding range of model sizes and modalities. These intricate models require the newest high-performance computing technologies not only for large-scale training, but also for fine-tuning and inference.
With the preview of A4 virtual machines (VMs), powered by NVIDIA HGX B200, Google Cloud is bringing the much-anticipated NVIDIA Blackwell GPUs to its customers. The A4 VM couples eight Blackwell GPUs with fifth-generation NVIDIA NVLink and provides a notable performance improvement over the A3 High VM: each GPU delivers 2.25 times the peak compute and 2.25 times the HBM capacity. This makes A4 VMs a flexible choice for training and fine-tuning across a variety of model designs, and the additional compute and HBM capacity also makes them well suited to low-latency serving.
Google Cloud A4 VM innovations
Google Cloud A4 VMs combine Blackwell GPUs with Google’s infrastructure advances to give Google Cloud users the best possible cloud experience, from cost optimisation and ease of use to scale and performance. Among these innovations are:
Improved networking
A4 VMs run on servers equipped with Google Cloud’s Titanium ML network adapter, which builds on NVIDIA ConnectX-7 network interface cards (NICs) and is optimised to provide a secure, high-performance cloud experience for AI workloads. Paired with Google Cloud’s datacenter-wide 4-way rail-aligned network, A4 VMs deliver non-blocking 3.2 Tbps of GPU-to-GPU traffic using RDMA over Converged Ethernet (RoCE). Customers can scale to tens of thousands of GPUs with Google Cloud’s Jupiter network fabric, which offers 13 Petabits/sec of bi-sectional bandwidth.
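As a quick sanity check on the figures above, the 3.2 Tbps non-blocking GPU-to-GPU number works out to 400 Gbps per GPU across the eight Blackwell GPUs in a VM (a back-of-the-envelope sketch; the even per-GPU split is an assumption for illustration):

```python
# Back-of-the-envelope check of the A4 VM's GPU-to-GPU bandwidth figure.
# Assumption for illustration: the aggregate 3.2 Tbps of RoCE traffic is
# split evenly across the eight GPUs in the VM.
GPUS_PER_VM = 8
TOTAL_GPU_TO_GPU_GBPS = 3200  # 3.2 Tbps expressed in Gbps

per_gpu_gbps = TOTAL_GPU_TO_GPU_GBPS / GPUS_PER_VM
print(per_gpu_gbps)  # 400.0
```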
Google Kubernetes Engine
For customers looking to deploy a reliable, production-ready AI platform, GKE is the most scalable and fully automated Kubernetes service, supporting up to 65,000 nodes per cluster. Google Cloud A4 VMs come with native GKE integration, and GKE integrates with other Google Cloud services to provide a stable environment for the distributed computing and data processing that underpin AI workloads.
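As a sketch of what consuming these GPUs through GKE might look like, the Pod manifest below requests all eight GPUs of an A4 node using the standard Kubernetes GPU resource. The accelerator label value and container image are illustrative placeholders, not confirmed values from this announcement:

```yaml
# Hypothetical Pod spec requesting a full A4 node's GPUs on GKE.
# The node-selector value and image are illustrative placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: a4-training-job
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-b200   # assumed label value
  containers:
  - name: trainer
    image: us-docker.pkg.dev/my-project/trainer:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8   # all eight Blackwell GPUs on the node
```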
Vertex AI
A4 VMs will be available through Vertex AI, Google Cloud’s fully managed, unified AI development platform for building and using generative AI, which is powered by the underlying AI Hypercomputer architecture.
Open-source software
Google Cloud collaborates closely with NVIDIA to optimise JAX and XLA, in addition to PyTorch and CUDA, enabling collective communication to overlap with computation on GPUs. Google Cloud has also published sample scripts and optimised model configurations for GPUs with the recommended XLA flags enabled.
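For JAX workloads, XLA's GPU behaviour is typically tuned through the XLA_FLAGS environment variable, set before JAX is imported. The flag shown below is an illustrative assumption, not the exact configuration shipped in the sample scripts:

```python
import os

# Illustrative sketch: enable XLA's latency-hiding scheduler so collective
# communication can overlap with GPU computation. The specific flag is an
# assumption; consult the published sample scripts for the recommended set.
os.environ["XLA_FLAGS"] = " ".join([
    "--xla_gpu_enable_latency_hiding_scheduler=true",
])

print(os.environ["XLA_FLAGS"])
# import jax  # JAX picks up XLA_FLAGS at import time
```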
Hypercompute Cluster
Hypercompute Cluster, Google Cloud’s new highly scalable clustering system, simplifies workload and infrastructure provisioning, as well as the ongoing operation of AI supercomputers, through close integration with GKE and Slurm.
Flexible consumption models
Google Cloud reimagined cloud consumption for the specific requirements of AI workloads with Dynamic Workload Scheduler, which offers two modes for distinct workloads: Calendar mode for predictable job start times and durations, and Flex Start mode for improved availability and better economics. These modes complement the On-demand, Committed use discount, and Spot consumption models.
Multi-asset-class quantitative trading firm Hudson River Trading will use Google Cloud A4 VMs to train its next generation of capital-markets model research. With high-bandwidth memory and improved inter-GPU connectivity, A4 VMs are well suited to the demands of complex algorithms and larger datasets, helping Hudson River Trading respond faster to market changes.
Hudson River Trading is excited to utilise A4, powered by NVIDIA’s Blackwell B200 GPUs. Running its workloads on state-of-the-art AI infrastructure is key to enabling low-latency trading decisions and improving its models across markets, and the firm looks forward to using Hypercompute Cluster advancements to speed up the training of the latest models that power its quant-based algorithmic trading.
NVIDIA and Google Cloud have a long-standing collaboration to bring the most cutting-edge GPU-accelerated AI infrastructure to customers. The Blackwell architecture is a major advancement for the AI industry, and the B200 GPU is now accessible with the new Google Cloud A4 VMs. Both companies are excited to see how users expand their AI missions with the new service.
A4 VMs and Hypercompute Cluster work better together
Scaling AI model training requires precise and scalable orchestration of infrastructure resources. These workloads frequently span thousands of virtual machines, pushing the limits of compute, networking, and storage.
Hypercompute Cluster lets customers deploy and manage these large clusters of Google Cloud A4 VMs, with compute, networking, and storage treated as a single unit. It delivers high performance and resilience for massive distributed workloads while keeping complexity manageable. Hypercompute Cluster is designed to:
- Deliver high performance through dense co-location of A4 VMs, enabling optimal workload placement.
- Optimise resource scheduling and workload performance with GKE and Slurm, which include intelligent features such as topology-aware scheduling.
- Increase reliability with proactive health checks, automatic failure recovery, and built-in self-healing capabilities.
- Improve monitoring and observability for timely, customised insights.
- Automate provisioning, scaling, and configuration through the Slurm and GKE integrations.
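For the Slurm path, a multi-node training job on such a cluster might be submitted with a batch script along these lines (a hedged sketch: the node count, task layout, and training entry point are illustrative placeholders, not values from this announcement):

```shell
#!/bin/bash
# Hypothetical Slurm batch script for a multi-node A4 training job.
# Resource counts and the training command are illustrative placeholders.
#SBATCH --job-name=a4-train
#SBATCH --nodes=16               # 16 A4 VMs = 128 B200 GPUs
#SBATCH --ntasks-per-node=8      # one task per GPU
#SBATCH --gpus-per-node=8

srun python train.py             # placeholder training entry point
```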
Google Cloud is the first hyperscaler to announce a preview of an NVIDIA Blackwell B200-based offering. Together, A4 VMs and Hypercompute Cluster make it easier for businesses in every sector to develop and deploy AI solutions.