Networking for AI Workloads
Google Cloud aims to simplify the process of integrating AI models into its infrastructure. This article examines how the Cross-Cloud Network solution helps your artificial intelligence workloads.
Managed and Unmanaged AI options
For AI workloads, Google Cloud offers both managed (Vertex AI) and do-it-yourself (DIY) options.
Vertex AI
Vertex AI is a fully managed machine learning platform. In addition to pre-trained Google models, it provides access to third-party models via Model Garden. Because the managed service takes care of infrastructure administration, you can focus on training, fine-tuning, and inferencing with your AI models.
Custom infrastructure deployments
These deployments use different compute, storage, and networking options depending on the kind of workload being run. One approach is to use AI hypercomputers to deploy networking for AI workloads that run on GPUs or TPUs, as well as for HPC tasks that might not require them.
Networking for managed AI
You don't need to be concerned about the supporting infrastructure while using Vertex AI. By default, the service is reachable through a public API. Organisations that want private connectivity can choose among Private Service Access, Private Google Access, Private Service Connect endpoints, and Private Service Connect for Google APIs. The right option depends on the Vertex AI service you are utilising. Additional information is available in the documentation on accessing Vertex AI from on-premises and multicloud environments.
Networking AI infrastructure deployments
Let's examine one example: an organisation wants to set up an AI cluster with GPUs on Google Cloud, but its data resides in another cloud.
Given this requirement, you must examine the networking across four phases: planning, data ingestion, training, and inference.
Planning
This important first step involves determining your needs: the cluster's size (number of GPUs), the type of GPUs required, the preferred deployment region and zone, storage, and the expected network bandwidth for transfers. This planning informs the next stages. For example, fine-tuning smaller models requires a far smaller cluster than training large language models such as Llama, which have billions of parameters.
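The sizing intuition above can be made concrete with some rough arithmetic. The sketch below is illustrative only: the 16 bytes-per-parameter figure (mixed-precision weights, gradients, and optimiser state) and the 80 GB per-GPU memory are assumptions, not Google Cloud specifications.

```python
import math

# Rough cluster-sizing sketch for the planning step (illustrative numbers only).
BYTES_PER_PARAM = 16   # assumed: fp16 weights + fp32 gradients + optimiser state
GPU_MEMORY_GB = 80     # assumed per-accelerator memory (A100/H100-class)

def min_gpus_for_model(params_billions: float) -> int:
    """Lower bound on GPUs needed just to hold model state (ignores activations)."""
    total_gb = params_billions * 1e9 * BYTES_PER_PARAM / 1e9
    return max(1, math.ceil(total_gb / GPU_MEMORY_GB))

# A 7B-parameter fine-tune fits on a handful of GPUs; a 70B model needs far more.
print(min_gpus_for_model(7))    # -> 2
print(min_gpus_for_model(70))   # -> 14
```

Real deployments also budget for activations, batch size, and parallelism strategy, so treat this as a floor, not a recommendation.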
Data ingestion
Since the data is stored in a different cloud, you need a fast connection to access it directly or move it to a Google Cloud storage option. Cross-Cloud Interconnect makes this possible by providing a high-bandwidth direct connection with a choice of 10 Gbps or 100 Gbps per link. If the data is on-premises, Cloud Interconnect is another option.
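To see why link size matters, here is a back-of-the-envelope transfer-time estimate. The 80% efficiency factor is an assumption standing in for protocol and storage overhead; real throughput varies.

```python
# Transfer-time sketch for the data-ingestion step over one Cross-Cloud
# Interconnect link (10 Gbps or 100 Gbps). Efficiency is an assumption.

def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Approximate hours to move dataset_tb terabytes over a single link."""
    bits = dataset_tb * 1e12 * 8               # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency
    return bits / effective_bps / 3600

# Moving a hypothetical 100 TB training set:
print(f"10 Gbps:  {transfer_hours(100, 10):.1f} h")    # ~27.8 h
print(f"100 Gbps: {transfer_hours(100, 100):.1f} h")   # ~2.8 h
```

A 100 Gbps link turns a day-long copy into a few hours, which is often the difference between a one-off migration and a workable recurring pipeline.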
Training
Training workloads require high bandwidth, low latency, and lossless cluster networking. Remote Direct Memory Access (RDMA) permits GPU-to-GPU communication without OS intervention. Google Cloud networking supports the RDMA over Converged Ethernet (RoCE) protocol in specific VPC networks that use the RDMA network profile. Proximity is crucial for performance, so nodes and clusters should be deployed as close to one another as possible.
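The bandwidth requirement becomes intuitive when you estimate the gradient traffic of data-parallel training. The sketch below uses the standard ring all-reduce cost model, where each GPU exchanges roughly 2·(N−1)/N times the gradient size per step; the model size and GPU count are hypothetical.

```python
# Why training needs RDMA/RoCE and physical proximity: per-step gradient
# traffic for data-parallel training with ring all-reduce (cost model sketch).

def allreduce_gb_per_gpu(params_billions: float, n_gpus: int,
                         bytes_per_grad: int = 2) -> float:
    """Approximate GB each GPU sends/receives per optimiser step (fp16 grads)."""
    grad_gb = params_billions * 1e9 * bytes_per_grad / 1e9
    return 2 * (n_gpus - 1) / n_gpus * grad_gb

# A hypothetical 70B-parameter model on 256 GPUs:
print(f"{allreduce_gb_per_gpu(70, 256):.0f} GB per GPU per step")  # -> 279 GB
```

Moving hundreds of gigabytes per step, every few seconds, is exactly the traffic pattern that lossless, OS-bypassing cluster fabric is built for.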
Inference
Inference requires low-latency communication to endpoints, which can be provided through connectivity solutions such as Private Service Connect, Cloud VPN, Network Connectivity Center (NCC), and VPC Network Peering.
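One way to compare these connectivity options is with a simple latency budget: the network round trip must fit inside whatever remains of the end-to-end target after model compute. All figures below (the 100 ms SLO, 60 ms of model time, 5 ms of serving overhead) are hypothetical.

```python
# Latency-budget check for the inference step (all numbers are illustrative).

def network_budget_ms(target_ms: float, model_ms: float, overhead_ms: float = 5.0) -> float:
    """Milliseconds left for the network round trip after model time and overhead."""
    return target_ms - model_ms - overhead_ms

# With a 100 ms SLO and 60 ms of model compute, ~35 ms remains for the network.
remaining = network_budget_ms(target_ms=100, model_ms=60)
print(f"{remaining:.0f} ms network budget")  # -> 35 ms
```

Under this budget, a cross-region VPN hop with a 40 ms round trip would miss the SLO, while an in-region Private Service Connect endpoint would comfortably fit.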
In the example above, the organisation would use:
- Cross-Cloud Interconnect to connect to Google Cloud, satisfying the need for a fast connection.
- RDMA networking with RoCE, since the workload is scheduled and needs to get the most out of its accelerators.
- Google Kubernetes Engine (GKE) as the compute option for deploying the cluster.
Accelerating the Enterprise AI Journey with Cross-Cloud Network
According to IDC, multicloud networking systems and services will be widely adopted by enterprises worldwide in 2024, spanning many use cases and industry verticals. These networks are essential for accelerating digital transformation and guaranteeing businesses long-term returns on their IT expenditures.
Customers can use Google Cloud's Cross-Cloud Network to:
- Utilise service-centric connectivity to streamline their multicloud networks.
- Secure their user traffic, data, and workloads consistently with real-time, ML-powered security.
This IDC whitepaper offers insights into key business objectives, IT issues, and industry trends regarding the design, planning, construction, and operation of multicloud networks.