Large language models (LLMs) and machine learning (ML) are advancing entire sectors in the field of artificial intelligence. However, to train ever-larger models on ever-larger datasets effectively, developers need distributed training environments that span multiple AI accelerators and many compute hosts. This can create challenges around scalability, resource management, and orchestration.
Google Cloud can help. To make large-scale distributed training easier, Google Cloud's AI Hypercomputer architecture combines a powerful array of GPU and TPU resources with sophisticated orchestration capabilities. In this blog post, we walk through Google Cloud's GPU accelerator orchestration tools so you can scale and optimise your machine learning workloads.
Choose the right accelerator family
The choice of GPU is a crucial component of distributed training. Specialised machine families from Google Cloud provide customised solutions for a range of performance and cost requirements. With NVIDIA H100 and NVIDIA H200 (upcoming) GPUs, the A3 machine series offers robust GPU-to-GPU bandwidth, making it an excellent choice for demanding training tasks. On the other hand, the A2 machine series with NVIDIA A100 GPUs is built for situations like efficient, single-node training that call for less inter-node communication. Furthermore, the flexible G2 machine family with NVIDIA L4 GPUs offers the adaptability required for tasks involving inference and testing.
To accommodate the demands of large-scale training, Google Cloud also provides a variety of GPU consumption models:
- Committed Use Discounts (CUDs) offer assured capacity and substantial cost savings in exchange for a sustained commitment.
- Dynamic Workload Scheduler (DWS) offers two modes aimed at workloads that either need certainty about their start time or can be flexible about it; in both cases, capacity is made available for a specified period of time at a reduced list price.
- Although capacity availability is not assured, on-demand usage offers the greatest flexibility with no upfront commitments.
- Spot virtual machines (VMs) are far less expensive, but they are preemptible and require job designs that are resilient and tolerant of interruptions.
To further speed up your distributed training, let's examine three powerful orchestration approaches on Google Cloud: Google Kubernetes Engine (GKE), Cluster Toolkit, and Vertex AI custom training. Each approach has its own advantages, allowing you to take advantage of Google Cloud's strong infrastructure to advance your machine learning projects in a timely and scalable manner.
Let's go through each of these options to see how Google Cloud's orchestration tools can help you maximise resources, simplify your projects, and achieve high performance in your machine learning workloads.
Option 1: GKE for unified workload management
For easier management, businesses with strong platform teams frequently seek a single environment to operate all of their workloads, including customised training. In this situation, GKE makes sense since it offers the scalability and flexibility needed to manage various workloads on a single platform. Platform teams may streamline administration and optimise resource usage while gaining centralised control and visibility with GKE.
Here’s how to coordinate GKE-based ML workloads:
GKE cluster and nodepool provisioning
If you would like to use Terraform and have a reservation (CUD or DWS Calendar), specify the parameter file (terraform.tfvars) and follow the instructions in the cluster provisioning templates:
cat >./terraform.tfvars <<EOF
project_id      = "PROJECT_XXXX"
resource_prefix = "a3mega-test"
region          = "us-east4"
node_pools = [
  {
    zone       = "us-east4-a"
    node_count = 2
    compact_placement_policy = {
      existing_policy_name = "your-compact-placement-policy-name"
      specific_reservation = "your-reservation-name"
    }
  },
]
EOF
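If you are applying the template directly with Terraform, the rest follows the standard Terraform workflow; the sketch below assumes you run it from the directory that contains the cluster provisioning template along with the terraform.tfvars file created above (the templates' own README may wrap these steps in a helper script):

# Initialise the working directory and download the required providers and modules
terraform init

# Preview the resources that will be created from terraform.tfvars
terraform plan -var-file=terraform.tfvars

# Provision the GKE cluster and A3 Mega node pool
terraform apply -var-file=terraform.tfvars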
Furthermore, Terraform-based example designs for provisioning A3 or A3 Mega GKE clusters and nodepools are included in the Cluster Toolkit.
If you would rather use the gcloud command, construct a GKE cluster and nodepool with A3/A3 Mega VMs by following the detailed instructions in this guide.
For DWS Flex, you can instead create a DWS-enabled node pool with queued provisioning using gcloud.
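The sketch below is illustrative rather than definitive: the node pool name, cluster name, location, node caps, and machine/accelerator types are placeholders, and the exact flag set for DWS queued provisioning can vary with your GKE version, so cross-check it against the DWS documentation:

gcloud container node-pools create dws-a3mega-pool \
    --cluster=CLUSTER_NAME \
    --location=us-east4 \
    --machine-type=a3-megagpu-8g \
    --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=latest \
    --enable-queued-provisioning \
    --enable-autoscaling \
    --num-nodes=0 \
    --total-max-nodes=2 \
    --location-policy=ANY \
    --reservation-affinity=none \
    --no-enable-autorepair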
Enable GPUDirect communication with A3 (TCPX) / A3 Mega (TCPXO) and perform an initial benchmark test
To run your first set of benchmark tests, follow these instructions to install the GPUDirect for TCPX/TCPXO libraries, set up NCCL, and launch a test workload.
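As an illustration, once the test pods from that guide are running, the allgather benchmark between two hosts is typically launched with kubectl exec; the pod, container, and script names below are placeholders that depend on the test workload manifest you deployed:

# Launch the NCCL allgather benchmark from the first test pod against two hosts
kubectl exec --stdin --tty --container=nccl-test nccl-test-host-1 -- \
    /scripts/allgather.sh nccl-host-1 nccl-host-2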
Verify the results of two A3 Mega nodes’ allgather benchmark tests:
       size         count    type   redop    root     time   algbw   busbw
        (B)    (elements)                             (us)  (GB/s)  (GB/s)
(result rows for each message size follow in the actual output)
In the benchmark output table above, the first column shows the message size, and the algbw and busbw columns on the right show per-GPU bandwidth. To estimate cross-node bandwidth, use the in-place/out-of-place busbw values at the largest message size. For A3 Mega nodes with 8 NVIDIA H100 GPUs and 8 NICs, expect a range of 185-190 GB/s per GPU, which corresponds to close to 1,600 Gbps of cross-node network bandwidth.
To make sure that all of the nodes are healthy and that your cross-node network performance falls within a reasonable range, you can increase the number of nodes in the NCCL tests from two to 8, 16, 32, etc.
Set up a batch job for distributed training
You can deploy distributed HPC workloads (such as MPI) and AI/ML training workloads (such as PyTorch, JAX, and TensorFlow) on Kubernetes using JobSet, a Kubernetes-native API for managing a group of Kubernetes Jobs as a single unit.
The example below shows a JobSet YAML manifest for A3 with GPUDirect-TCPX, which consists of:
a. Important JobSet configuration components
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch
spec:
  replicatedJobs:
  - name: workers
    template:
      spec:
        parallelism: 2    # number of nodes
        completions: 2    # number of nodes
        backoffLimit: 0
        template:
          metadata:
            annotations:
              gke-gcsfuse/volumes: "true"
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-h100-80gb
            ...
b. PyTorch main container and training job settings
c. RxDM container, gcsfuse, TCPX (A3 High), and TCPXO (A3 Mega)
d. Environment variables for NCCL (see the example below)
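For reference, item (d) usually amounts to exporting NCCL tuning variables in the main container's environment. The values below are a generic, hedged example: NCCL_DEBUG, NCCL_SOCKET_IFNAME, and the other variables shown are standard NCCL settings, while the complete TCPX/TCPXO-specific variable list comes from the GPUDirect setup guide referenced earlier:

# Generic NCCL settings commonly used alongside GPUDirect-TCPX(O); the
# plugin-specific variables are defined in the GPUDirect setup guide.
export NCCL_DEBUG=INFO                 # verbose NCCL logging for initial runs
export NCCL_SOCKET_IFNAME=eth0         # control-plane network interface
export NCCL_CROSS_NIC=0                # keep traffic on the NIC paired with each GPU
export NCCL_NET_GDR_LEVEL=PIX          # prefer direct GPU-to-NIC paths where available
export NCCL_P2P_NET_CHUNKSIZE=524288   # larger chunk size for high-bandwidth networks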
Option 2: Slurm via Cluster Toolkit
Slurm is a popular HPC job scheduler. It provides a strong solution for LLM training orchestration with familiar semantics, and researchers in both academia and industry use it. Cluster Toolkit, formerly known as Cloud HPC Toolkit, is open-source software that makes it easier to deploy HPC, AI, and ML workloads on Google Cloud, and it supports Slurm on Google Cloud. It is designed to be highly extensible and customisable, meeting the deployment needs of a wide range of use cases, including setting up infrastructure for large-scale LLM training.
Provisioning A3-high and A3-mega clusters
Install Cluster Toolkit by following the public documentation’s configuration guidelines. Note some of the requirements, such as the supported versions of Terraform, Packer, and Go.
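At a high level, installation amounts to cloning the repository and building the toolkit binary; the sketch below assumes the public GitHub repository and a make-based build (the binary is named ghpc in older releases and gcluster in newer ones, so follow the documentation for your version):

# Clone the Cluster Toolkit repository and build the toolkit binary
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit
make

# Verify the build (use ./ghpc --version on older releases)
./gcluster --version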
Once you have successfully installed the Cluster Toolkit and cloned the GitHub repository, go to the examples/machine-learning blueprints directory. The a3-highgpu-8g and a3-megagpu-8g folders are used to deploy H100 clusters based on the A3-series machine shapes.
Google Cloud Cluster Toolkit blueprints are Infrastructure as Code (IaC) documents, similar in principle to Terraform or other IaC tools, that describe the infrastructure you want to deploy. Three primary files govern the deployment of the a3-megagpu-8g blueprint:
- slurm-a3mega-base.yaml sets up the required VPC networks and the Filestore instance that provides a shared home filesystem on the cluster nodes.
- slurm-a3mega-image.yaml creates the Compute Engine image instance that Slurm uses to provision nodes based on the cluster's definition.
- slurm-a3mega-cluster.yaml sets up the Slurm controller (the primary orchestrator for Slurm jobs), the Slurm login node (a host used for job submission), and the a3mega partition (the cluster's worker nodes).
While you can modify each of the blueprint's components if necessary, you can get started simply by entering the information for your working environment in the deployment-base.yaml and deployment-image-cluster.yaml files.
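With the deployment files filled in, the blueprints are deployed with the toolkit binary in three stages. Treat the commands below as an illustrative sketch, since the exact invocation (and the ghpc vs. gcluster binary name) varies between toolkit versions and is spelled out in the a3-megagpu-8g instructions:

# 1. Base infrastructure: VPC networks and the Filestore home filesystem
./gcluster deploy -d deployment-base.yaml slurm-a3mega-base.yaml

# 2. Build the Compute Engine image used by the Slurm nodes
./gcluster deploy -d deployment-image-cluster.yaml slurm-a3mega-image.yaml

# 3. Provision the Slurm controller, login node, and a3mega partition
./gcluster deploy -d deployment-image-cluster.yaml slurm-a3mega-cluster.yaml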
Enable GPUDirect-TCPXO optimized NCCL communication
To enable GPUDirect-TCPXO for optimal NCCL communication on the GPU networks after the Slurm cluster has been built, follow this guide. Build and compile the NCCL tests to verify the environment and make sure the TCPXO plugin is loading correctly. After that, execute sbatch run-nccl-tests.sh from the login node, being sure to adjust the script’s node count to correspond with the number of nodes in your cluster. Using the GPUs and nodes specified in the script, this does a Distributed training all_gather test.
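A minimal sketch of that flow from the login node is shown below; the script name comes from the guide, and the node count is whatever matches your cluster:

# Adjust the node count in the script (e.g. the #SBATCH --nodes line) to match
# your cluster, then submit the NCCL all_gather benchmark
sbatch run-nccl-tests.sh

# Monitor the job and inspect the results once it completes
squeue
tail -f slurm-*.out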
When operating as expected, the NCCL tests should report high bandwidth throughput at the larger message sizes. A frequently used performance metric is the busbw value in GB/s from the last rows of the output table, which show the 4 GB and 8 GB message sizes. A cluster with TCPXO enabled should report approximately 190 GB/s busbw throughput. For additional information about these measurements, see the performance page in the NVIDIA nccl-tests repository.
Run a NeMo training workload
Use these instructions to execute a sample NeMo Framework Pre-Training task as part of this NeMo training tutorial:
Step 1:
- Builds a container derived from the NeMo Framework image with the required TCPXO environment variables included.
- Copies the framework launcher scripts and other auxiliary files into your working directory via submitted Slurm jobs.
Step 2:
- Installs the NeMo Framework Python package requirements and creates a Python virtual environment.
Step 3:
- Runs ten steps of distributed training of a 5B-parameter GPT-3 model across eight nodes, using mock data as the input (an illustrative launcher invocation is sketched below).
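For orientation, the NeMo Framework launcher is driven by Hydra-style configuration overrides; the sketch below only illustrates the kind of invocation Step 3 wraps, with option names and values treated as placeholders that the launcher scripts copied in Step 1 actually define:

# Illustrative NeMo launcher invocation: 5B GPT model, 8 nodes, 10 steps, mock data
python3 main.py \
    stages=[training] \
    training=gpt3/5b \
    training.trainer.num_nodes=8 \
    training.trainer.max_steps=10 \
    training.model.data.data_impl=mock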
Option 3: Vertex AI
Vertex AI Model Garden and the custom training job service are a compelling choice for organisations looking for a managed infrastructure experience and access to top open models such as Llama 3.1 and Mistral. By providing end-to-end ML platform operations and removing most of the orchestration effort, this fully managed solution frees you up to concentrate on model building and experimentation. The process is made even simpler by Vertex AI's end-to-end training support, which provides an integrated workflow from data preparation to deployment.
Let’s examine how to use Vertex for single- or multi-node fine-tuning and training workloads.
Single-node multi-GPU fine-tuning/training on Vertex
This notebook shows how to use the Vertex AI SDK to fine-tune and deploy Llama 3.1 models. The examples in the notebook all use parameter-efficient fine-tuning (PEFT) with Low-Rank Adaptation (LoRA) to lower training and storage costs. LoRA is a PEFT technique in which pretrained model weights are frozen and, during fine-tuning, rank decomposition matrices that represent the change in model weights are trained.
Distributed multi-node Vertex AI training and fine-tuning
Examples for starting multi-node distributed training on an A3 Mega (8 x NVIDIA H100) on Vertex can be found in this Vertex sample training repository.
The NeMo example demonstrates pre-training, continued pre-training, and supervised fine-tuning (SFT). NeMo also delivers optimised training and is widely used for benchmarking AI accelerators; you can use the reported metrics, such as epoch time and step time, as a benchmark. Because NeMo runs on most NVIDIA GPU types, it can be useful for evaluating different AI chips for a given task. Read on to find out how to run the example on Vertex using A3 Mega node types.
The job configuration arguments "${TRANSFER_MODEL_CMD} ${LAUNCH_CMD}" embed content from the job training script, including the NCCL environment variables for A3 Mega, while the Vertex CustomJob runs the additional PyTorch launch commands.
Use this Dockerfile to build the custom training container image referenced by the "imageUri" argument in vertex-payload.json.
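Putting the pieces together, a typical flow is to build and push the container image, point the "imageUri" field in vertex-payload.json at it, and then create the custom job. The image path, region, and job name below are placeholders, and the sample repository may submit the JSON payload directly through the REST API instead of gcloud:

# Build the training image from the provided Dockerfile and push it to Artifact Registry
docker build -t us-docker.pkg.dev/PROJECT_ID/REPO/nemo-a3mega:latest .
docker push us-docker.pkg.dev/PROJECT_ID/REPO/nemo-a3mega:latest

# Submit the multi-node CustomJob described in vertex-payload.json
gcloud ai custom-jobs create \
    --region=us-east4 \
    --display-name=nemo-a3mega-pretrain \
    --config=vertex-payload.json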
DIY enthusiasts: Building custom training environments
Finally, we recognise that many organisations have specific orchestration tools or frameworks they would like to use and prefer a more hands-on approach. If that sounds like you, Google Compute Engine gives you the foundation to build custom training environments. You can design and configure virtual machines (VMs) with the type and quantity of GPUs, CPU, memory, and storage that you want. With this granular control, you can plug in your preferred orchestration tools and optimise your infrastructure for your specific training workloads.
To make this process easier, we offer sample code snippets that show you how to create and manage vanilla A3 Mega instances using the gcloud compute instances create and gcloud compute instances bulk create commands. These resources can help you set up your infrastructure more quickly, whether you need to provision a large-scale cluster or create a single virtual machine.
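For example, creating a single A3 Mega VM and then a small identically configured group might look like the following; the zone, image family, and names are placeholders (A3 machine types come with their GPUs attached, so only the machine type needs to be specified):

# Create a single A3 Mega instance
gcloud compute instances create a3mega-vm-1 \
    --zone=us-east4-a \
    --machine-type=a3-megagpu-8g \
    --image-family=common-cu121 \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE

# Create several identically configured instances in one request
gcloud compute instances bulk create \
    --name-pattern="a3mega-vm-#" \
    --count=2 \
    --zone=us-east4-a \
    --machine-type=a3-megagpu-8g \
    --image-family=common-cu121 \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE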