ML Productivity Goodput
This is one of the most exciting times for computer scientists. Large-scale generative models have entered research and human-technology interaction, including software design, education, and creativity. As ever more computation becomes available, the performance and capabilities of these foundation models continue to advance; the scale of that compute is typically indicated by the number of floating-point operations needed to train a model.
Larger and more capable compute clusters enable this rapid increase in compute scale. However, as a cluster grows (measured by the number of accelerators or nodes), the failure rate of the overall system rises roughly linearly, so the mean time between failures (MTBF) shrinks correspondingly. Infrastructure cost also rises linearly with scale, which means the total cost of failures grows roughly quadratically as the compute cluster gets larger.
The true end-to-end efficiency of the machine learning system is therefore crucial to sustainable large-scale training: left unchecked, inefficiency can make scaling beyond a certain point impractical, whereas a well-designed system opens up new possibilities at larger scale. To quantify this efficiency, Google introduces a new metric called ML Productivity Goodput in this blog post, along with techniques to maximise it and an API that you can incorporate into your projects to measure and track Goodput.
What is Goodput
ML Productivity Goodput is made up of three goodput metrics: Scheduling Goodput, Runtime Goodput, and Programme Goodput.
Scheduling Goodput measures the fraction of time that all the resources needed to run the training job are available. In on-demand or preemptible consumption models, this factor is less than 100% due to possible stockouts, so to maximise your Scheduling Goodput, Google advises reserving your resources.
Runtime Goodput measures the fraction of total time spent making useful training progress while all training resources are available. Maximising it requires careful engineering; the sections below cover how to measure and optimise Runtime Goodput for large-scale training jobs on Google Cloud.
Programme Goodput measures the fraction of peak hardware performance that the training job actually extracts: the model training throughput as a percentage of the system’s peak throughput, also known as Model FLOP Utilisation (MFU) or Effective Model FLOP Utilisation. Thoughtful distribution strategies and effective compute-communication overlap are two of the most important factors that determine Programme Goodput when scaling to the required number of accelerators.
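Taken together, these three fractions compose multiplicatively; as a sketch of the relationship (treating the three factors as independent multipliers, which is how they are used in the rest of this post):

$$\text{ML Productivity Goodput} = \text{Scheduling Goodput} \times \text{Runtime Goodput} \times \text{Programme Goodput}$$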
Google’s AI Hypercomputer
AI Hypercomputer is a supercomputing architecture designed to increase machine learning (ML) productivity for AI training, tuning, and serving workloads. It combines a carefully chosen set of capabilities created through systems-level codesign. The following diagram shows how the various components of ML Productivity Goodput map onto AI Hypercomputer:
As the diagram shows, AI Hypercomputer encodes specific capabilities aimed at optimising Programme Goodput and Runtime Goodput across the framework, runtime, and orchestration layers. The rest of this post focuses on the AI Hypercomputer components that can help you make the most of them.
Understanding Runtime Goodput
The fundamental component of Runtime Goodput is the number of useful training steps completed within a given time window. Given an assumed checkpointing interval, the time to reschedule the slice, and the time to resume training, Runtime Goodput can be estimated analytically.
This analytical model also identifies the three factors to minimise in order to maximise Runtime Goodput (a rough sketch of the model follows this list):
- Time lost since the last checkpoint when a failure occurs.
- Time to resume training once resources are available again.
- Time to reschedule the slice; this is covered under Scheduling Goodput.
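As a rough sketch of that model (the notation here is assumed for illustration, not taken from the original formula): let $T_{\text{int}}$ be the mean time between interruptions, $T_{\text{ckpt}}$ the checkpointing interval, $t_{rs}$ the time to reschedule the slice, and $t_{rm}$ the time to resume training. Each interruption loses, on average, half a checkpoint interval of progress plus the reschedule and resume time, so:

$$\text{Runtime Goodput} \approx \frac{T_{\text{int}} - \left(\tfrac{T_{\text{ckpt}}}{2} + t_{rs} + t_{rm}\right)}{T_{\text{int}}}$$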
Introducing the Goodput Measurement API
Measuring something is the first step towards improving it. The Goodput Measurement API, shipped as a Python package, lets you instrument Scheduling Goodput and Runtime Goodput measurement into your code. It offers methods to report your training step progress to Cloud Logging and to read that progress back, allowing you to measure and track Runtime Goodput.
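To illustrate the record-then-compute pattern the API follows, here is a minimal self-contained Python sketch. It is not the real package's API (which writes to and reads from Cloud Logging); the class and method names here are purely illustrative.

```python
import time


class StepProgressRecorder:
    """Illustrative recorder: tracks step progress so Runtime Goodput can be derived.

    The real Goodput Measurement API reports these records to Cloud Logging and
    reads them back; this sketch keeps them in memory to show the pattern.
    """

    def __init__(self):
        self.job_start = time.time()
        self.productive_seconds = 0.0  # time spent making useful training progress

    def record_step(self, step, step_duration_s):
        self.productive_seconds += step_duration_s

    def runtime_goodput(self):
        # Fraction of total wall-clock time spent on useful training steps.
        total = time.time() - self.job_start
        return self.productive_seconds / total if total > 0 else 0.0


# Usage sketch inside a training loop.
recorder = StepProgressRecorder()
for step in range(5):
    t0 = time.time()
    time.sleep(0.01)  # stand-in for a real training step
    recorder.record_step(step, time.time() - t0)
print(f"Runtime Goodput ~ {recorder.runtime_goodput():.2%}")
```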
Optimising Scheduling Goodput
Scheduling Goodput depends on the availability of every resource needed to carry out the training. To maximise it for short-term usage, Google introduced a DWS (Dynamic Workload Scheduler) calendar mode that reserves compute resources for the training job. Google also advises employing “hot spares” to reduce the time needed to reschedule resources when recovering from an interruption. Using reserved resources and hot spares together increases Scheduling Goodput.
Optimising Runtime Goodput
AI Hypercomputer provides the following recommended techniques to optimise Runtime Goodput:
- Enable auto-checkpointing.
- Use container preloading, which Google Kubernetes Engine offers.
- Use a persistent compilation cache.
Auto-checkpointing
With auto-checkpointing, you can trigger a checkpoint when a SIGTERM signal is received, indicating that the training job is about to be interrupted. During maintenance events or defragmentation-related preemption, auto-checkpointing helps minimise the work lost since the last checkpoint.
Both MaxText, a reference implementation for high-performance training and serving on Google Cloud, and Orbax provide example implementations of auto-checkpointing.
Auto-checkpointing is available for training on both Cloud TPUs and GPUs, with GKE-based as well as non-GKE training orchestrators. A minimal sketch of the SIGTERM-triggered pattern follows.
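This sketch shows the general shape of the pattern only; run_train_step and save_checkpoint are hypothetical stand-ins for your framework's step function and checkpointing call (for example, an Orbax checkpoint manager in JAX).

```python
import signal
import threading
import time

# Set when the platform signals that the job is about to be interrupted.
preemption_requested = threading.Event()


def _handle_sigterm(signum, frame):
    # Mark the pending interruption so the training loop can write one
    # final checkpoint before the node is reclaimed.
    preemption_requested.set()


signal.signal(signal.SIGTERM, _handle_sigterm)


def run_train_step(step):
    time.sleep(0.01)  # placeholder for the real training step


def save_checkpoint(step):
    print(f"checkpoint saved at step {step}")  # placeholder checkpoint call


def training_loop(num_steps, checkpoint_every=1000):
    for step in range(num_steps):
        run_train_step(step)
        if step % checkpoint_every == 0 or preemption_requested.is_set():
            save_checkpoint(step)
        if preemption_requested.is_set():
            break  # exit cleanly once the final checkpoint is written


training_loop(num_steps=10, checkpoint_every=5)
```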
Container preloading
After a failure or other disruption, resuming training quickly is critical to achieving the highest possible Goodput. Google therefore recommends Google Kubernetes Engine (GKE), which enables container and model preloading from a secondary boot disk; this capability is currently available in preview. Preloading lets a workload, particularly one with a large container image, start up very quickly, so little training time is lost after a failure or other disruption. This matters because, for large images, pulling a container image from a registry or object storage can take up a significant share of the time needed to resume a job.
Preloading lets you designate a secondary boot disk containing the required container image when you create a node pool or use auto-provisioning. As soon as GKE replaces the failed node, the required container images are already available, allowing you to resume training quickly.
Google found that pulling a 16GB container image with container preloading was roughly 29x faster than the baseline (an image pull from the container registry).
Persistent compilation cache
Just-in-time compilation and system-aware optimisations are a large part of what makes XLA compiler-based computation stacks effective. In most efficient training loops, computation graphs are compiled only once and then run repeatedly with different input data.
A compilation cache avoids recompilation as long as the graph shapes don’t change. However, this cache can be lost after a failure or interruption, which slows down the training resumption process and hurts Runtime Goodput. A persistent compilation cache addresses this by letting users save the compilation cache to Cloud Storage so that it survives restart events.
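As an example of what this looks like in practice, here is a minimal JAX sketch that points the compilation cache at a Cloud Storage bucket (the bucket path is a placeholder, and the exact configuration option and any extra dependencies for gs:// paths depend on your JAX version, so treat this as an assumption to verify against the JAX documentation):

```python
import jax
import jax.numpy as jnp

# Persist XLA compilation results across job restarts by writing the cache to
# a Cloud Storage bucket (placeholder path). On restart, matching graphs are
# loaded from the cache instead of being recompiled from scratch.
jax.config.update("jax_compilation_cache_dir", "gs://your-bucket/jax-compilation-cache")


@jax.jit
def train_step(x):
    return jnp.tanh(x) * 2.0  # stand-in for a real training step


# The first call compiles (and caches) the graph; subsequent runs of the job
# with the same shapes reuse the persisted cache.
print(train_step(jnp.ones((4,))))
```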
Moreover, recent improvements to GKE, the recommended orchestration layer for AI Hypercomputer, have increased job-scheduling throughput by 3x, which helps decrease the time to resume (trm).
Optimising Programme Goodput
Programme Goodput, or Model FLOP Utilisation, depends on how efficiently the training programme uses the underlying compute. It is influenced by the distribution strategy, effective compute-communication overlap, optimised memory access, and pipeline design.
One of the key elements of AI Hypercomputer is the XLA compiler, which helps you optimise Programme Goodput through built-in optimisations and simple, effective scaling APIs such as GSPMD, which let users express a wide variety of parallelism strategies with ease and take advantage of scale effectively. Three major features have been added recently to help PyTorch/XLA and JAX users get the most out of their programmes.
Custom kernels with XLA
Compiler-driven computation optimisation frequently needs an “escape hatch” that lets users exceed the default performance by writing more efficient implementations of complex computation blocks from basic primitives. Pallas is the library designed to support custom kernels for Cloud TPUs and GPUs, and it is compatible with both JAX and PyTorch/XLA. Pallas can be used to write custom kernels such as block-sparse kernels and Flash Attention. The Flash Attention kernel improves Programme Goodput (Model FLOP Utilisation) for longer sequences, with the gains most pronounced at sequence lengths of 4K and above.
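To give a flavour of the programming model, here is a minimal Pallas sketch of an element-wise addition kernel, far simpler than Flash Attention but enough to show how a kernel reads from input refs and writes to an output ref (interpret=True lets the sketch run on CPU; on TPU or GPU you would drop it to get a compiled kernel):

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def add_kernel(x_ref, y_ref, o_ref):
    # Each ref is a block of the corresponding array: read, compute, write back.
    o_ref[...] = x_ref[...] + y_ref[...]


@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
        interpret=True,  # run in interpreter mode so the example works anywhere
    )(x, y)


x = jnp.arange(8, dtype=jnp.float32)
y = jnp.ones(8, dtype=jnp.float32)
print(add(x, y))  # [1. 2. 3. 4. 5. 6. 7. 8.]
```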
Host offload
Accelerator memory is a scarce resource in large-scale model training, so compute cycles are frequently traded for memory through techniques such as activation rematerialisation. Host offload is another technique recently added to the XLA compiler: it uses host DRAM to store activations computed during the forward pass and brings them back for gradient computation during the backward pass. By reducing the number of activation recomputation cycles, host offload improves Programme Goodput.
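For context, here is a minimal JAX sketch of the rematerialisation trade-off that host offload reduces: jax.checkpoint (also known as jax.remat) discards a block's intermediate activations in the forward pass and recomputes them during the backward pass, spending extra FLOPs to save accelerator memory. Host offload instead parks those activations in host DRAM; the XLA-level configuration for that is not shown here.

```python
import jax
import jax.numpy as jnp


def block(x, w):
    # A small stand-in for a transformer block's forward computation.
    return jnp.tanh(x @ w)


# jax.checkpoint tells autodiff not to keep this block's activations; they are
# recomputed on the backward pass, trading compute for accelerator memory.
rematted_block = jax.checkpoint(block)


def loss(x, w):
    return jnp.sum(rematted_block(x, w) ** 2)


x = jnp.ones((4, 8))
w = jnp.full((8, 8), 0.1)
print(jax.grad(loss, argnums=1)(x, w).shape)  # (8, 8)
```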
AQT-Based Int8 Mixed Precision Training
Another technique, Accurate Quantized Training (AQT), maps a subset of the matrix multiplications in the training step to int8, increasing training efficiency, and therefore Programme Goodput, without sacrificing convergence.
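The AQT library handles this inside the framework; purely as a conceptual sketch (and not the AQT API), an int8 matrix multiplication quantises each operand with a per-tensor scale, multiplies in int8 while accumulating in int32, and rescales the result back to floating point:

```python
import numpy as np


def quantize_int8(x):
    # Per-tensor symmetric quantisation: map the largest magnitude to 127.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_matmul(a, b):
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    # Multiply in int8, accumulate in int32, then rescale back to float32.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)


a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
err = np.abs(int8_matmul(a, b) - a @ b).mean()
print(f"mean abs error vs float32 matmul: {err:.4f}")
```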
The following benchmark combines these methods to increase Programme Goodput for a MaxText 128B dense LLM implementation.
Using all three of these strategies together increases Programme Goodput by a cumulative 46% in this benchmark. Improving Programme Goodput is often an iterative process, and the actual gains for a given training task depend on the model architecture and training hyperparameters.
In summary
Large-scale training of generative models enables business value, but as ML training scales, productivity can suffer. This post defined ML Productivity Goodput, a metric for assessing the overall ML productivity of large-scale training jobs, introduced the Goodput Measurement API, and covered the components of AI Hypercomputer that can help you maximise ML Productivity Goodput at scale. With AI Hypercomputer, Google looks forward to helping you maximise your ML productivity at scale.