Trillium, Google’s sixth-generation TPU, is now generally available
The rise of large-scale AI models capable of processing multiple modalities, such as text and images, presents a distinct infrastructure challenge. Training, fine-tuning, and serving these models efficiently requires massive amounts of processing power and specialized hardware. To meet the growing demands of AI workloads, Google began building Tensor Processing Units (TPUs), its custom AI accelerators, more than a decade ago, helping pave the way for multimodal AI.
Earlier this year, Google unveiled Trillium, its sixth-generation and most capable TPU to date. It is now generally available to Google Cloud customers.
Gemini 2.0, Google’s most advanced AI model to date, was trained on Trillium TPUs, and now businesses and startups can take advantage of the same powerful, efficient, and sustainable infrastructure.
Trillium TPU is a key component of Google Cloud’s AI Hypercomputer, a groundbreaking supercomputer architecture that combines performance-optimized hardware, open software, leading ML frameworks, and flexible consumption models. As part of Trillium’s general availability, Google is also making key improvements to the AI Hypercomputer’s open software layer, including optimizations to the XLA compiler and popular frameworks such as JAX, PyTorch, and TensorFlow, to deliver leading price-performance at scale across AI training, tuning, and serving.
In addition, features such as host offloading, which uses the large amount of host DRAM alongside High Bandwidth Memory (HBM), provide next-level efficiency. With an unprecedented deployment of over 100,000 Trillium chips per Jupiter network fabric and 13 Petabits/sec of bisection bandwidth, the AI Hypercomputer can scale a single distributed training job to hundreds of thousands of accelerators.
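To make that software layer concrete, here is a minimal JAX sketch of a jit-compiled computation: XLA compiles the traced function into a single fused program for whichever accelerator is attached, such as a Trillium TPU. The function, shapes, and values below are purely illustrative and are not taken from Google’s announcement.

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the attached accelerator (e.g., a TPU)
def feed_forward(x, w1, w2):
    # A toy two-layer block; XLA fuses the matmuls and the GELU into one program.
    return jax.nn.gelu(x @ w1) @ w2

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (8, 1024))
w1 = jax.random.normal(key, (1024, 4096))
w2 = jax.random.normal(key, (4096, 1024))

print(feed_forward(x, w1, w2).shape)  # (8, 1024)
print(jax.devices())                  # lists the TPU cores when run on a TPU VM
```

The same code runs unchanged on CPU, GPU, or TPU; the XLA compiler is what maps it onto the underlying hardware.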
The following are some of the main advantages that Trillium offers over the previous generation:
- More than 4x improvement in training performance
- Up to 3x improvement in inference throughput
- 67% improvement in energy efficiency
- An impressive 4.7x increase in peak compute performance per chip
- Double the High Bandwidth Memory (HBM) capacity
- Double the Interchip Interconnect (ICI) bandwidth
- 100,000 Trillium chips in a single Jupiter network fabric
- Up to 2.5x improvement in training performance per dollar and up to 1.4x improvement in inference performance per dollar
With these improvements, Trillium excels at a wide range of AI workloads, including:
- Scaling AI training workloads
- Training LLMs with Mixture of Experts (MoE) and dense models
- Inference performance and collection scheduling
- Embedding-intensive models
- Delivering training and inference price-performance
Let’s examine how Trillium performs for each of these workloads.
Scaling AI training workloads
Training large models such as Gemini 2.0 requires enormous amounts of data and computation. Trillium’s near-linear scaling capabilities allow these models to be trained dramatically faster by distributing the workload effectively and evenly across numerous Trillium hosts, which are connected through a high-speed inter-chip interconnect within a 256-chip pod and Google’s state-of-the-art Jupiter data center networking. This is enabled by TPU multislice, a full-stack technology for large-scale training, and is further optimized by Titanium, a system of dynamic data-center-wide offloads that spans everything from host adapters to the network fabric.
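As a rough illustration of the data-parallel pattern that multislice scales up, the sketch below uses JAX to replicate a training step across the local chips and all-reduce the gradients. The model, batch shapes, and learning rate are hypothetical; this shows the general pattern, not Google’s training stack.

```python
import functools
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    preds = batch["x"] @ params
    return jnp.mean((preds - batch["y"]) ** 2)

@functools.partial(jax.pmap, axis_name="data")  # one replica per local chip
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # All-reduce: average gradients across the data-parallel replicas.
    grads = jax.lax.pmean(grads, axis_name="data")
    return params - 1e-3 * grads

n_dev = jax.local_device_count()
params = jnp.tile(jnp.zeros((128, 1)), (n_dev, 1, 1))  # weights replicated per device
batch = {
    "x": jnp.ones((n_dev, 32, 128)),  # each device gets its own shard of the batch
    "y": jnp.ones((n_dev, 32, 1)),
}
params = train_step(params, batch)
```

Across hosts and pods, the same pattern extends over the ICI and the Jupiter network, which is where near-linear scaling efficiency matters.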
When pre-training gpt3-175b, Trillium achieves 99% scaling efficiency across a deployment of 12 pods (3,072 chips) and 94% scaling efficiency across 24 pods (6,144 chips), even when operating across a data-center network.
Although the graph above uses a 4-slice Trillium-256 chip pod as the baseline, using a 1-slice Trillium-256 chip pod as the baseline and scaling up to 24 pods still yields over 90% scaling efficiency.
Google’s studies show that when training the Llama-2-70B model, Trillium achieves near-linear scaling from a 4-slice Trillium-256 chip pod to a 36-slice Trillium-256 chip pod at 99% scaling efficiency.
Trillium TPUs also scale noticeably more efficiently than previous generations. Google’s experiments show that Trillium achieves 99% scaling efficiency at 12-pod scale, as seen in the graph below, compared with a Cloud TPU v5p cluster of the same scale (by total peak FLOPs).
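For context, scaling efficiency in these comparisons is simply measured throughput divided by what perfect linear scaling from the baseline would predict. A small sketch with hypothetical placeholder numbers (not Google’s measurements):

```python
def scaling_efficiency(throughput_n, n_pods, throughput_base, base_pods=1):
    """Measured throughput at n_pods relative to ideal linear scaling from the baseline."""
    ideal = throughput_base * (n_pods / base_pods)
    return throughput_n / ideal

# Hypothetical example: a 12-pod run delivering 11.88x the single-pod throughput
print(scaling_efficiency(11.88, 12, 1.0))  # 0.99 -> 99% scaling efficiency
```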
Training LLMs with Mixture of Experts (MoE) and dense models
LLMs such as Gemini, with their billions of parameters, are inherently powerful and complex. Training such dense LLMs requires massive processing resources combined with co-designed software optimizations. Trillium delivers up to 4x faster training for dense LLMs such as Llama-2-70b and gpt3-175b compared to the previous-generation Cloud TPU v5e.
In addition to dense LLMs, a growing share of LLMs are trained with a Mixture of Experts (MoE) architecture, which combines multiple “expert” neural networks, each specializing in a different facet of an AI task. Managing and coordinating those experts during training adds complexity compared to training a single monolithic model. Trillium delivers up to 3.8x faster training for MoE models than the previous-generation Cloud TPU v5e.
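To make the MoE idea concrete, here is a toy top-1 expert-routing layer in JAX. It only illustrates the routing pattern described above; the layer sizes and the dense run-every-expert formulation are simplifications for readability, not the architecture of Gemini or of any production MoE system.

```python
import jax
import jax.numpy as jnp

def moe_layer(x, gate_w, expert_w):
    """Toy top-1 Mixture-of-Experts layer.

    x:        (tokens, d_model)
    gate_w:   (d_model, n_experts)           router ("gate") weights
    expert_w: (n_experts, d_model, d_model)  one weight matrix per expert
    """
    gate_logits = x @ gate_w                              # (tokens, n_experts)
    gate_prob = jax.nn.softmax(gate_logits, axis=-1)
    expert_idx = jnp.argmax(gate_logits, axis=-1)         # top-1 expert per token
    one_hot = jax.nn.one_hot(expert_idx, expert_w.shape[0])

    # Dense formulation for clarity: run every expert, then select per token.
    all_out = jnp.einsum("td,edf->etf", x, expert_w)      # (experts, tokens, d_model)
    picked = jnp.einsum("te,etf->tf", one_hot, all_out)   # (tokens, d_model)
    top_prob = jnp.max(gate_prob, axis=-1, keepdims=True)
    return picked * top_prob                              # scale by router probability

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (16, 64))
gate_w = jax.random.normal(key, (64, 4))
expert_w = jax.random.normal(key, (4, 64, 64))
print(moe_layer(x, gate_w, expert_w).shape)  # (16, 64)
```

Production MoE systems dispatch each token only to its chosen expert rather than running all of them, which is exactly the coordination overhead that makes MoE training harder to manage than dense training.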
In addition, Trillium TPU provides three times the host dynamic random-access memory (DRAM) of Cloud TPU v5e. Offloading some of the computation to the host helps maximize performance and goodput at scale. When training the Llama-3.1-405B model, Trillium’s host-offloading capabilities deliver a performance gain of more than 50%, as measured by Model FLOPs Utilization (MFU).
Inference performance and collection scheduling
The growing importance of multi-step reasoning at inference time requires accelerators that can efficiently handle the increased computational demands. Trillium delivers significant improvements for inference workloads, enabling faster and more efficient deployment of AI models. It offers Google’s best TPU inference performance for image diffusion and dense LLMs. Google’s testing shows that, relative to Cloud TPU v5e, Stable Diffusion XL achieves more than 3x higher inference throughput (images per second) and Llama2-70B achieves nearly 2x higher inference throughput (tokens per second).
Trillium is Google’s highest-performing TPU for both offline and server inference use cases. The graph below shows a 3.1x increase in relative throughput (images per second) for offline inference and a 2.9x increase for server inference with Stable Diffusion XL, compared to Cloud TPU v5e.
Along with improved performance, Trillium introduces a new collection scheduling capability. When a collection contains multiple replicas, this feature allows Google’s scheduling systems to make intelligent job-scheduling decisions that improve the overall availability and efficiency of inference workloads. Google Kubernetes Engine (GKE) is one of the ways it provides to manage multiple TPU slices running a single-host or multi-host inference workload; grouping these slices into a collection makes it easy to adjust the number of replicas to match demand.
Embedding-intensive models
Trillium adds third-generation SparseCore, which doubles the performance of embedding-intensive models and improves DLRM DCNv2 performance by 5x.
SparseCores are dataflow processors that provide a more flexible architectural foundation for embedding-heavy workloads. Trillium’s third-generation SparseCore excels at accelerating dynamic and data-dependent operations such as scatter-gather, sparse segment sum, and partitioning.
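Those operations map onto familiar embedding primitives. Below is a minimal sketch in plain JAX of an embedding lookup (a gather) followed by a sparse segment sum; the table size and ids are illustrative only, and this is ordinary JAX code rather than the SparseCore programming interface.

```python
import jax
import jax.numpy as jnp

# Hypothetical embedding table: 10,000 ids, 32-dimensional vectors.
table = jax.random.normal(jax.random.PRNGKey(0), (10_000, 32))

# A ragged batch of feature ids, flattened, plus which example each id belongs to.
ids = jnp.array([17, 42, 42, 7, 9913, 3])
segment_ids = jnp.array([0, 0, 0, 1, 1, 2])  # three examples in the batch

vectors = table[ids]                                                # gather (lookup)
pooled = jax.ops.segment_sum(vectors, segment_ids, num_segments=3)  # sparse segment sum
print(pooled.shape)  # (3, 32): one pooled embedding per example
```

Because the number of ids per example varies, these operations are dynamic and data-dependent, which is what a dataflow design like SparseCore is built to accelerate.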
Delivering training and inference price-performance
Trillium is designed to maximize performance per dollar in addition to delivering the absolute performance and scale required to train some of the largest AI workloads in the world. Today, Trillium delivers up to 2.1x and 2.5x better training performance per dollar than Cloud TPU v5e and v5p, respectively, for dense LLMs such as Llama2-70b and Llama3.1-405b.
Trillium also excels at processing large models in parallel efficiently. It is designed to let researchers and developers serve capable, efficient image models at a significantly lower cost than before. For SDXL, generating 1,000 images on Trillium costs 27% less than on Cloud TPU v5e for offline inference and 22% less for server inference.
Advancing AI innovation to new heights
Trillium TPU marks a significant step forward for Google Cloud’s AI infrastructure, delivering impressive performance, scalability, and efficiency across a wide range of AI workloads. By scaling to hundreds of thousands of chips with world-class co-designed software, Trillium lets you achieve breakthroughs faster and deliver better AI solutions. Its outstanding price-performance also makes it a cost-effective choice for organizations looking to maximize the return on their AI investments. As the AI landscape continues to evolve, Trillium reflects Google Cloud’s commitment to providing cutting-edge infrastructure that helps businesses unlock the full potential of AI.