Google JetStream: High-Performance Inference For LLMs

Learn about Google JetStream, a new engine focused on improving the speed and efficiency of Large Language Model inference. Explore its TPU support and reference implementations for PyTorch and JAX.

With 78% of organizations now in development or production with LLM-based applications, a growing number of businesses across industries, including retail, gaming, code generation, and customer service, are putting generative AI to work. As the number of generative AI applications and users grows, inference systems must be scalable, performant, and easy to use. With AI Hypercomputer, Google Cloud is setting the stage for this next phase of AI’s explosive growth.

At Google Cloud Next 25, Google Cloud shared many updates to AI Hypercomputer’s inference capabilities, including Ironwood, its newest Tensor Processing Unit (TPU) designed specifically for inference, along with software improvements such as simple, performant inference with vLLM on TPU and the newest GKE inference capabilities: GKE Inference Gateway and GKE Inference Quickstart.

Optimizing performance with Google JetStream, the JAX inference engine

Google JetStream is a throughput- and memory-optimized engine for Large Language Model (LLM) inference. Its primary goal is to run LLMs with the best possible performance.

The engine is designed for XLA devices and supports TPUs first; GPU support is anticipated in the future, and pull requests for additions are welcome.
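As a rough illustration of what targeting XLA devices means in practice, here is a minimal JAX sketch that jit-compiles a toy decode step and runs it on whatever XLA backend is available (TPU on a Cloud TPU VM, otherwise CPU). It is not JetStream’s own API; the function and shapes are made up for the example.

```python
import jax
import jax.numpy as jnp

# JetStream targets XLA devices; JAX reports them the same way on TPU, GPU, or CPU.
print("XLA backend:", jax.default_backend())  # e.g. "tpu" on a Cloud TPU VM
print("Devices:", jax.devices())

@jax.jit  # XLA-compile the step once; later calls reuse the compiled program
def toy_decode_step(logits):
    # Stand-in for a single greedy decode step, not JetStream's real decode loop.
    return jnp.argmax(logits, axis=-1)

logits = jnp.zeros((1, 32_000))  # fake vocabulary logits for one sequence
next_token = toy_decode_step(logits)
print("Next token id:", int(next_token[0]))
```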

To make it easier to adopt, Google JetStream provides two reference engine implementations, one for PyTorch models and one for JAX models. The PyTorch implementation has its own repository, google/jetstream-pytorch, while the JAX implementation lives in the google/maxtext repository.

The project also includes documentation covering a range of scenarios: benchmarking, setting up a local standalone environment, serving models such as Gemma on TPUs with Google Kubernetes Engine (GKE), and online inference with either MaxText or PyTorch on v5e Cloud TPU VMs. Tools for load testing, running a local fake server, and testing key components are included as well.
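The load-testing and test utilities themselves live in the JetStream repositories; the sketch below only illustrates the general shape of a throughput measurement against a locally running server. The send_request function is a hypothetical stand-in for whichever client (gRPC or otherwise) the server actually exposes.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    # Hypothetical placeholder: a real load test would call the serving endpoint
    # (for example, the client shipped with JetStream) instead of sleeping.
    time.sleep(0.05)
    return f"completion for: {prompt}"

def measure_throughput(prompts, concurrency: int = 8) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed  # requests per second

if __name__ == "__main__":
    fake_prompts = [f"prompt {i}" for i in range(64)]
    print(f"~{measure_throughput(fake_prompts):.1f} requests/s")
```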

Google JetStream is available on GitHub as an open-source project released under the Apache-2.0 license. The repository is tagged with topics across the ML and cloud computing space, including model-serving, tpu, jax, mlops, large-language-models, llm, llmops, llm-inference, gpu, inference, pytorch, transformer, llama, gpt, and gemma.

To optimize speed and lower inference costs, Google Cloud provides additional options for serving LLMs on TPU: it continues to improve JetStream and has introduced TPU support for vLLM, a popular, fast, and efficient library for serving LLMs. Both vLLM on TPU and Google JetStream offer exceptional price-performance with low-latency, high-throughput inference, along with community support from Google AI specialists and open-source contributors.
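For context, vLLM’s offline Python API looks roughly like the sketch below. The model name is only an example, and the TPU-specific installation and configuration that vLLM on TPU requires are omitted; this is a generic vLLM usage sketch, not a TPU setup guide.

```python
# Requires a vLLM installation with a supported backend (TPU builds need extra setup).
from vllm import LLM, SamplingParams

# Example model name only; substitute the model you actually want to serve.
llm = LLM(model="google/gemma-2b")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain what an LLM inference engine does."], params)

for out in outputs:
    print(out.outputs[0].text)
```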

Based on the same inference stack that powers Gemini models, JetStream is Google’s open-source, throughput- and memory-optimized inference engine designed specifically for TPUs. Since JetStream debuted in April of last year, Google has invested heavily in further improving its performance across a variety of open models. Using Google JetStream with its reference implementation MaxText, the sixth-generation Trillium TPU currently outperforms TPU v5e in throughput by 2.9x for Llama 2 70B and 2.8x for Mixtral 8x7B.

Google’s Pathways runtime is now integrated into Google JetStream and available to Google Cloud users for the first time. This integration enables multi-host inference and disaggregated serving, two capabilities that are crucial as model sizes grow rapidly and generative AI requirements evolve.

Multi-host inference with Pathways shards the model across several accelerator hosts at serving time, making it possible to serve large models that do not fit on a single host. With multi-host inference, Google JetStream reaches 1,703 tokens/s for Llama 3.1 405B on Trillium, which translates to three times as much inference per dollar compared with TPU v5e.
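Throughput-per-dollar comparisons like this boil down to simple arithmetic: tokens per second, scaled to an hour, divided by the hourly cost of the accelerators. The sketch below shows that calculation using the 1,703 tokens/s figure quoted above; the hourly price is a placeholder, not published TPU pricing.

```python
def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated per dollar of accelerator time."""
    return tokens_per_second * 3600 / dollars_per_hour

# 1,703 tokens/s is the Llama 3.1 405B multi-host figure quoted in the article.
# The hourly cost below is a placeholder, not published Trillium pricing.
trillium_dollars_per_hour = 10.0
print(f"{tokens_per_dollar(1703, trillium_dollars_per_hour):,.0f} tokens per dollar")
```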

Furthermore, Pathways’ disaggregated serving capability lets workloads scale the prefill and decode phases of LLM inference independently and dynamically. This allows resources to be used more effectively and can improve efficiency and performance, particularly for large models. For Llama 2 70B, multi-host disaggregated serving performs nearly three times better for token generation (time-per-output-token, TPOT) and seven times better for prefill (time-to-first-token, TTFT) than interleaving the prefill and decode stages of LLM requests on the same Trillium server.
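One way to picture disaggregated serving: instead of a single pool of servers interleaving prefill and decode, requests flow through a prefill pool and then a decode pool that scale independently. The sketch below is purely conceptual, with hypothetical prefill and decode_step stand-ins rather than Pathways or JetStream APIs.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: str
    kv_cache: object = None               # produced by prefill
    generated: list = field(default_factory=list)

def prefill(req: Request) -> Request:
    # Hypothetical: process the whole prompt once, producing a KV cache.
    req.kv_cache = f"kv({req.prompt})"
    return req

def decode_step(req: Request) -> Request:
    # Hypothetical: generate one token at a time using the KV cache.
    req.generated.append("<tok>")
    return req

# Disaggregated serving keeps separate queues (and, in practice, separate TPU hosts)
# for the compute-bound prefill phase and the memory-bound decode phase, so each pool
# can be sized to its own bottleneck instead of interleaving both on one host.
prefill_queue: Queue = Queue()
decode_queue: Queue = Queue()

prefill_queue.put(Request(prompt="Tell me about TPUs"))

while not prefill_queue.empty():
    decode_queue.put(prefill(prefill_queue.get()))

while not decode_queue.empty():
    req = decode_queue.get()
    for _ in range(4):                     # a few decode iterations for illustration
        req = decode_step(req)
    print(req.prompt, "->", "".join(req.generated))
```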

MaxDiffusion: High-performance diffusion model inference

Beyond LLMs, Trillium also excels at compute-intensive workloads such as image generation. MaxDiffusion provides a collection of reference implementations of various latent diffusion models. In addition to Stable Diffusion inference, MaxDiffusion has been extended to support Flux, which at 12 billion parameters is one of the largest open-source text-to-image models released to date.

As shown in MLPerf 5.0, Trillium now delivers a 3.5x throughput boost in queries per second on Stable Diffusion XL (SDXL) compared with the last performance round for its predecessor, TPU v5e, and throughput has increased by 12% since the MLPerf 4.1 submission.

MaxDiffusion delivers this throughput cost-effectively: on Trillium, generating 1,000 images costs as little as 22 cents, 35% cheaper than on TPU v5e.
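The cost claim reduces to simple arithmetic. The sketch below works it through; the implied TPU v5e figure is a back-of-the-envelope derivation from the two numbers above, not a published price.

```python
trillium_cost_per_1000_images = 0.22  # dollars, figure quoted above
savings_vs_v5e = 0.35                 # "35% cheaper" than TPU v5e

cost_per_image = trillium_cost_per_1000_images / 1000
# Back-of-the-envelope: if Trillium is 35% cheaper, the implied v5e cost is
# 0.22 / (1 - 0.35) ≈ 0.34 dollars per 1,000 images.
implied_v5e_cost = trillium_cost_per_1000_images / (1 - savings_vs_v5e)

print(f"Trillium: ${cost_per_image:.5f} per image")
print(f"Implied TPU v5e: ${implied_v5e_cost:.2f} per 1,000 images")
```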

AI Hypercomputer is powering the age of AI inference

With integrated software frameworks and hardware accelerators, Google’s work in AI inference, spanning hardware innovations such as Google Cloud TPUs and NVIDIA GPUs and software innovations such as Google JetStream, MaxText, and MaxDiffusion, is enabling breakthroughs in AI. Find out more about inference with AI Hypercomputer, then check out the JetStream and MaxDiffusion repositories to get started.
