Gemma On GKE: New features to support open generative AI models
Now is a fantastic moment for businesses using AI to innovate. Their biggest and most powerful AI model, Gemini, was just released by Google. Gemma, a family of modern, lightweight open models derived from the same technology and research as the Gemini models, was then introduced. In comparison to other open models, the Gemma 2B and 7B models perform best-in-class for their size.
They are also pre-trained and come with versions that have been fine-tuned to facilitate research and development. With the release of Gemma and their expanded platform capabilities, they will take the next step towards opening up AI to developers on Google Cloud and making it more visible.
Let’s examine the improvements they introduced to Google Kubernetes Engine (GKE) now to assist you with serving and deploying Gemma on GKE Standard and Autopilot:
Integration with Vertex AI Model Garden, Hugging Face, and Kaggle: As a GKE client, you may begin using Gemma in Vertex AI Model Garden, Hugging Face, or Kaggle. This makes it simple to deploy models to the infrastructure of your choice from the repositories of your choice.
GKE notebook using Google Colab Enterprise: Developers may now deploy and serve Gemma using Google Colab Enterprise if they would rather work on their machine learning project in an IDE-style notebook environment.
A low-latency, dependable, and reasonably priced AI inference stack: They previously revealed JetStream, a large language model (LLM) inference stack on GKE that is very effective and AI-optimized. In addition to JetStream, they have created many AI-optimized inference stacks that are both affordable and performante, supporting Gemma across ML Frameworks (PyTorch, JAX) and powered by Cloud GPUs or Google’s custom-built Tensor Processor Units (TPU).
They released a performance deepdive of Gemma on Google Cloud AI-optimized infrastructure earlier now a days, which is intended for training and servicing workloads related to generative AI.
Now, you can utilise Gemma to create portable, customisable AI apps and deploy them on GKE, regardless of whether you are a developer creating generative AI applications, an ML engineer streamlining generative AI container workloads, or an infrastructure engineer operationalizing these container workloads.
Vertex AI Model Garden, hugging face, and connecting with Kaggle
Their aim is to simplify the process of deploying AI models on GKE, regardless of the source.
Putting a Face Hug
They established a strategic alliance with Hugging Face, one of the go-to places for the AI community, earlier this year to provide data scientists, ML engineers, and developers access to the newest models. With the introduction of the Gemma model card, Hugging Face made it possible for Gemma to be deployed straight to Google Cloud. You may choose to install and serve Gemma on Vertex AI or GKE after selecting the Google Cloud option, which will take you to Vertex Model Garden.
Model Garden Vertex
Gemma now has access to over 130 models in the Vertex AI Model Garden, including open-source models, task-specific models from Google and other sources, and enterprise-ready foundation model APIs.
Kaggle
Developers can browse through thousands of trained, deployment-ready machine learning models in one location with Kaggle. A variety of model versions (PyTorch, FLAX, Transformers, etc.) are available on the Gemma model card on Kaggle, facilitating an end-to-end process for downloading, installing, and managing Gemma on a GKE cluster. Customers of Kaggle may also choose to “Open in Vertex,” which directs them to Vertex Model Garden and gives them the option to deploy Gemma as previously mentioned on Vertex AI or GKE. Gemma’s model page on Kaggle allows you to examine real-world examples that the community has posted using Gemma.
Google Colab Enterprise
Notebooks from Google Colab Enterprise
Through Vertex AI Model Garden, developers, ML engineers, and ML practitioners may now use Google Colab Enterprise notebooks to deploy and serve Gemma on GKE. The pre-populated instructions in the code cells of Colab Enterprise notebooks provide developers, ML engineers, and scientists the freedom to install and perform inference on GKE using an interface of their choice.
Serve Gemma models on infrastructure with AI optimizations
Performance per dollar and cost of service are important considerations when doing inference at scale. With Google Cloud TPUs and GPUs, an AI-optimized infrastructure stack, and high-performance and economical inference, GKE is capable of handling a wide variety of AI workloads.
By smoothly combining TPUs and GPUs, GKE enhances their ML pipelines, enabling us to take use of each device’s advantages for certain jobs while cutting down on latency and inference expenses. For example, they deploy a big text encoder on TPU to handle text prompts effectively in batches. Then, they use GPUs to run their proprietary diffusion model, which uses the word embeddings to produce beautiful visuals. Yoav HaCohen, Ph.D., Head of Lightricks’ Core Generative AI Research Team.
Gemma using TPUs on GKE
The most widely used LLMs are already supported by a number of AI-optimized inference and serving frameworks that now enable Gemma on Google Cloud TPUs, should you want to employ Google Cloud TPU accelerators with your GKE infrastructure. Among them are:
Jet Stream Today
They introduced JetStream(MaxText) and JetStream(PyTorch-XLA), a new inference engine particularly made for LLM inference, to optimise inference performance for PyTorch or JAX LLMs on Google Cloud TPUs. JetStream provides good throughput and latency for LLM inference on Google Cloud TPUs, marking a major improvement in both performance and cost effectiveness. JetStream combines sophisticated optimisation methods including continuous batching, int8 quantization for weights, activations, and KV caching to provide efficiency while optimising throughput and memory utilisation. Google’s suggested TPU inference stack is called JetStream.
Use this guide to get started with JetStream inference for Gemma on GKE and Google Cloud TPUs.
Gemma using GPUs on GKE
The most widely used LLMs are already supported by a number of AI-optimized inference and serving frameworks that now enable Gemma on Google Cloud GPUs, should you want to employ Google Cloud GPU accelerators with your GKE infrastructure.
What is vLLM
To improve serving speed for PyTorch generative AI users, vLLM is an open-source LLM serving system that has undergone extensive optimisation.
Some of the attributes of vLLM include:
- An improved transformer programme using PagedAttention
- Continuous batching to increase serving throughput overall
- Tensor parallelism and distributed serving across several GPUs
- To begin using vLLM for Gemma on GKE and Google Cloud GPUs, follow this tutorial
Text Generation Inference (TGI)
Text creation Inference (TGI), an open-source LLM serving technology developed by Hugging Face, is highly optimised to enable high-performance text generation during LLM installation and serving. Tensor parallelism, continuous batching, and distributed serving over several GPUs are among the features that TGI offers to improve overall serving performance.
Hugging Face Text Generation Inference for Gemma on GKE and Google Cloud GPUs may be used with the help of this tutorial.
Tensor RT-LLM
To improve the inference performance of the newest LLMs, customers utilising Google cloud GPU VMs with NVIDIA Tensor Core GPUs may make use of NVIDIA Tensor RT-LLM, a comprehensive library for compiling and optimising LLMs for inference. Tensor RT-LLM supports features like continuous in-flight batching and paged attention.
This guide will help you build up NVIDIA Tensor Core GPU-powered GPU virtual machines (GKE) and Google Cloud GPU VMs for NVIDIA Triton with Tensor RT LLM backend.
Google Cloud provides a selection of options to meet your needs, whether you’re a developer utilising Gemma to design next-generation AI models or choosing training and serving infrastructure for those models. GKE provides an independent, adaptable, cost-effective, and efficient platform for AI model development that may be used to the creation of subsequent models.