Friday, September 20, 2024

NVIDIA L4 GPUs For Cloud Run AI Inference Applications

Use NVIDIA L4 GPUs to run your AI inference applications on Cloud Run.

L4 GPU Memory

Developers value Cloud Run for its pay-per-use pricing, scale-to-zero behavior, fast autoscaling, and simplicity. Those same advantages matter for real-time inference applications that serve open gen AI models. That is why Google is introducing NVIDIA L4 GPU support for Cloud Run, available in preview today.

This opens up many new use cases for Cloud Run developers:

  • Performing real-time inference with lightweight open models such as Google’s open Gemma (2B/7B) or Meta’s Llama 3 (8B) models to build custom chatbots or summarize documents on the fly, while scaling to handle spiky user traffic.
  • Serving custom fine-tuned gen AI models, such as image generation tailored to your company’s brand, and scaling down when idle to optimize costs.
  • Speeding up compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.

Cloud Run is a fully managed platform that combines the simplicity of serverless with the flexibility of containers, letting you run your code directly on top of Google’s scalable infrastructure and boosting your productivity. With Cloud Run, you can run frontend and backend services, batch jobs, websites and applications, and queue-processing workloads, all without having to manage the underlying infrastructure.

At the same time, many AI inference workloads, especially those serving real-time applications, require GPU acceleration to deliver responsive user experiences. With support for NVIDIA GPUs, you can perform on-demand online AI inference using the LLMs of your choice. With 24 GB of vRAM, you can expect fast token rates for models with up to 9 billion parameters, such as Llama 3.1 (8B), Mistral (7B), and Gemma 2 (9B). And when your app is not in use, the service automatically scales down to zero, so you are not charged for it.

Using Cloud Run with NVIDIA GPUs

Google Cloud now lets you attach one NVIDIA L4 GPU per Cloud Run instance, and you no longer need to reserve GPUs in advance. Cloud Run GPUs are currently available in us-central1 (Iowa) and are expected to come to asia-southeast1 (Singapore) and europe-west4 (Netherlands) before the end of the year.

To deploy a Cloud Run service with NVIDIA GPUs from the command line, add the --gpu=1 flag to specify the number of GPUs and the --gpu-type=nvidia-l4 flag to specify the GPU type. Alternatively, you can do this from the Google Cloud console.
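For example, a minimal sketch of a command-line deployment (the service name and image path are placeholders; the CPU, memory, and throttling settings mirror the sample deployment later in this article):

gcloud beta run deploy my-inference-service \
  --image us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --region us-central1 \
  --cpu 8 --memory 32Gi \
  --no-cpu-throttling \
  --gpu 1 --gpu-type nvidia-l4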

Additionally, with the newly available Cloud Run functions, you can attach a GPU to your functions and easily perform event-driven AI inference.
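As a rough sketch of what that could look like, assuming a function with an entry point named handler in the current directory (the service name and entry point are placeholders, and exact flags may differ while the feature is in preview):

gcloud beta run deploy my-inference-function \
  --source . \
  --function handler \
  --region us-central1 \
  --no-cpu-throttling \
  --gpu 1 --gpu-type nvidia-l4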

Performance

With NVIDIA GPUs, Cloud Run delivers strong performance on top of its operational simplicity. Google minimizes infrastructure latency so that you can serve your models with the best possible performance.

Cloud Run instances with an attached L4 GPU and pre-installed drivers start in about 5 seconds, at which point the processes inside your container can begin using the GPU. Loading and initializing the model and framework then takes a few more seconds. The table below shows cold-start times of 11 to 35 seconds for the Gemma 2b, Gemma 2 9b, Llama 2 7b/13b, and Llama 3.1 8b models running on the Ollama framework. This measures how long it takes to launch an instance from scratch, load the model into the GPU, and get the first word back from the LLM.

Model          Model Size   Cold Start Time
gemma:2b       1.7 GB       11-17 seconds
gemma2:9b      5.1 GB       25-30 seconds
llama2:7b      3.8 GB       14-21 seconds
llama2:13b     7.4 GB       23-35 seconds
llama3.1:8b    4.7 GB       15-21 seconds
Cold start time: the time from the first invocation of the service URL for the Cloud Run instance to scale from 0 to 1 and serve the first word of the response.
Models: 4-bit quantized versions of each of the models above, deployed using the Ollama framework.
Note that these numbers were observed in a controlled lab environment; actual performance may vary depending on a variety of factors.
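One way to observe the cold start yourself, as a sketch (SERVICE_URL is a placeholder for your service’s URL, this assumes the model is served with Ollama as in the table above, and that the service has already scaled to zero), is to time how long the first streamed token takes:

# Time from a cold request to the first streamed response line from Ollama.
time curl -sN https://SERVICE_URL/api/generate \
  -d '{"model": "gemma2", "prompt": "Hello", "stream": true}' \
  | head -n 1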

Deploy a sample Ollama app

Here’s how to deploy Google’s Gemma 2 9b model with Ollama on Cloud Run using NVIDIA GPUs. The Gemma family of open models consists of lightweight, state-of-the-art models built from the same research and technology as the Gemini models. Ollama is a framework that offers a straightforward API for running large language models.

First, use this Dockerfile to build a container image that bundles the model with Ollama:

FROM ollama/ollama
ENV HOME /root
WORKDIR /
# Start the Ollama server in the background, give it time to come up,
# then pull the gemma2 model so it is baked into the container image.
RUN ollama serve & sleep 10 && ollama pull gemma2
ENTRYPOINT ["ollama", "serve"]

Next, use the following command to deploy:

gcloud beta run deploy --source . --port 11434 --no-cpu-throttling --cpu 8 --memory 32Gi --gpu 1 --gpu-type=nvidia-l4

That’s it! Once it’s deployed, you can start a conversation with Gemma 2 through the Ollama API.
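For example, a quick test with curl against Ollama’s standard generation endpoint (SERVICE_URL is a placeholder for the URL Cloud Run prints after deployment):

# Send a single non-streaming prompt to the deployed Gemma 2 model.
curl https://SERVICE_URL/api/generate -d '{
  "model": "gemma2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'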

You can also use NVIDIA NIM inference microservices, part of the NVIDIA AI Enterprise software suite available on the Google Cloud Marketplace. NIM offers secure, reliable deployment of high-performance AI model inferencing, streamlining AI inference deployments and optimizing performance on NVIDIA L4 GPUs on Cloud Run.

Start now

Cloud Run makes hosting your web apps remarkably easy, and with GPU support now available, Google Cloud brings the same serverless simplicity and scalability to your AI inference apps!
