Use Vertex AI with Cloud Run to Unlock Inference as a Service
It’s no secret that generative AI and large language models (LLMs) have become important components of modern applications. However, most core LLMs are consumed as a service: third parties host and serve them and provide access through APIs. That dependence on external APIs can become a bottleneck for developers.
There are plenty of tried-and-true methods for hosting applications. Until recently, the same could not be said for the LLMs those applications rely on. Inference as a Service is one approach developers can consider to increase velocity. Let’s look at how it benefits LLM-powered apps.
What is Inference as a Service?
Everything in cloud computing is a service. Instead of purchasing physical servers to house your databases and apps, for instance, you consume servers from a cloud provider as a metered service. The word “metered” is crucial: as an end user, you pay for the storage and compute time you actually use. Terms like “Software as a Service,” “Platform as a Service,” and “Functions as a Service” have been part of the cloud lexicon for more than a decade.
With “Inference as a Service,” a business application can communicate with a machine learning model (in this case, an LLM) with minimal operational overhead. In other words, you don’t need to manage infrastructure to run the code that talks to the LLM.
Why Cloud Run for Inference as a Service
Cloud Run is Google Cloud’s serverless container platform. In short, it lets developers take advantage of container runtimes without worrying about the underlying infrastructure. Serverless has traditionally focused on functions, but because Cloud Run only charges while a service is handling requests, it’s a great fit for powering your LLM-powered apps.
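The Cloud Run contract is simple: package an HTTP server in a container and listen on the port Cloud Run provides in the PORT environment variable. Here is a minimal sketch of such a service in Python; the use of Flask and the endpoint name are illustrative assumptions, not part of the original article.

```python
# app.py - a minimal HTTP service suitable for deployment on Cloud Run.
# Cloud Run injects the PORT environment variable; the container must listen on it.
import os
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Simple readiness endpoint; Cloud Run scales instances to zero when idle.
    return jsonify(status="ok")

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
```

A container image built from an app like this can be deployed with `gcloud run deploy`, and you are billed only while requests are being served.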
First, familiarize yourself with Vertex AI. Vertex AI is Google Cloud’s all-in-one AI/ML platform, providing the foundational tools a business needs to train and run ML models. Its Model Garden offers more than 160 foundation models, including first-party (Gemini), third-party, and open-source models.
Before using Vertex AI for inference, enable the Gemini API. Vertex AI offers both standard and express modes for inference. Then, when you deploy your application as a container on Cloud Run, simply attach the appropriate Google Cloud credentials and the application will infer with Vertex AI seamlessly. You can use this GitHub sample to test things out for yourself.
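As a rough sketch of what that application code might look like, here is a minimal call to a Gemini model through the Vertex AI Python SDK. The project ID, region, and model name are placeholders; on Cloud Run, the service account attached to the service supplies the credentials automatically.

```python
# inference.py - a minimal sketch of calling a Gemini model on Vertex AI.
# On Cloud Run, Application Default Credentials come from the service account
# attached to the service, so no key files are needed in the container.
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project and region -- replace with your own values.
vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel("gemini-1.5-flash")  # model name is illustrative

def generate_answer(prompt: str) -> str:
    """Send a prompt to Vertex AI and return the generated text."""
    response = model.generate_content(prompt)
    return response.text

if __name__ == "__main__":
    print(generate_answer("Explain Inference as a Service in one sentence."))
```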
While Vertex AI delivers managed inference endpoints, Google Cloud adds a new degree of flexibility with GPUs for Cloud Run. This radically changes the inference paradigm. Why? Because you can now containerize your LLM (or other models) and deploy it straight to Cloud Run, rather than depending entirely on Vertex AI’s infrastructure.
This means you are hosting the LLM itself on serverless infrastructure, not merely building a serverless layer around it. Models scale dynamically with demand and scale to zero when idle, which optimizes both cost and performance. For example, you could deploy a chat agent on one Cloud Run service and the LLM on another, so each can scale and be managed independently; a sketch of the agent side follows below. With GPU acceleration, a Cloud Run service can be ready for inference in under 30 seconds.
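To illustrate that two-service pattern, here is a hedged sketch of the chat agent calling a separate model-serving service on Cloud Run. It assumes the model container exposes a simple /generate JSON endpoint (the URL, path, and payload shape are assumptions for illustration) and that service-to-service calls are authenticated with an ID token.

```python
# agent.py - sketch of one Cloud Run service (the chat agent) calling another
# Cloud Run service that hosts the model. The endpoint path and JSON shape are
# illustrative assumptions, not a documented API.
import requests
import google.auth.transport.requests
import google.oauth2.id_token

# Placeholder URL of the model-serving Cloud Run service.
MODEL_SERVICE_URL = "https://llm-service-abc123-uc.a.run.app"

def call_model(prompt: str) -> str:
    # Mint an ID token for authenticated service-to-service invocation.
    auth_request = google.auth.transport.requests.Request()
    token = google.oauth2.id_token.fetch_id_token(auth_request, MODEL_SERVICE_URL)

    response = requests.post(
        f"{MODEL_SERVICE_URL}/generate",           # hypothetical endpoint
        headers={"Authorization": f"Bearer {token}"},
        json={"prompt": prompt},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]                 # hypothetical response field
```

Because both services scale independently, the GPU-backed model service can scale to zero when the agent is idle.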
Tailor your LLM with RAG
In addition to hosting and scaling LLMs, you’ll often need to tailor their responses to specific datasets or domains. This is where Retrieval-Augmented Generation (RAG) comes in: a key part of extending your LLM experience and a fast-emerging standard for contextual customization.
Think of it this way: LLMs are trained on large, general datasets, but your applications need to use your data. RAG stores embeddings of your private information in a vector database such as AlloyDB. When your application queries an LLM, RAG retrieves the relevant embeddings, giving the LLM the context it needs to produce precise, detailed answers.
Inference as a Service can be applied in several ways. In this architecture, for instance, Cloud Run manages the central inference logic, coordinating communication between Vertex AI and AlloyDB. In particular, it handles the entire RAG data flow, acting as the bridge that retrieves data from AlloyDB and sends queries to Vertex AI.
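A minimal sketch of that RAG flow inside the Cloud Run service might look like the following. It assumes AlloyDB is reachable as a standard PostgreSQL database with the pgvector extension and a pre-populated documents table; the table name, column names, connection details, and embedding model are assumptions for illustration.

```python
# rag.py - sketch of a RAG request path: embed the query, retrieve context
# from AlloyDB (PostgreSQL + pgvector), then ask a Gemini model on Vertex AI.
import psycopg2
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")
llm = GenerativeModel("gemini-1.5-flash")

def answer_with_rag(question: str) -> str:
    # 1. Embed the user's question.
    query_vector = embedding_model.get_embeddings([question])[0].values
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"

    # 2. Retrieve the most similar documents from AlloyDB via pgvector.
    conn = psycopg2.connect(host="10.0.0.3", dbname="ragdb",
                            user="raguser", password="...")  # placeholders
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM documents "
            "ORDER BY embedding <-> %s::vector LIMIT 3",
            (vector_literal,),
        )
        context = "\n".join(row[0] for row in cur.fetchall())

    # 3. Ground the LLM with the retrieved context.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate_content(prompt).text
```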

Let’s take an example

Consider the architecture of a chatbot. In this architecture, Cloud Run hosts the chatbot itself, and a developer can build the application with popular tools like LangChain and Streamlit. The service can then store and retrieve embeddings in AlloyDB and run inference against LLMs hosted in the Vertex AI Model Garden or on another Cloud Run instance. The result, sketched below, is a serverless runtime for a customizable gen AI chatbot.
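To make that concrete, here is a hedged sketch of such a chatbot front end using Streamlit and the LangChain Vertex AI integration. The model name is a placeholder, credentials come from Application Default Credentials, and the AlloyDB retrieval step is omitted for brevity.

```python
# chatbot.py - a minimal Streamlit chat UI backed by a Gemini model on Vertex AI
# through LangChain. Run locally with `streamlit run chatbot.py`, or package it
# in a container and deploy it to Cloud Run.
import streamlit as st
from langchain_google_vertexai import ChatVertexAI

# Placeholder model name; credentials come from Application Default Credentials.
llm = ChatVertexAI(model_name="gemini-1.5-flash")

st.title("Serverless gen AI chatbot")

# Keep the conversation in Streamlit's session state.
if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

if prompt := st.chat_input("Ask me anything"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    # LangChain chat models accept (role, content) tuples as messages.
    reply = llm.invoke(
        [(m["role"], m["content"]) for m in st.session_state.messages]
    ).content

    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.write(reply)
```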