Vertex AI on Google Cloud
Building or updating apps in today’s technological environment necessitates having a firm grasp of your business objectives and use cases. This understanding is essential for making the most of newly developed tools, particularly foundation models for generative AI like large language models (LLMs).
While LLMs provide clear competitive benefits, putting them into practice effectively depends on a solid understanding of your project's needs. During this process, choosing between a self-hosted option on a platform like Google Kubernetes Engine (GKE) and a managed LLM solution like Vertex AI is crucial.
Why use Google Cloud for AI development
So what should you take into account when building, deploying, and scaling LLM-powered applications? Building an AI application on Google Cloud gives you the following advantages:
- Choice: Select between using managed LLMs on Vertex AI or bringing your own open-source models.
- Flexibility: Deploy a custom architecture suited to your LLM requirements on GKE or Vertex AI.
- Scalability: Scale your LLM infrastructure as needed to meet growing demand.
- End-to-end support: Take advantage of an extensive collection of services and solutions that cover every stage of an LLM's lifecycle.
Self-hosted vs managed models
When weighing the options available for AI development on Google Cloud against your long-term strategic goals, consider factors such as team expertise, budget constraints, and customization needs. Let's quickly compare the two choices.
Managed solution
Advantages:
- Simplified setup and management
- Automatic scaling and resource optimization
- Updates and security patches handled by the service provider
- Tight integration with other Google Cloud services
- Integrated security and compliance features
Disadvantages:
- Limited customization of the deployment environment and infrastructure
- Potential vendor lock-in
- Potentially higher costs than self-hosting, particularly at scale
- Less control over the underlying infrastructure
- Potential restrictions on model choice
Self-hosted on GKE
Advantages:
- Full control over the deployment environment
- Potentially lower costs at scale
- Freedom to choose and customize any open-source model
- Greater portability across cloud providers
- Fine-grained performance tuning and resource utilization
Disadvantages:
- Requires substantial DevOps expertise for setup, maintenance, and scaling
- Responsibility for updates and security patches
- Manual configuration of load balancing and scaling
- Additional effort to ensure security and compliance
- Greater initial complexity and setup time
In summary, managed solutions like Vertex AI are ideal for teams seeking rapid deployment with low operational overhead, while self-hosted solutions on GKE offer complete control and potential cost savings for strong technical teams with specific customization needs. Let's look at an example.
Create a Java gen AI application and run it in the cloud
For this blog post, Google Cloud created an application that lets users retrieve quotes from well-known novels. The original functionality retrieved quotes from a database; adding generative AI (gen AI) capabilities broadens the feature set, letting users retrieve quotes from a managed or self-hosted large language model.
The application and its frontend are deployed to Cloud Run, while the models are either managed by Vertex AI or self-hosted on GKE (using vLLM for model serving). The app can also retrieve pre-configured book quotes from a Cloud SQL database.
Why are businesses creating generative AI apps choosing Java?
- A mature ecosystem and extensive libraries
- Robustness and scalability, ideal for managing AI workloads
- Spring AI allows for simple integration with AI models (see the sketch after this list).
- Robust security measures
- Extensive Java expertise across many businesses
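To illustrate the Spring AI integration noted in the list above, here is a minimal sketch of a service that asks a chat model for a book quote. It assumes Spring AI 1.0's auto-configured ChatClient.Builder; the class and method names (QuoteService, quoteFor) are illustrative and not taken from the sample codebase.

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

// Illustrative service: Spring AI hides whether the underlying model is
// managed by Vertex AI or self-hosted on GKE behind the same ChatClient API.
@Service
class QuoteService {

    private final ChatClient chatClient;

    QuoteService(ChatClient.Builder builder) {
        // The builder is auto-configured when a chat model starter is on the classpath.
        this.chatClient = builder.build();
    }

    String quoteFor(String book) {
        return chatClient.prompt()
                .user("Share a short, memorable quote from the novel " + book)
                .call()
                .content();
    }
}

Because the model is configured through the starter and its properties, the same service code works whichever backend you choose.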
The quickest and easiest way to get your gen AI applications into production is Cloud Run, which enables a team to:
- Create API endpoints that scale up quickly to serve requests and scale down to zero (see the example after this list).
- Use GKE-compatible portable containers to run your Java gen AI applications.
- Only pay while your code is executing.
- Write developer-friendly code and ship it with high deployment velocity.
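Building on the sketch above, here is a hedged example of the kind of request-driven endpoint Cloud Run scales: a small Spring Boot controller that delegates to the illustrative QuoteService. The /quotes path and its parameter are hypothetical, not the sample app's actual API.

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical endpoint: Cloud Run scales instances out while requests arrive
// and back down to zero when the endpoint is idle.
@RestController
class QuoteController {

    private final QuoteService quoteService;

    QuoteController(QuoteService quoteService) {
        this.quoteService = quoteService;
    }

    @GetMapping("/quotes")
    String quote(@RequestParam String book) {
        return quoteService.quoteFor(book);
    }
}

Packaged as a container, this is the kind of Java workload you can run unchanged on Cloud Run or GKE.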
Before you begin
The Java application is built with Spring Boot and uses the Spring AI orchestration framework. It is developed on top of Java 21 LTS and Spring Boot, and provides guidelines for development, testing, deployment, and running on Cloud Run.
After cloning the Git repository and verifying that Java 21 and GraalVM are configured, follow the instructions in the repository.
The codebase is supplemented with reference material for building and deploying the application to Cloud Run, as well as for deploying open models to GKE.
Deploy an open model to GKE
To begin, let's deploy an open LLM to GKE: the Meta-Llama-3.1-8B-Instruct open model.
Configure Hugging Face access and an API token for the LLM deployment
The LLM is downloaded at runtime, which requires a Hugging Face account and API token. Follow these steps to set them up:
Prerequisites:
- Make sure you have access to a Google Cloud project with sufficient quota and L4 GPUs available in your chosen region.
- Install the Google Cloud SDK and kubectl on your local terminal, or use Cloud Shell in the Google Cloud console, where the necessary tools are already installed.
Hugging Face account and API token:
- A Hugging Face API token is required to download a model such as Llama 3.1.
- To obtain access, go to Meta's resource page for Llama models (Llama Meta Downloads). You must register with an email address before you can download the materials.
- Create a Hugging Face account using the same email address you used for the Meta access request.
- Find the Meta Llama 3.1-8B Instruct model and complete its access request form, then wait for the acceptance email.
- Once accepted, go to your account profile settings and retrieve your Hugging Face access token. This token is used during deployment to authenticate and download the model files.
Use the instructions in the repository to set up a GKE cluster with the right GPU configuration and node pool for deploying a large language model (LLM) on Google Cloud. The primary steps:
gcloud container clusters create $CLUSTER_NAME \
--workload-pool "${PROJECT_ID}.svc.id.goog" \
--location "$REGION" \
--node-locations="$ZONE_1" \
--enable-image-streaming --enable-shielded-nodes \
--shielded-secure-boot --shielded-integrity-monitoring \
--enable-ip-alias \
--addons GcsFuseCsiDriver \
--enable-autoscaling \
--num-nodes 1 --min-nodes 1 --max-nodes 5 \
--ephemeral-storage-local-ssd=count=2 \
--no-enable-master-authorized-networks \
--machine-type n2d-standard-4
gcloud container node-pools create g2-standard-24 --cluster $CLUSTER_NAME \
--accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
--machine-type g2-standard-8 \
--ephemeral-storage-local-ssd=count=1 \
--enable-autoscaling \
--num-nodes=1 --min-nodes=0 --max-nodes=2 \
--node-locations $ZONE_1,$ZONE_2 --region $REGION --spot
Dissection of the GKE configuration
Deployment:
- Creates a single vllm-inference-server pod instance.
- Allocates specific resources (CPU, memory, and ephemeral storage) and uses an NVIDIA L4 GPU.
- Mounts emptyDir volumes for shared memory and the cache.
Service:
- Uses a ClusterIP to expose the deployment internally.
- Sets up the service to be reachable via port 8000.
BackendConfig:
- Specifies HTTP health checks so that the load balancer can monitor the health of the service.
Ingress:
- Sets up an Ingress resource so that the Google Cloud Load Balancer (GCLB) can expose the service.
- Routes external traffic to port 8000 of the vllm-inference-server service.
vLLM provides both a native vLLM API and an OpenAI-compatible API. This solution uses the OpenAI-compatible API because it provides consistency between managed and GKE-hosted open models.
Once the model and GCLB are deployed, you'll see that the API key for the OpenAI-compatible endpoint and the Hugging Face token are referenced in the deployment's environment variables:
- In vLLM, you define the API key for the OpenAI-compatible endpoint with the VLLM_API_KEY environment variable. It can be any combination of alphanumeric and special characters. Use Google Cloud Secret Manager to manage this secret.
- The Hugging Face token is the one from the Hugging Face account you created earlier.
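As a rough sketch of what calling the GKE-hosted model through the OpenAI-compatible API looks like, the following standalone Java snippet sends a chat completion request to the vLLM server behind the GCLB. The VLLM_BASE_URL environment variable is a placeholder for the load balancer address; the VLLM_API_KEY value should come from Secret Manager rather than being hard-coded.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class VllmChatExample {

    public static void main(String[] args) throws Exception {
        // Placeholder: address of the GCLB that exposes the vllm-inference-server service.
        String baseUrl = System.getenv("VLLM_BASE_URL");
        // Same value as the VLLM_API_KEY configured on the vLLM server (keep it in Secret Manager).
        String apiKey = System.getenv("VLLM_API_KEY");

        String body = """
            {
              "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
              "messages": [{"role": "user", "content": "Share a quote from a well-known novel."}]
            }
            """;

        // vLLM's OpenAI-compatible server accepts standard chat completion requests.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + apiKey)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}

In the sample application, a Spring AI client plays this role; the snippet only shows the wire-level request shape the OpenAI-compatible endpoint expects.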
The alternative: use a managed model on Vertex AI
Alternatively, the Meta Llama 3.1 open model can be accessed as a fully managed Vertex AI service simply by enabling it in the Vertex AI Model Garden.
The codebase for this blog post uses the meta/llama3-405b-instruct-maas open model, which has 405B parameters.
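The OpenAI-compatible pattern carries over to the managed model with two changes: the endpoint URL comes from the model's Model Garden card (shown here as a placeholder environment variable) and authentication uses an OAuth access token instead of a static key. This sketch assumes the google-auth-library dependency and is illustrative rather than the blog codebase's actual client.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.google.auth.oauth2.GoogleCredentials;

public class VertexLlamaChatExample {

    public static void main(String[] args) throws Exception {
        // Placeholder: the OpenAI-compatible chat completions URL listed on the
        // Llama model's Vertex AI Model Garden card.
        String endpoint = System.getenv("VERTEX_OPENAI_ENDPOINT");

        // Authenticate with Application Default Credentials and an OAuth access token.
        GoogleCredentials credentials = GoogleCredentials.getApplicationDefault()
                .createScoped("https://www.googleapis.com/auth/cloud-platform");
        credentials.refreshIfExpired();
        String token = credentials.getAccessToken().getTokenValue();

        String body = """
            {
              "model": "meta/llama3-405b-instruct-maas",
              "messages": [{"role": "user", "content": "Share a quote from a well-known novel."}]
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + token)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}

Because both the self-hosted and managed paths speak the same OpenAI-style chat completions contract, switching between them is mostly a matter of configuration rather than code changes.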