Saturday, December 14, 2024

Utilizing KEDA (Kubernetes Event-Driven Autoscaler) For GKE

Using KEDA to scale Google Kubernetes Engine to zero

Scaling deployments down to zero while they are idle can deliver considerable cost savings for developers and companies that run applications on Google Kubernetes Engine (GKE). Although GKE’s Cluster Autoscaler effectively manages node pool sizes, it lacks native scale-to-zero capability, so you need an alternative for applications that must shut down completely and restart (scaling the node pool all the way to and from zero). This matters particularly for applications with sporadic workloads or fluctuating traffic patterns.

This blog post shows how to integrate the open-source Kubernetes Event-driven Autoscaler (KEDA) with GKE. By paying only for the resources you actually use, KEDA lets you match your expenses directly to your demand.

Why scale to zero?

One of the main motivations for scaling to zero is cost reduction, which is applicable in many different contexts.

This is especially important when handling:

  • GPU-intensive workloads: AI/ML tasks frequently require powerful GPUs, which are costly to keep running even when idle.
  • Applications with known downtime: Internal tools with set usage hours can release resources when the apps are only used during business hours or on certain days of the week.
  • Seasonal applications: Workloads with known periods of low activity can scale to zero during the off-season.
  • On-demand staging environments: For testing and validation, replicate production environments, then scale them down to zero after testing is over.
  • Environments for development, demonstration, and proof-of-concept:
    • Temporary demonstrations: Present features or apps to stakeholders or clients, then scale back resources following the demonstration.
    • Temporary proof-of-concept installations: Test novel concepts or technologies in a realistic setting, then scale to zero once the assessment is complete.
    • Development environment: Optimize expenses for short-term workloads by spinning up resources for testing, code reviews, or feature branches and scaling them down to zero when not needed.
  • Applications that are event-driven:
    • Microservices with irregular traffic: Optimize resource use for erratic traffic patterns by scaling individual services to zero when they are not in use and automatically scaling them up when requests come in.
    • Serverless functions: Run code in response to events without managing servers, scaling to zero when not in use.
  • Disaster recovery and business continuity: Reduce expenses while maintaining company continuity by keeping a small number of essential resources in a standby condition that can be quickly scaled up in the event of a disaster.

What is KEDA?

KEDA is a Kubernetes-based event-driven autoscaler. With KEDA, you can control the scaling of any Kubernetes container based on the volume of events that need to be processed.

KEDA is a lightweight, single-purpose component that can be added to any Kubernetes cluster. It works alongside standard Kubernetes components such as the Horizontal Pod Autoscaler and extends their functionality without duplicating or overwriting it. With KEDA, you can explicitly choose which applications to scale in an event-driven way while leaving other apps running unchanged. This makes KEDA a flexible and safe option to run alongside any number of other Kubernetes applications or frameworks.
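
As a minimal sketch, here is what a ScaledObject (KEDA’s core custom resource) might look like for the “internal tools with set hours” case above; the deployment name and schedule are hypothetical, and the cron scaler is just one of KEDA’s built-in triggers:

```yaml
# Minimal ScaledObject sketch: scale a hypothetical "internal-tool"
# Deployment to zero outside business hours via the built-in cron scaler.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: internal-tool-scaler
spec:
  scaleTargetRef:
    name: internal-tool          # hypothetical Deployment to scale
  minReplicaCount: 0             # allows scale-to-zero; the standard HPA cannot go below 1
  maxReplicaCount: 5
  triggers:
    - type: cron                 # one of KEDA's built-in scalers
      metadata:
        timezone: America/New_York
        start: 0 8 * * 1-5       # scale up at 08:00, Monday-Friday
        end: 0 18 * * 1-5        # scale back down at 18:00
        desiredReplicas: "2"
```

Under the hood, KEDA creates and manages the HPA object for the 1-to-n range and handles the 0-to-1 transition itself, which is why it composes with the Horizontal Pod Autoscaler rather than replacing it.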

Features

Easy Autoscaling

Bring rich scaling to every workload in your Kubernetes cluster.

Event-driven

Intelligently scale your event-driven application.

Built-in Scalers

A catalog of more than fifty built-in scalers for databases, messaging systems, telemetry systems, cloud platforms, CI/CD tools, and more.

Multiple Workload Types

Support for a range of workload types that implement the /scale sub-resource, including deployments, jobs, and custom resources.

Reduce environmental impact

Build sustainable platforms through scale-to-zero and workload scheduling optimization.

Extensible

Use community-maintained scalers or bring your own.

Vendor-Agnostic

Trigger support for a range of cloud providers and products.

Azure Functions Support

Run and scale your Azure Functions in production workloads on Kubernetes.

Introducing KEDA for GKE

KEDA is an open-source, Kubernetes-native solution that lets you scale deployments based on a range of metrics and events. External events, such as incoming HTTP requests or the depth of a message queue, can trigger KEDA scaling operations. Unlike the standard Horizontal Pod Autoscaler (HPA), KEDA supports scaling workloads to zero, which makes it a good option for managing sporadic jobs or applications with variable demand.

Use cases

Let’s examine two situations where KEDA’s scale-to-zero capabilities are useful:

Scaling a Pub/Sub worker

  • Scenario: A deployment processes messages from a Pub/Sub topic. Scaling down to zero when no messages are available saves money and resources.
  • Solution: KEDA’s Pub/Sub scaler monitors the subscription and triggers scaling operations as needed. By configuring a ScaledObject resource, you can specify that the deployment scale down to zero replicas when the queue is empty, as sketched below.
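
A sketch of what this might look like, assuming GKE Workload Identity is configured so KEDA can query Pub/Sub; the subscription and deployment names are placeholders:

```yaml
# Sketch: scale a hypothetical "pubsub-worker" Deployment on Pub/Sub backlog.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-gcp-credentials
spec:
  podIdentity:
    provider: gcp                 # authenticate to Pub/Sub via Workload Identity
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: pubsub-worker-scaler
spec:
  scaleTargetRef:
    name: pubsub-worker           # placeholder Deployment name
  minReplicaCount: 0              # scale to zero when the subscription is empty
  maxReplicaCount: 10
  triggers:
    - type: gcp-pubsub
      metadata:
        subscriptionName: my-subscription   # placeholder subscription
        mode: SubscriptionSize              # scale on undelivered message count
        value: "5"                          # target messages per replica
      authenticationRef:
        name: keda-gcp-credentials
```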

Scaling a workload that depends on GPUs, like an Ollama deployment for LLM serving

  • Scenario: An Ollama-based large language model (LLM) serves inference requests. To reduce GPU consumption and costs, the deployment must scale down to zero when there are no inference requests.
  • Solution: Combining Ollama with HTTP-KEDA, a beta KEDA add-on, enables scale-to-zero functionality. Ollama serves the LLM, while HTTP-KEDA scales the deployment based on HTTP request metrics, as sketched below.
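
A rough sketch of the HTTP add-on’s HTTPScaledObject for this scenario follows. The names, host, and port are illustrative, and because the add-on is still beta, field names can differ between versions:

```yaml
# Sketch: HTTPScaledObject for a hypothetical Ollama Deployment and Service.
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: ollama
spec:
  hosts:
    - ollama.example.com          # illustrative host the interceptor routes on
  scaleTargetRef:
    name: ollama                  # the Ollama Deployment
    kind: Deployment
    apiVersion: apps/v1
    service: ollama               # Service in front of the Deployment
    port: 11434                   # Ollama's default HTTP port
  replicas:
    min: 0                        # release the GPU capacity when idle
    max: 2
```

The add-on routes traffic through an interceptor proxy that holds incoming requests while the deployment scales up from zero, which is also the source of the cold-start latency discussed below.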

Start using KEDA on GKE

KEDA provides a powerful and flexible way to achieve scale-to-zero functionality on GKE. By taking advantage of KEDA’s event-driven scaling features, you can optimize resource usage, reduce costs, and increase the efficiency of your Kubernetes deployments. Because scaling to zero can affect workload performance, remember to evaluate your use cases.

Scaling to zero can increase latency because of cold starts: when an application has scaled to zero, no instances are running, so the first request after an idle period must wait for a new instance to launch. State management is another factor to consider, since any in-memory state is lost when instances are terminated.
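
To get started, one common way to install KEDA (and, if you want the Ollama-style HTTP scaling, the beta HTTP add-on) into a GKE cluster is with the official Helm charts. This sketch assumes Helm 3 and a kubectl context that already points at your cluster:

```bash
# Add the official KEDA chart repository.
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

# Install KEDA core into its own namespace.
helm install keda kedacore/keda --namespace keda --create-namespace

# Optional: the beta HTTP add-on used in the Ollama example above.
helm install http-add-on kedacore/keda-add-ons-http --namespace keda
```

From there, you can apply ScaledObject or HTTPScaledObject resources like the sketches above to your own workloads.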
