Google Kubernetes Engine Capabilities
If you run workloads on Google Kubernetes Engine, you have likely encountered “cold starts”: delays in application launch that occur when a workload is scheduled onto a node that has not hosted it before, so its pods must spin up from scratch. When an application autoscales to handle a traffic spike, the longer startup time can increase response times and degrade the user experience.
What happens during a cold start? Deploying a containerized application on Kubernetes typically involves pulling container images, starting containers, and initializing the application code. These steps extend the time before a pod can begin serving traffic, which raises latency for the first requests a new pod serves. If the container image is not already present on the new node, the initial startup can take considerably longer. For subsequent requests, the pod is already up and warm, so it does not need to start again.
When pods are repeatedly shut down and restarted, requests keep landing on fresh, cold pods, so cold starts happen often. A common remedy is to keep a warm pool of pods ready to reduce cold start latency.
However, warm pools can be very expensive for heavier workloads such as AI/ML, particularly on costly and in-demand GPUs. As a result, cold starts are especially common for AI/ML workloads, where pods are often shut down once their requests complete.
Google Kubernetes Engine (GKE), the managed Kubernetes service offered by Google Cloud, can make it easier to deploy and maintain complex containerized workloads. This article covers four methods for lowering cold start latency on GKE so that you can deliver responsive services.
Methods for overcoming the cold start challenge
Use ephemeral storage with local SSDs or larger boot disks
With this option, nodes mount the root directories of the kubelet and the container runtime (Docker or containerd) on a local SSD. The container layer is therefore backed by the local SSD; its IOPS and throughput are documented on the About local SSDs page. This is generally more cost-effective than increasing the persistent disk (PD) size.
The following table compares the options at equal cost. According to its figures, LocalSSD delivers roughly 2.5 to 8 times the throughput of PD Balanced, depending on the size tier and on whether you look at reads or writes. The higher throughput lets image pulls finish faster and lowers the startup latency of the workload.
Cost per month (same for both) | LocalSSD storage (GB) | LocalSSD read (MB/s) | LocalSSD write (MB/s) | PD Balanced storage (GB) | PD Balanced read+write (MB/s) | LocalSSD / PD (read) | LocalSSD / PD (write) |
$ | 375 | 660 | 350 | 300 | 140 | 471% | 250% |
$$ | 750 | 1320 | 700 | 600 | 168 | 786% | 417% |
$$$ | 1125 | 1980 | 1050 | 900 | 252 | 786% | 417% |
$$$$ | 1500 | 2650 | 1400 | 1200 | 336 | 789% | 417% |
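The ratio columns in the table follow directly from the listed throughput figures. The sketch below recomputes them and, purely as an illustration, estimates how long writing a container image would take at each write throughput. The 10 GB image size is a hypothetical value, not a figure from this article, and the estimate ignores network transfer and decompression, so it only reflects the disk-bound part of an image pull.

```python
# Throughput figures (MB/s) copied from the table above, one entry per cost tier.
# "pd" is the combined read+write throughput listed for PD Balanced.
TIERS = [
    {"ssd_read": 660, "ssd_write": 350, "pd": 140},
    {"ssd_read": 1320, "ssd_write": 700, "pd": 168},
    {"ssd_read": 1980, "ssd_write": 1050, "pd": 252},
    {"ssd_read": 2650, "ssd_write": 1400, "pd": 336},
]

# Hypothetical image size, used only for illustration (not from the article).
IMAGE_MB = 10_000


def ratio_percent(local_ssd_mbps, pd_mbps):
    """LocalSSD throughput as a percentage of PD throughput, rounded."""
    return round(100 * local_ssd_mbps / pd_mbps)


def write_seconds(size_mb, throughput_mbps):
    """Time to write size_mb at a sustained throughput, ignoring other costs."""
    return size_mb / throughput_mbps


for tier in TIERS:
    print(
        f"read {ratio_percent(tier['ssd_read'], tier['pd'])}%, "
        f"write {ratio_percent(tier['ssd_write'], tier['pd'])}%, "
        f"10 GB write: LocalSSD {write_seconds(IMAGE_MB, tier['ssd_write']):.0f}s "
        f"vs PD {write_seconds(IMAGE_MB, tier['pd']):.0f}s"
    )
```

Because the PD Balanced figure is a combined read+write number, the comparison mirrors the table rather than a like-for-like benchmark.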
To use ephemeral storage backed by local SSDs, you can create a node pool in an existing cluster running GKE version 1.25.3-gke.1800 or later.
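As a rough sketch of what creating such a node pool might look like, the helper below assembles a `gcloud container node-pools create` command that uses the `--ephemeral-storage-local-ssd` flag. The pool name, cluster name, machine type, and SSD count are hypothetical placeholders, and the helper only prints the command rather than running it. Check the flags and the local SSD counts allowed for your machine type against the gcloud reference before using it.

```python
import shlex


def local_ssd_node_pool_command(pool, cluster, machine_type, ssd_count):
    """Build the gcloud argument list for a node pool whose ephemeral
    storage is backed by local SSDs. Nothing is executed here."""
    if ssd_count < 1:
        raise ValueError("ssd_count must be at least 1")
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--machine-type={machine_type}",
        "--ephemeral-storage-local-ssd", f"count={ssd_count}",
    ]


# All four values below are placeholders chosen for illustration.
cmd = local_ssd_node_pool_command("ssd-pool", "my-cluster", "n1-standard-8", 1)
print(shlex.join(cmd))
```

Printing the command first makes it easy to review before pasting it into a shell or passing the list to `subprocess.run`.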
Turn on streaming for container images
Image streaming can significantly reduce workload startup time because it lets workloads start without waiting for the entire image to be downloaded. For example, with GKE image streaming, the end-to-end startup time of an NVIDIA Triton Server (from workload creation to the server being ready for traffic) can be reduced from 191s to 30s.
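To put the Triton figures in perspective, this short calculation expresses the change from 191s to 30s as a speedup factor and as a percentage reduction. It uses only the two numbers quoted above; the function names are illustrative.

```python
def speedup(before_s, after_s):
    """How many times faster the new startup is."""
    return before_s / after_s


def reduction_percent(before_s, after_s):
    """Share of the original startup time that was eliminated."""
    return 100 * (before_s - after_s) / before_s


before_s, after_s = 191, 30  # seconds, from the example above
print(f"{speedup(before_s, after_s):.1f}x faster")
print(f"{reduction_percent(before_s, after_s):.0f}% less startup time")
```

That works out to roughly a 6.4x speedup, or about 84% less startup time, for this particular workload; other images and applications will differ.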
Use Zstandard-compressed container images
containerd supports Zstandard compression. The Zstandard benchmarks indicate that zstd decompresses more than three times faster than gzip.
Note that image streaming and Zstandard are incompatible. Zstandard is preferable if your application has to load most of the container image content before it starts. Try image streaming if your application needs only a small part of the container image in order to begin running.
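The rule of thumb above can be written as a small decision helper. This is only a sketch: the 0.5 threshold is an arbitrary assumption rather than a value from this article, and in practice you would measure startup time with each option instead of relying on a fixed cutoff.

```python
def pick_image_optimization(fraction_needed_at_startup, threshold=0.5):
    """Suggest one of the two mutually exclusive options.

    fraction_needed_at_startup: estimated share (0.0 to 1.0) of the image
    content the application must read before it can start.
    threshold: hypothetical cutoff between "most" and "a small part".
    """
    if not 0.0 <= fraction_needed_at_startup <= 1.0:
        raise ValueError("fraction_needed_at_startup must be between 0 and 1")
    if fraction_needed_at_startup >= threshold:
        return "zstandard"
    return "image-streaming"


print(pick_image_optimization(0.9))  # most of the image is needed up front
print(pick_image_optimization(0.1))  # only a small part is needed to start
```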
Use a preloader DaemonSet to preload the base container on nodes
Finally, containerd reuses image layers across containers that share a base container. In addition, the preloader DaemonSet can start running even before the GPU driver is installed (driver installation takes around 30 seconds). This means it can begin pulling images early and preload the required containers before the GPU workload can be scheduled onto the GPU node.
Here’s an example of a preloader DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: container-preloader
  labels:
    k8s-app: container-preloader
spec:
  selector:
    matchLabels:
      k8s-app: container-preloader
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: container-preloader
        k8s-app: container-preloader
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      containers:
      - image: ""
        name: container-preloader
        command: [ "sleep", "inf" ]
Getting past the cold start
Cold starts are a common problem in container orchestration systems. With appropriate design and optimization, you can minimize their effect on your applications running on GKE. Using ephemeral storage with local SSDs or larger boot disks, turning on container image streaming or Zstandard compression, and preloading the base container with a DaemonSet can all reduce cold start delays and help you run a more responsive and efficient system.