You may have noticed that your cluster upgrades run considerably faster if you use Google Kubernetes Engine (GKE) to run stateful applications. You're not imagining things. A recent enhancement to GKE and Google Compute Engine has greatly increased the speed at which Persistent Disks (PDs) are attached and detached. The result is lower latency for user workloads and their interactions with persistent storage. This is particularly noticeable during cluster upgrades, which historically slowed down because moving disks to a new virtual machine required a large number of attach and detach requests.
Kubernetes storage and the GCP CSI driver
Before most applications ran in containers on Kubernetes, a workload's lifecycle (and that of its storage) was tied to its underlying virtual machine (VM). Moving a workload to a new VM was laborious and error-prone: you had to decommission the old VM, manually detach the disks and reattach them to the new VM, remount the disk paths, and restart the application.
As a result, disks were rarely detached from one VM and reattached to another, and when they were, it was usually only during startup and shutdown or during periods of low traffic. This made it hard to recommend Kubernetes as a platform for running IO-bound, stateful applications. Google Cloud needed a solution focused on storage.
That solution was the Google Compute Engine Persistent Disk (GCE PD) Container Storage Interface (CSI) Storage Plugin. An essential part of GKE, this GCP CSI driver manages the Compute Engine PD lifecycle in a GKE cluster. It handles tasks such as provisioning, attaching, detaching, and updating filesystems, letting Kubernetes workloads consume storage with ease. This is what allows workloads to move smoothly between GKE nodes, enabling migrations, scaling, and upgrades.
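To make this concrete, here is a minimal sketch, using Go and client-go (v0.29+), of how a workload requests a PD through the CSI driver. The StorageClass name standard-rwo (a GKE-provided class backed by the pd.csi.storage.gke.io provisioner), the claim name, and the namespace are illustrative assumptions, not part of the design described here.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// "standard-rwo" is a GKE StorageClass provisioned by pd.csi.storage.gke.io.
	storageClass := "standard-rwo"
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "data-volume"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &storageClass,
			Resources: corev1.VolumeResourceRequirements{ // client-go v0.29+
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("100Gi"),
				},
			},
		},
	}

	// The CSI driver provisions the PD, then attaches and detaches it as
	// pods referencing this claim move between nodes.
	created, err := client.CoreV1().PersistentVolumeClaims("default").
		Create(context.Background(), pvc, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("created PVC %s\n", created.Name)
}
```

The workload never deals with Compute Engine directly; every attach and detach underneath this claim is issued by the CSI driver, which is exactly the traffic the optimization below targets.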
But there's a catch. GKE offers high-scale, flexible workload placement: nodes can run hundreds of pods, and each workload may use several Persistent Volumes. That translates to tens or hundreds of PDs attached to VMs that must be monitored, managed, and reconciled. To preserve workload availability and minimize cluster upgrade delays, the time needed to restart workload pods and move their PDs to a new node during a node upgrade must be kept short. Compared with the earlier system, which was designed for VMs, this can mean an order of magnitude more attach/detach operations.
Due to the exponential growth of stateful applications on GKE, Google Cloud needed a GCP CSI driver design that could handle these operations efficiently at scale. To minimize downtime and make workload moves smooth, Google Cloud had to rethink the underlying architecture and optimize the PD attach and detach paths.
Merging queued operations for volume attachments
As mentioned above, serialized volume detach and attach operations caused very high latency during software upgrades for GKE nodes with a large number of PD volumes (up to 128). Take, for example, a node with 64 attached PD volumes. Before the recent optimization, the GCP CSI driver sent 64 requests to detach every disk from the original node and 64 requests to reattach every disk to the upgraded node. Compute Engine, however, only allowed up to 32 of these requests to be queued at once before serially processing the associated operations.
Requests that could not be admitted to the queue had to be retried by the GCP CSI driver until capacity became available. With each of the 128 detach and attach operations taking about five seconds, the node upgrade could be delayed by more than ten minutes (128 × 5 s = 640 s, or roughly 10.7 minutes). With the new optimization, this latency drops to slightly over a minute.
Google Cloud felt it was crucial to introduce this optimization transparently, without disrupting customers. The GCP CSI driver tracks and retries attach and detach operations at the per-volume level, and because the CSI interfaces aren't designed for large-scale batched operations, simply changing the OSS community specification wasn't an option. Google Cloud's answer was transparent operation merging in Compute Engine, which lets the Compute Engine control plane combine incoming attach and detach requests into a single workflow while preserving rollback and per-operation fault handling.
This newly added operation merging in Compute Engine transparently parallelizes detach and attach operations. Enhanced queue capacity also allows up to 128 outstanding requests per node. The GCP CSI driver keeps operating exactly as it has in the past, issuing individual detach and attach requests without modification, and benefits from the Compute Engine optimization, which opportunistically merges the queued operations.
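The pattern at work here is a familiar one: coalesce everything that queued up while the previous operation was running into a single batch, while still reporting a result per operation. Below is a minimal Go sketch of that idea; the names (merger, Submit, drain) are illustrative, not Compute Engine internals.

```go
package main

import (
	"fmt"
	"sync"
)

// op is one attach or detach request with its own result channel,
// preserving per-operation fault handling.
type op struct {
	disk   string
	attach bool
	done   chan error
}

// merger coalesces requests that arrive while a batch is executing.
type merger struct {
	mu      sync.Mutex
	pending []op
	running bool
}

// Submit enqueues a single request, exactly as an unmodified caller would
// issue it, and starts a drain worker if none is running.
func (m *merger) Submit(disk string, attach bool) chan error {
	done := make(chan error, 1)
	m.mu.Lock()
	m.pending = append(m.pending, op{disk: disk, attach: attach, done: done})
	if !m.running {
		m.running = true
		go m.drain()
	}
	m.mu.Unlock()
	return done
}

// drain repeatedly grabs everything queued so far and executes it as one
// merged batch: the first request runs immediately, and whatever arrives
// while it runs is coalesced into the next batch.
func (m *merger) drain() {
	for {
		m.mu.Lock()
		batch := m.pending
		m.pending = nil
		if len(batch) == 0 {
			m.running = false
			m.mu.Unlock()
			return
		}
		m.mu.Unlock()

		// Stand-in for one merged control-plane workflow; the real system
		// also rolls back and reports failures per operation.
		for _, o := range batch {
			o.done <- nil
		}
		fmt.Printf("executed merged batch of %d operation(s)\n", len(batch))
	}
}

func main() {
	m := &merger{}
	var results []chan error
	for i := 0; i < 64; i++ {
		results = append(results, m.Submit(fmt.Sprintf("pd-%d", i), true))
	}
	for _, r := range results {
		<-r // each caller still observes its own operation's outcome
	}
}
```

The important property is that callers don't change: each Submit still looks like an independent request, which mirrors how the unmodified CSI driver transparently benefits from merging performed inside Compute Engine.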
Once the initially running detach or attach operation finishes, Compute Engine computes the final state of the node-volume attachments and then reconciles that state with downstream systems. The effect of this merging is that, for 64 concurrent attach operations, the first attachment starts executing immediately, and the subsequent 63 operations are merged and queued to execute as soon as the first one finishes. GKE saw startling improvements in end-to-end latency as a result. The best part is that these improvements apply automatically; no customer action is required:
- P99 latency for attach workloads has improved by about 80%.
- P99 latency for detach workloads has improved by about 60%.

This novel solution introduces a tiered approach to workflow execution. Two new workflows manage the basic business-logic workflow, decoupling incoming requests from the actual execution of volume attachments and detachments.

When the Compute Engine API receives a request to attach a disk, it now creates up to three workflows instead of just one:
- The first workflow is tied directly to the attachDisk operation, but it does not drive the actual execution. Instead, it simply polls for completion and handles any errors surfaced in the user-visible operation.
- The second workflow acts as a watchdog for pending attachDisk and detachDisk operations. Each Compute Engine instance resource has a single watchdog flow: it is created on demand as part of the original attach/detach request, provided no watchdog flow already exists, and it exits once all outstanding operations have been marked complete.
- Finally, the third workflow performs the actual attachDisk/detachDisk processing; a watchdog flow guarantees that it exists. This business-logic workflow directly informs the operation-polling workflows whether each attachDisk or detachDisk operation succeeded or failed.
In addition, to help deliver low HTTP latency and reduce database contention, incoming attachDisk/detachDisk requests are not queued directly in the same database row as the target Compute Engine instance entity. Instead, each request is written as a sub-resource row of the instance entity; the watchdog flow watches for these rows and picks them up for processing in first-in, first-out (FIFO) order (for GKE). The sketch below models this division of labor with goroutines.
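Here is a hedged Go sketch of the three-tier split, modeling the workflows as goroutines and the sub-resource rows as an in-memory FIFO queue. All names (instance, AttachDisk, runWatchdog, process) are illustrative assumptions, not Compute Engine internals.

```go
package main

import (
	"fmt"
	"sync"
)

// request stands in for a sub-resource row recorded for the instance.
type request struct {
	disk string
	done chan error
}

type instance struct {
	mu       sync.Mutex
	queue    []request // FIFO queue of pending requests
	watchdog bool      // at most one watchdog per instance
}

// AttachDisk plays the role of the first, user-visible workflow: it records
// the request and then only polls for completion; it never executes the
// attach itself.
func (in *instance) AttachDisk(disk string) error {
	done := make(chan error, 1)
	in.mu.Lock()
	in.queue = append(in.queue, request{disk: disk, done: done})
	if !in.watchdog {
		in.watchdog = true
		go in.runWatchdog() // second workflow: created on demand
	}
	in.mu.Unlock()
	return <-done // poll until the business-logic workflow reports a result
}

// runWatchdog plays the role of the second workflow: it does no disk work
// itself, only guarantees a business-logic workflow exists while operations
// are outstanding, and exits once the queue is empty.
func (in *instance) runWatchdog() {
	for {
		in.mu.Lock()
		if len(in.queue) == 0 {
			in.watchdog = false
			in.mu.Unlock()
			return
		}
		batch := in.queue
		in.queue = nil
		in.mu.Unlock()
		in.process(batch)
	}
}

// process plays the role of the third, business-logic workflow: it performs
// the actual attach handling for a merged batch and informs each waiting
// poller of its individual outcome.
func (in *instance) process(batch []request) {
	fmt.Printf("business-logic workflow: attaching %d disk(s) in one pass\n", len(batch))
	for _, r := range batch {
		r.done <- nil
	}
}

func main() {
	in := &instance{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = in.AttachDisk(fmt.Sprintf("pd-%d", i))
		}(i)
	}
	wg.Wait()
}
```

Separating the roles this way keeps the user-visible operation cheap, since it only records the request and polls, while the singleton watchdog ensures exactly one executor drains each instance's queue.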