Google Cloud is making inference simpler and more affordable by bringing Kubernetes-native distributed and disaggregated inference to vLLM, allowing it to scale out fully. The new project is called llm-d. Google Cloud is a founding contributor alongside Red Hat, IBM Research, NVIDIA, and CoreWeave, and major industry players including AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI are also involved.
As the leading platform for AI development, Google has a long history of creating and contributing to important open-source projects that have shaped the cloud, including Kubernetes, Istio, and JAX. In its view, making llm-d open-source and community-led is the best way to make it widely accessible: you can run it anywhere and know that a robust community stands behind it.
What is llm-d?
llm-d is a Kubernetes-native, high-performance distributed LLM inference framework, also described as a Kubernetes-native distributed inference serving stack. The project was launched in May 2025 by founding contributors Google, IBM Research, NVIDIA, Red Hat, and CoreWeave, along with other prominent industry players. As a founding contributor, Google Cloud believes that making llm-d open-source and community-led is the best way to make it broadly available and to guarantee a robust community around it.
Goals and Purpose
llm-d aims to simplify inference and reduce its cost by making vLLM fully scalable through Kubernetes-native distributed and disaggregated inference. It takes a modular approach that incorporates the latest distributed-inference optimizations, helping customers operationalize GenAI deployments. The goal is a clear path for anyone to serve large language models at scale, with fast time-to-value and competitive performance per dollar across most models and most hardware accelerators.
llm-d exists because efficient AI inference is becoming a gating factor as the industry moves from prototyping AI solutions to deploying them at scale. Agentic AI workflows and reasoning models bring highly varied demands and a substantial increase in compute, which can slow inference and degrade the user experience. vLLM and other open-source inference engines are an important part of the solution, but there is room for further innovation.
Essential Elements and Advancements
llm-d builds on vLLM's highly efficient inference engine and draws on Google's experience delivering AI services to billions of users. It introduces several key innovations:
- vLLM-Optimized Inference Scheduler: llm-d replaces conventional round-robin load balancing with a vLLM-aware scheduler. Using operational telemetry and filtering and scoring algorithms, the scheduler makes P/D-, KV-cache-, SLA-, and load-aware decisions. Routing requests to instances with low load and prefix-cache hits makes it possible to meet latency SLOs with less hardware, and more advanced teams can plug in their own custom scorers (a simplified scoring sketch follows this list).
- Disaggregated Serving: llm-d supports disaggregated serving to deliver long requests with lower latency and higher throughput. This means running the prefill and decode phases of LLM inference on separate instances, building on vLLM's support for this capability and using high-performance transfer libraries such as NIXL. Plans include latency-optimized deployments over fast interconnects (IB, RDMA, ICI) and throughput-optimized deployments over data-center networking (a conceptual prefill/decode handoff sketch follows this list).
- Disaggregated Prefix Caching with a Multi-tier KV Cache: llm-d implements a multi-tier KV cache for intermediate data (prefixes) to improve response times and reduce storage costs across storage tiers. For a pluggable cache hierarchy, it uses vLLM's KVConnector, which enables offloading KVs to hosts, remote storage, and systems such as LMCache (a two-tier lookup sketch follows this list). Two caching schemes are planned:
- Independent (N/S) caching: offloading to local memory and disk, providing a zero-operational-cost mechanism.
- Shared (E/W) caching: KV transfer between instances and shared storage with global indexing, which may offer higher performance at the cost of operational complexity.
- Variant Autoscaling over Hardware, Workload, and Traffic (Planned): A hardware-, workload-, and traffic-aware autoscaler is planned. To determine the best mix of instances for handling different request types, it will measure the capacity of each instance, derive a load function that accounts for request shapes and QoS, and assess the recent traffic mix. This will enable Horizontal Pod Autoscalers (HPA) to maintain efficiency while meeting SLO targets (a toy load-estimation sketch follows this list).
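To make the scheduler idea concrete, here is a minimal Python sketch of how a load- and prefix-cache-aware scorer might rank vLLM replicas. The names (`Replica`, `score_replica`, the weights) are hypothetical illustrations, not llm-d's actual API; the real scheduler is implemented against the Inference Gateway's Endpoint Picker Protocol.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """Hypothetical view of one vLLM replica, built from scraped telemetry."""
    name: str
    queue_depth: int              # requests waiting on this replica
    kv_cache_utilization: float   # fraction of KV-cache blocks in use (0.0-1.0)
    cached_prefix_tokens: int     # tokens of the incoming prompt already in its prefix cache

def score_replica(r: Replica, prompt_tokens: int,
                  w_prefix: float = 2.0, w_load: float = 1.0) -> float:
    """Higher is better: reward prefix-cache hits, penalize busy replicas."""
    prefix_hit_ratio = r.cached_prefix_tokens / max(prompt_tokens, 1)
    load_penalty = r.queue_depth + r.kv_cache_utilization
    return w_prefix * prefix_hit_ratio - w_load * load_penalty

def pick_replica(replicas: list[Replica], prompt_tokens: int) -> Replica:
    """Filter out saturated replicas, then pick the best-scoring one."""
    candidates = [r for r in replicas if r.kv_cache_utilization < 0.95] or replicas
    return max(candidates, key=lambda r: score_replica(r, prompt_tokens))

if __name__ == "__main__":
    replicas = [
        Replica("vllm-0", queue_depth=4, kv_cache_utilization=0.80, cached_prefix_tokens=0),
        Replica("vllm-1", queue_depth=1, kv_cache_utilization=0.40, cached_prefix_tokens=900),
    ]
    print(pick_replica(replicas, prompt_tokens=1000).name)  # vllm-1: low load and a cache hit
```

Unlike round-robin, a scorer of this shape naturally steers traffic toward replicas that already hold a request's prefix, which is what allows latency SLOs to be met with less hardware.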
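The prefill/decode split can also be sketched conceptually. The sketch below is purely illustrative: it assumes separate prefill and decode pools and an opaque KV handle passed between them, whereas llm-d actually relies on vLLM's disaggregated serving support and transfer libraries such as NIXL to move the KV cache.

```python
import uuid

def run_prefill(prompt: str, prefill_pool: list[str]) -> tuple[str, str]:
    """Run the compute-bound prefill phase on a prefill instance and return a
    handle to the KV cache it produced (a stand-in for a real transfer handle)."""
    instance = prefill_pool[0]                      # chosen by the scheduler in reality
    kv_handle = f"kv://{instance}/{uuid.uuid4()}"   # hypothetical handle format
    return instance, kv_handle

def run_decode(kv_handle: str, decode_pool: list[str], max_tokens: int) -> str:
    """Generate tokens on a decode instance that pulls the KV cache via the handle."""
    instance = decode_pool[0]
    return f"<{max_tokens} tokens decoded on {instance} using {kv_handle}>"

def serve(prompt: str) -> str:
    prefill_pool = ["prefill-0", "prefill-1"]
    decode_pool = ["decode-0", "decode-1", "decode-2"]
    _, kv_handle = run_prefill(prompt, prefill_pool)            # compute-bound phase
    return run_decode(kv_handle, decode_pool, max_tokens=256)   # memory-bandwidth-bound phase

print(serve("Explain disaggregated serving in one sentence."))
```

The point of the split is that the two pools can be sized and placed independently: prefill instances on fast interconnects for latency, decode instances scaled out over data-center networking for throughput.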
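For the multi-tier prefix cache, a minimal sketch of the lookup order might look like the following. The tier classes and method names are hypothetical illustrations of the independent (N/S) scheme (local memory, then a spill tier); they are not vLLM's KVConnector interface or the LMCache API.

```python
from typing import Optional

class MemoryTier:
    """Fast, small: prefix KV blocks kept in host RAM (hypothetical)."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}
    def get(self, prefix_hash: str) -> Optional[bytes]:
        return self._store.get(prefix_hash)
    def put(self, prefix_hash: str, kv_blocks: bytes) -> None:
        self._store[prefix_hash] = kv_blocks

class SpillTier(MemoryTier):
    """Slower, larger: stands in for local disk or remote storage."""
    pass  # same dict-backed store here, purely for illustration

class TieredPrefixCache:
    """Check the fastest tier first; on a hit in a slower tier, promote the entry."""
    def __init__(self) -> None:
        self.tiers = [MemoryTier(), SpillTier()]
    def lookup(self, prefix_hash: str) -> Optional[bytes]:
        for i, tier in enumerate(self.tiers):
            kv = tier.get(prefix_hash)
            if kv is not None:
                for faster in self.tiers[:i]:   # promote into faster tiers
                    faster.put(prefix_hash, kv)
                return kv
        return None                              # miss: recompute the prefill, then insert
    def insert(self, prefix_hash: str, kv_blocks: bytes) -> None:
        for tier in self.tiers:
            tier.put(prefix_hash, kv_blocks)
```

The shared (E/W) scheme would add a global index on top of this, so one instance can fetch KV blocks that another produced, trading operational complexity for higher hit rates.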
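Finally, the planned variant autoscaler's sizing calculation could look roughly like the toy example below. The request-shape buckets, capacity numbers, and ceiling-based sizing rule are invented for illustration; the actual load function is still being designed.

```python
import math

# Hypothetical measured capacity: requests/sec one replica of each variant can
# sustain within its SLO, per request shape (short interactive vs. long agentic).
CAPACITY = {
    "latency-optimized": {"short": 40.0, "long": 4.0},
    "throughput-optimized": {"short": 60.0, "long": 10.0},
}

def replicas_needed(traffic_rps: dict[str, float], variant: str) -> int:
    """Size one variant: sum each shape's fractional replica demand, then round up."""
    load = sum(rps / CAPACITY[variant][shape] for shape, rps in traffic_rps.items())
    return math.ceil(load)

recent_traffic = {"short": 120.0, "long": 8.0}   # derived from the recent traffic mix
for variant in CAPACITY:
    print(variant, "->", replicas_needed(recent_traffic, variant), "replicas")
```

An HPA could then be driven by a metric derived from a load estimate like this, rather than by raw CPU or accelerator utilization, keeping the deployment efficient while still meeting SLOs.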
Integration and Architecture
llm-d uses a tiered design built on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway. It extends the Inference Gateway's pattern for customizable "smart" load balancing via the Endpoint Picker Protocol (EPP) to define its vLLM-optimized scheduling, and it integrates with Inference Gateway's (IGW) Kubernetes operational tooling.
Designed to work with both GPU and TPU accelerators and multiple frameworks (PyTorch today, with JAX planned for later this year), it offers flexibility and choice. Deployed on Google Cloud, llm-d can deliver low-latency, high-performance inference by drawing on Google Cloud's network, GKE AI capabilities, and AI Hypercomputer integrations.
Advantages and Performance
llm-d integrates modern distributed serving techniques into an easily deployable Kubernetes stack. Google Cloud's initial tests of llm-d have shown two-fold improvements in time-to-first-token for use cases such as code completion, enabling more responsive applications. It targets competitive performance per dollar for the majority of models.
Community and Development
llm-d is a community-driven, open-development project under the Apache 2.0 license. It is organized as a metaproject of subcomponent repositories such as llm-d-deployer, llm-d-inference-scheduler, and llm-d-kv-cache-manager. Users can install llm-d on Kubernetes as a complete solution with a single Helm chart (roughly as sketched below), or clone the components individually. Participation channels include a Google Group, weekly standup sessions, and a Slack channel.
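As a rough illustration of the single-chart install path, a small wrapper like the one below could drive Helm from Python. The chart repository URL and chart name are placeholders, not verified values, so consult the llm-d-deployer documentation for the real commands.

```python
import subprocess

# Placeholders only: substitute the chart repository and chart name documented
# in the llm-d-deployer repo; these values are assumptions for illustration.
CHART_REPO_NAME = "llm-d"
CHART_REPO_URL = "https://example.invalid/llm-d-charts"   # replace with the real repo URL
CHART = f"{CHART_REPO_NAME}/llm-d"                         # replace with the real chart name

def run(cmd: list[str]) -> None:
    """Echo and run a command, failing loudly if any step breaks."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def install_llm_d(namespace: str = "llm-d") -> None:
    run(["helm", "repo", "add", CHART_REPO_NAME, CHART_REPO_URL])
    run(["helm", "repo", "update"])
    run(["helm", "install", "llm-d", CHART,
         "--namespace", namespace, "--create-namespace"])

if __name__ == "__main__":
    install_llm_d()
```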
In conclusion, llm-d is a new open-source framework that addresses the growing challenges of AI inference at scale, building on Kubernetes, vLLM, and Inference Gateway to offer a scalable, high-performance, and cost-effective way to serve large language models in distributed environments.