IBM, UIUC Use QLM and Chiron to Improve Batch Processing

IBM and UIUC have created two orchestration systems, QLM and Chiron, to better serve LLMs.

Large language models (LLMs) such as IBM Granite, Google Gemini, OpenAI GPT-4, and Meta Llama have brought new capabilities to numerous AI applications, including chatbots and coding assistants. These foundation models are further fine-tuned for specialized tasks such as copywriting, financial planning, code development, and document summarization.

Serving many models for different applications under latency-oriented service-level objectives (SLOs) has become increasingly important for meeting enterprise and consumer demands. Early work in this area focused on serving interactive queries, such as chatbots, with strict latency SLOs on the order of seconds.

The recent expansion into a far wider variety of enterprise use cases also requires serving batch requests with relaxed SLOs in the range of minutes to hours. Depending on multiplexing, arrival rates, and configuration parameters, these SLOs can still be violated, which calls for an orchestration strategy that combines suitable autoscaling, routing, and queue management. A team at IBM Research, working with academics from the University of Illinois Urbana-Champaign, has been developing two new systems, QLM and Chiron, to address this pressing need.

How are SLOs for latency defined?

LLM inference latency is measured using two primary metrics. The time needed to finish the prefill stage and produce the first token is known as time to first token (TTFT). The time needed to generate each new token during the decode phase is known as inter-token latency (ITL). Together, these two latency criteria make up a request’s SLO.
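
For illustration, the short Python sketch below computes TTFT and ITL from per-token emission timestamps and checks them against hypothetical SLO thresholds. The function names, timestamps, and threshold values are illustrative assumptions, not taken from QLM or Chiron.

```python
# Minimal sketch (not from QLM or Chiron): compute TTFT and ITL for one request
# from per-token emission timestamps, then check them against hypothetical SLO
# thresholds. All timestamps and threshold values are illustrative assumptions.
def latency_metrics(arrival_s: float, token_times_s: list[float]) -> tuple[float, float]:
    """Return (TTFT, mean ITL) in seconds for a single request."""
    ttft = token_times_s[0] - arrival_s                               # prefill + first token
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]  # decode-phase gaps
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# A request that arrived at t = 0.0 s and produced four tokens.
ttft, itl = latency_metrics(0.0, [0.45, 0.50, 0.56, 0.61])

# Hypothetical SLOs: interactive requests need sub-second TTFT; batch requests tolerate far more.
SLO = {"interactive": {"ttft": 1.0, "itl": 0.1}, "batch": {"ttft": 600.0, "itl": 1.0}}
meets = ttft <= SLO["interactive"]["ttft"] and itl <= SLO["interactive"]["itl"]
print(f"TTFT={ttft:.2f}s  ITL={itl:.3f}s  meets interactive SLO: {meets}")
```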

An overview of QLM and Chiron

Depending on the particular deployment use case, IBM offers two versions of the system: Chiron and QLM (short for “Queue Management for SLO-Oriented Large Language Model Serving”). Chiron is designed for scenarios where resource autoscaling allows additional instances to be added, while QLM applies when the deployment runs on fixed capacity.

Chiron

Chiron uses a hierarchical design to maximise throughput at two levels while meeting TTFT and ITL SLOs: a global orchestrator scales instances and orders requests across interactive, mixed, and batch instances, while a local autoscaler adjusts the batch size of each individual instance.

Routing in Chiron is non-uniform: each request is preferentially routed to its own instance type (batch requests to batch instances, interactive requests to interactive instances). If capacity is not available on the preferred instance type, requests are redirected to the mixed instances. Besides increasing overall cluster utilisation, mixed instances allow multiplexing between interactive and batch queries: they absorb erratic surges in interactive request arrivals, and when there are not enough interactive requests, they provide additional running capacity for batch requests.
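
A minimal sketch of this routing policy is shown below, assuming simplified instance pools with a fixed admission capacity; the class, pool, and function names are illustrative, not Chiron’s actual API.

```python
# Hedged sketch of the routing policy described above: prefer the request's
# dedicated instance type, spill over to mixed instances when that pool is
# full, otherwise keep the request waiting in the global queue.
from dataclasses import dataclass

@dataclass
class InstancePool:
    name: str
    capacity: int          # requests the pool can admit right now (simplified)
    admitted: int = 0

    def has_room(self) -> bool:
        return self.admitted < self.capacity

    def admit(self, request_id: str) -> None:
        self.admitted += 1
        print(f"{request_id} -> {self.name} instances")

pools = {
    "interactive": InstancePool("interactive", capacity=2),
    "batch": InstancePool("batch", capacity=1),
    "mixed": InstancePool("mixed", capacity=4),
}

def route(request_id: str, kind: str) -> None:
    """Prefer the dedicated pool for `kind`; fall back to the mixed pool."""
    preferred = pools[kind]
    if preferred.has_room():
        preferred.admit(request_id)
    elif pools["mixed"].has_room():
        pools["mixed"].admit(request_id)      # multiplexed onto mixed instances
    else:
        print(f"{request_id} waits in the global queue")

for i in range(4):
    route(f"chat-{i}", "interactive")         # surge of interactive requests
route("summary-0", "batch")
```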

Mixed instances are preemptible, which allows this multiplexing between interactive and batch requests while guaranteeing that interactive requests run immediately. Batch requests may therefore be evicted by interactive requests and returned to the global queue. To avoid a throughput loss from such an eviction, Chiron enables fast restart: it moves the evicted request’s KV cache to CPU memory so that the cached state is preserved.
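
The sketch below illustrates this preemption-and-fast-restart idea under heavy simplification: an interactive arrival evicts a batch request, whose KV cache handle is copied to a CPU-side store so the request can resume without redoing prefill. All names and structures are assumptions for illustration, not Chiron’s implementation.

```python
# Minimal sketch (assumed names, not Chiron's code) of preemption on a mixed
# instance: an arriving interactive request evicts a running batch request,
# whose KV cache is offloaded to CPU memory so it can restart quickly later.
running_batch = [{"id": "batch-7", "kv_cache": "gpu-slot-0"}]  # on the mixed instance
global_queue = []                                              # cluster-wide request queue
cpu_kv_store = {}                                              # host-memory copies of KV caches

def preempt_for(interactive_id: str) -> None:
    if running_batch:
        victim = running_batch.pop()
        cpu_kv_store[victim["id"]] = victim["kv_cache"]   # preserve KV cache on the CPU
        global_queue.append(victim)                       # return request to the global queue
    print(f"running {interactive_id} immediately")

def fast_restart(batch_request: dict) -> None:
    # Reload the preserved KV cache instead of recomputing the prefill stage.
    kv = cpu_kv_store.pop(batch_request["id"], None)
    print(f"resuming {batch_request['id']} from cached state {kv}")

preempt_for("chat-42")
fast_restart(global_queue.pop())
```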

The global autoscaler is driven by an estimate of the request queue’s waiting time. Because of the statistical effect of continuous batching, Chiron can apply a tighter waiting-time bound as the queue size grows.
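
A simplified sketch of a waiting-time-driven scaling decision follows. The estimate used here (queued requests divided by aggregate throughput) and every number in it are illustrative assumptions; Chiron’s actual estimator, which accounts for continuous batching, is more sophisticated.

```python
# Simplified sketch: decide how many instances are needed so that the estimated
# time to drain the batch queue fits within the remaining SLO deadline.
import math

def estimate_wait_s(queued: int, per_instance_rps: float, instances: int) -> float:
    return queued / (per_instance_rps * instances)

def instances_needed(queued: int, deadline_s: float, per_instance_rps: float) -> int:
    # Smallest instance count whose estimated drain time fits the deadline.
    return math.ceil(queued / (per_instance_rps * deadline_s))

queued = 200_000          # pending batch requests (illustrative)
deadline_s = 15 * 60      # assumed time left before the batch SLO deadline
per_instance_rps = 20.0   # assumed sustained throughput per instance
current_instances = 10

wait_min = estimate_wait_s(queued, per_instance_rps, current_instances) / 60
target = instances_needed(queued, deadline_s, per_instance_rps)
print(f"estimated wait with {current_instances} instances: {wait_min:.1f} min; "
      f"scale to {target} instances to finish within {deadline_s / 60:.0f} min")
```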

QLM

The second system, QLM, is intended for fixed-capacity deployments. In addition to Chiron’s routing and eviction, QLM uses model swapping to share different models within the same serving instance.
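
As a rough sketch of the model-swapping idea, the code below loads a model onto an instance only when the next request needs a different one. Class and method names are hypothetical; QLM’s real implementation manages GPU weights and KV caches rather than printing messages.

```python
# Rough sketch of model swapping on a fixed-capacity serving instance: when the
# next request needs a model that is not resident, the instance swaps the
# current model out and loads the required one.
class ServingInstance:
    def __init__(self) -> None:
        self.loaded_model = None

    def ensure_model(self, model_name: str) -> None:
        if self.loaded_model != model_name:
            if self.loaded_model is not None:
                print(f"swapping out {self.loaded_model}")   # free GPU memory
            print(f"loading {model_name}")
            self.loaded_model = model_name

    def serve(self, request: dict) -> None:
        self.ensure_model(request["model"])
        print(f"serving {request['id']} with {self.loaded_model}")

instance = ServingInstance()
instance.serve({"id": "r1", "model": "granite-8b"})
instance.serve({"id": "r2", "model": "llama-70b"})   # triggers a model swap
instance.serve({"id": "r3", "model": "llama-70b"})   # no swap needed
```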

Incoming requests are grouped with requests that share similar performance attributes (such as model type, SLO value, and token distribution) to form request groups, which are a useful abstraction for estimating waiting times. Each request group is then assigned to a virtual queue, which acts as a waiting line for an LLM serving instance in the cluster. The order of the request groups in a virtual queue determines the sequence in which their requests are executed on the associated LLM serving instance. Although requests are assigned to groups in a first-come, first-served fashion, the global scheduler reorders the groups within a virtual queue to maximise SLO attainment across all requests being served.
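
The sketch below illustrates these concepts under stated assumptions: requests are grouped by (model type, SLO value), with token distribution omitted; groups are placed in a per-instance virtual queue; and the queue is reordered by a simple tightest-SLO-first policy chosen only for illustration, not QLM’s actual scheduling algorithm.

```python
# Sketch of request groups and virtual queues (illustrative names and policy,
# not QLM's actual data structures or scheduler).
from collections import defaultdict

requests = [
    {"id": "r1", "model": "granite-8b", "slo_s": 5},
    {"id": "r2", "model": "granite-8b", "slo_s": 5},
    {"id": "r3", "model": "granite-8b", "slo_s": 3600},
    {"id": "r4", "model": "llama-70b",  "slo_s": 3600},
]

# Requests enter their group in first-come, first-served order.
groups = defaultdict(list)
for r in requests:
    groups[(r["model"], r["slo_s"])].append(r)

# One virtual queue per serving instance; here a single instance serves granite-8b.
virtual_queue = [group for key, group in groups.items() if key[0] == "granite-8b"]

# The global scheduler reorders whole request groups, not individual requests.
virtual_queue.sort(key=lambda group: group[0]["slo_s"])

for group in virtual_queue:
    print([r["id"] for r in group])   # execution order on the associated instance
```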

An example Chiron workflow compared against Llumnix

The graphic below shows an example Chiron workflow alongside a comparison with the state-of-the-art LLM orchestration system Llumnix. The workload initially consists solely of interactive requests arriving according to a Gamma distribution with a mean of 30 requests per second and a CV of 4. In this scenario both Chiron and Llumnix are over-provisioned, with an average of 15 GPUs. Note that the comparison uses an optimised version of Llumnix whose instance-level throughput is comparable to Chiron’s.

Five minutes in, one million requests are added to the batch request queue. Because Llumnix does not allow these batch requests to wait, it quickly begins adding instances until the maximum cluster capacity of 50 instances is reached. Chiron, in contrast, keeps the batch requests in the queue and multiplexes them onto the over-provisioned capacity of 10 GPUs (out of 15).

Because batch requests have a relaxed ITL SLO, Chiron’s local autoscaler can sustain a higher throughput of 20 requests per second on this over-provisioned capacity. At 50 minutes, Chiron’s waiting-time estimator indicates that roughly 200,000 requests are still pending, so 10 more instances are added to drain the queue before the deadline, and Chiron completes all requests at 65 minutes. Llumnix, which does not adjust the batch size for its newly added instances, continues to process requests at a lower throughput; as a result, only 50% of the requests submitted via Llumnix meet their SLOs before the 65-minute mark. Overall, Chiron meets all SLOs in this scenario while using 60% fewer GPU node hours.

The graphic below illustrates how QLM and Chiron’s advantages in multiplexing, dynamic batch sizes, and model swapping translate into lower serving costs. The workload, sampled from the ShareGPT dataset, is split evenly between batch and interactive queries.

Small-model and large-model serving cost comparison