Wednesday, April 2, 2025

vLLM V1 Engine Strengthens LLM Serving On Intel GPUs

Use the vLLM V1 engine to maximise LLM serving performance on Intel GPUs.

vLLM is a fast and easy-to-use library for LLM inference and serving. It has grown into a community-driven project that incorporates contributions from both industry and academia. As one of the community contributors, Intel is actively working to enable vLLM on Intel platforms, including Intel Xeon Scalable Processors, Intel discrete GPUs, and Intel Gaudi AI accelerators. This blog gives you the information you need to keep your workloads running smoothly on your Intel graphics cards, with the current focus on Intel discrete GPUs.

What Does It Support?

The vLLM V1 engine brings the following improvements to Intel GPUs:

  • Optimised API server & execution loop
  • Simple & flexible scheduling
  • Zero-overhead prefix caching
  • Clean tensor-parallel inference architecture
  • Efficient input preparation
  • Enhanced support for multimodal LLMs

Additionally, chunked_prefill, a vLLM optimisation feature that allows large prefill requests to be split into smaller chunks and batched together with decode requests, is enabled. By mixing compute-bound (prefill) and memory-bound (decode) requests in a single batch, this approach improves inter-token latency (ITL) and GPU utilisation while giving priority to decode requests. This functionality serves as the foundation of the vLLM V1 engine, which is now supported on Intel GPUs by using the corresponding kernels from the Intel Extension for PyTorch for model execution. A minimal offline-inference sketch with chunked prefill is shown below.
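
As a concrete illustration, the sketch below runs offline inference through vLLM's Python API with chunked prefill switched on explicitly. The model and token limits are example choices, and the parameter names follow vLLM's EngineArgs; check them against the vLLM version shipped in the Intel docker image before relying on them.

# Illustrative sketch: offline inference with chunked prefill enabled via
# vLLM's Python API. The values below are example choices, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TechxGenus/Meta-Llama-3-8B-GPTQ",
    dtype="float16",
    enforce_eager=True,            # mirrors --enforce-eager in the serving command later in this blog
    enable_chunked_prefill=True,   # split large prefills into smaller chunks
    max_num_batched_tokens=2048,   # cap on tokens batched together per step
)

sampling = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Explain chunked prefill in one short paragraph."], sampling)
for output in outputs:
    print(output.outputs[0].text)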

The following features will be supported in a future release.

  • Spec decode: Speculative decoding in vLLM is a technique intended to reduce inter-token latency during LLM inference by using a small, fast draft model to predict future tokens.
  • Sliding window: Sliding window attention is a technique used in large language models to manage memory usage efficiently by restricting the attention context to a fixed window size. It lets the model focus on the most recent tokens while discarding older ones, which is especially helpful for handling long sequences without exceeding memory limits.
  • FP8 KV cache: Intel will use kernels from the Intel Extension for PyTorch to support the FP8 KV cache. Storing keys and values in FP8 essentially doubles the space available for KV cache allocation, allowing more tokens to be kept in the cache. This extra capacity improves throughput by supporting longer context lengths for individual requests or larger batches of concurrent requests (see the short capacity calculation after this list).
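
To see why FP8 roughly doubles KV cache capacity, the back-of-the-envelope calculation below compares how many tokens fit in a fixed KV cache budget at FP16 versus FP8. The model dimensions (32 layers, 8 KV heads, head size 128, roughly Llama-3-8B with GQA) and the 8 GiB budget are illustrative assumptions, not measured values.

# Rough KV cache capacity estimate: FP16 (2 bytes/element) vs FP8 (1 byte/element).
# All dimensions below are illustrative assumptions, not measured values.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_SIZE = 128
KV_BUDGET_BYTES = 8 * 1024**3  # assume 8 GiB reserved for the KV cache

def tokens_that_fit(bytes_per_element: int) -> int:
    # Each token stores one key and one value vector per layer per KV head.
    bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_SIZE * bytes_per_element
    return KV_BUDGET_BYTES // bytes_per_token

print("FP16 tokens:", tokens_that_fit(2))  # ~65K tokens
print("FP8 tokens: ", tokens_that_fit(1))  # ~131K tokens, about twice as many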

Intel-verified models are listed in the table below; nevertheless, vLLM should support a broader range of models running on Intel GPUs.

Model Type       | Model
Text-generation  | hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
Text-generation  | TheBloke/deepseek-llm-7B-chat-GPTQ
Text-generation  | TheBloke/Mistral-7B-v0.1-GPTQ
Text-generation  | shuyuej/Phi-3-mini-128k-instruct-GPTQ
Text-generation  | Qwen/Qwen2-7B-Instruct-GPTQ-Int4

Limitations

Certain vLLM V1 features, such as torch.compile support, LoRA, pipeline parallelism on Ray, structured outputs, EP/TP MoE, DP Attention, prefix prefill, and MLA-related features, may still require further enablement on Intel GPUs.

Intel intends to address the following known issues in upcoming releases:

  • The TheBloke/baichuan-7B-GPTQ model fails with an AttributeError related to the vocab_size attribute of BaiChuanTokenizer.
  • The ranchlai/chatglm3-6B-gptq-4bit model fails with errors indicating that the Transformers implementation is not compatible with vLLM and that ChatGLMForConditionalGeneration has no vLLM implementation.
  • The total length of the input and output tokens must be less than the model's --max_position_embeddings value, otherwise the request fails with: ValueError: This model's maximum context length is xxxx tokens. However, you requested xxxx tokens (xxxx in the messages, xxx in the completion). Please reduce the length of the messages or completion. One client-side way to stay within the limit is sketched after this list.
  • The lm-eval accuracy result for the Qwen_Qwen2-7B-Instruct-GPTQ-Int4 and jakiAJK_DeepSeek-R1-Distill-Qwen-7B_GPTQ-int4 models is 0.
  • The run-lm-eval-gsm-vllm-baseline.sh script in the docker image referenced in this blog does not support accuracy testing.
  • Warning messages such as "Pin memory is not supported on XPU" may appear when you use the docker image described in this blog. They are printed in error and can be ignored.
  • Memory usage for AWQ models is greater than the model size. For example, the casperhansen/llama-3-8b-instruct-awq model is 5.74 GB but used 8.6 GB of memory.
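
For the context-length issue above, one client-side workaround is to count the prompt tokens yourself and cap the requested completion length so that prompt plus completion stays under the model's limit. The sketch below is illustrative only: it assumes the Hugging Face tokenizer matches the served model and uses an example limit of 8192 tokens, so substitute your model's actual max_position_embeddings value.

# Illustrative sketch: cap max_tokens so prompt + completion stays within
# an assumed maximum context length of 8192 tokens.
from transformers import AutoTokenizer

MODEL_ID = "TechxGenus/Meta-Llama-3-8B-GPTQ"
MAX_MODEL_LEN = 8192  # assumed limit; use the model's real max_position_embeddings

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def safe_max_tokens(prompt: str, requested_max_tokens: int) -> int:
    # Count prompt tokens and leave the remaining budget for the completion.
    prompt_len = len(tokenizer.encode(prompt))
    budget = MAX_MODEL_LEN - prompt_len - 1  # small safety margin
    return max(0, min(requested_max_tokens, budget))

prompt = "Summarise the benefits of chunked prefill."
print("max_tokens to request:", safe_max_tokens(prompt, 1024))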

How to Get Started

Prerequisite

OS            | Hardware
Ubuntu 24.10  | Intel Arc B580
Ubuntu 22.04  | Intel Data Center GPU Max Series

Prepare a Serving Environment

  • Follow the Intel GPU driver installation instructions to install the driver packages.
  • Use the command docker pull intel/vllm:xpu to obtain the released docker image.
  • Create a docker container with the command docker run -t -d --shm-size 10g --name=vllm-test --net=host --ipc=host --privileged --device /dev/dri:/dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint= intel/vllm:xpu /bin/bash

While this container runs in the background as a daemon, you will need to set up two separate shell sessions inside it: one for a vLLM server and one for a client that sends requests.

Enter the server and client container environments, respectively, by running the command docker exec -it vllm-test bash in two different terminals.

From this point forward, unless otherwise specified, all commands are assumed to be executed inside the docker container.

Next, you may want to set a HUGGING_FACE_HUB_TOKEN environment variable in both environments so that the required files can be downloaded from the Hugging Face Hub.

export HUGGING_FACE_HUB_TOKEN=xxxxxx

 Launch Workloads

 Launch Server in the Server Environment

Command:

VLLM_USE_V1=1 W_LONG_MAX_MODEL_LEN=1 VLLM_WORKER_MULTIPROC_METHOD=spawn  python3 -m vllm.entrypoints.openai.api_server --model TechxGenus/Meta-Llama-3-8B-GPTQ --dtype=float16 --device=xpu --enforce-eager --port 8000  --block-size 32 --gpu-memory-util 0.85 --trust-remote-code --disable-sliding-window

Raise Requests for Benchmarking in the Client Environment

For performance benchmarking, Intel uses a benchmarking script that ships with vLLM. You can also use your own client scripts.

To send serving requests, use the following command:

python3 benchmarks/benchmark_serving.py --model TechxGenus/Meta-Llama-3-8B-GPTQ --dataset-name random --random-input-len=1024 --random-output-len=1024 --ignore-eos --num-prompt 1 --max-concurrency 16 --request-rate inf --backend vllm --port=8000 --host 0.0.0.0
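
You can also verify the server end to end by sending a single request before (or after) running the benchmark. The snippet below is a minimal sketch that uses the openai Python client against the OpenAI-compatible endpoint started above; the model name matches the serving command, while the prompt and sampling values are arbitrary, and it assumes the openai package is installed in the client environment.

# Minimal sanity check against the OpenAI-compatible vLLM server started above.
# Assumes the `openai` Python package is available in the client environment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real API key by default
)

completion = client.completions.create(
    model="TechxGenus/Meta-Llama-3-8B-GPTQ",
    prompt="Intel discrete GPUs are",
    max_tokens=64,
    temperature=0.8,
)
print(completion.choices[0].text)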

 Performance

Intel tested vLLM V1 performance using the docker container environments and commands described above on a system equipped with an Intel Core Ultra 5 245KF CPU and an Intel Arc B580 discrete graphics card. With this benchmarking configuration, throughput increased as more concurrent prompts were sent to the vLLM server, peaking and stabilising at 16 concurrent requests as the machine's hardware resources became saturated.

If necessary, you can further tune performance to meet your needs by consulting the vLLM optimisation and tuning guide and/or the vLLM environment variables.

Note: For consistency, the performance figures shown in the figure below come from runs that did not use the prefix cache feature. The prefix cache feature introduces cache utilisation optimisations, but because benchmarking runs use random inputs, performance could differ slightly between runs.

Throughput with vLLM v1 engine on Intel Arc B580
Image credit to Intel

Need Assistance?

If you run into any problems or have any questions, please open an issue at vLLM GitHub Issues. To make sure it is seen, include the text [Intel GPU] in the issue title.
