Enabling SqueezeLLM for Efficient LLM Inference on Intel Data Center GPU Max Series Using SYCLomatic for CUDA-to-SYCL Migration
In brief
Researchers at the University of California, Berkeley, have developed a new quantization method called SqueezeLLM, which enables accurate and efficient generative LLM inference. Cross-platform support, however, requires a separate kernel implementation for each target and therefore additional implementation effort.
Using the SYCLomatic tool from the Intel oneAPI Base Toolkit for CUDA-to-SYCL migration, they immediately achieved a 2.0x speedup on Intel Data Center GPUs with 4-bit quantization, without manual tuning. As a result, cross-platform support can be provided with minimal additional engineering effort to adapt the kernel implementations to different hardware back ends.
SqueezeLLM: Accurate and Efficient Low-Precision Quantization for LLM Inference
LLM inference is becoming a common workload because it enables so many applications. However, it is resource-intensive and requires high-end hardware. In addition, whereas earlier machine learning workloads were mostly compute-bound, generative LLM inference must produce output tokens sequentially, so it suffers from low data reuse and is limited by memory bandwidth rather than compute. Low-precision quantization is one way to reduce latency and memory use, but it is difficult to quantize LLMs to low precision (less than 4 bits, for example) without an unacceptable loss of accuracy.
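To make the low-data-reuse argument concrete, the following sketch counts the weight bytes read and the floating-point operations performed by a single matrix-vector product at two weight precisions. The 4096 x 4096 layer size is an illustrative assumption, and the estimate deliberately ignores activation, codebook, and cache traffic, so it is a back-of-the-envelope model rather than a measurement.

```python
def matvec_traffic(rows: int, cols: int, bits_per_weight: int) -> dict:
    """Weight bytes read and FLOPs for one matrix-vector product y = W @ x."""
    weight_bytes = rows * cols * bits_per_weight / 8
    flops = 2 * rows * cols  # one multiply and one add per weight
    return {
        "weight_bytes": weight_bytes,
        "flops": flops,
        "flops_per_byte": flops / weight_bytes,
    }


# Illustrative layer size (assumption, not taken from the article).
fp16 = matvec_traffic(4096, 4096, bits_per_weight=16)
int4 = matvec_traffic(4096, 4096, bits_per_weight=4)

print(f"FP16 : {fp16['weight_bytes'] / 2**20:.0f} MiB, {fp16['flops_per_byte']:.1f} FLOP/byte")
print(f"4-bit: {int4['weight_bytes'] / 2**20:.0f} MiB, {int4['flops_per_byte']:.1f} FLOP/byte")
```

Each weight is read once and used for only two operations, so shrinking the weights from 16 to 4 bits cuts the memory traffic per generated token by 4x while the arithmetic stays the same.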
To enable accurate and efficient low-precision quantization, UC Berkeley researchers developed SqueezeLLM. It incorporates two key advances that address shortcomings of earlier approaches. First, it uses sensitivity-weighted non-uniform quantization, in which parameter sensitivity determines where the quantization codebook values are placed, thereby preserving model accuracy.
This addresses the inefficient representation of the underlying parameter distribution that results from the constraints of uniform quantization. Second, SqueezeLLM uses dense-and-sparse quantization, which handles very large outliers in LLM parameters by storing them in a compact sparse format so that the remaining parameters can be quantized to low precision.
SqueezeLLM uses non-uniform quantization to represent the LLM weights as well as possible at reduced precision. When generating the non-uniform codebooks, the method considers not only the magnitude of the values but also the sensitivity of each parameter to error, which yields high accuracy for low-precision quantization.
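One way to picture sensitivity-weighted codebook generation is as a weighted one-dimensional k-means, in which each weight pulls its centroid in proportion to its sensitivity. The sketch below assumes that formulation; the function names are hypothetical and the sensitivity values are random placeholders rather than sensitivities derived from a real model.

```python
import numpy as np


def nearest(values, codebook):
    """Index of the closest codebook entry for every value."""
    return np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)


def weighted_error(values, sens, codebook):
    """Sensitivity-weighted squared quantization error."""
    idx = nearest(values, codebook)
    return float((sens * (values - codebook[idx]) ** 2).sum())


def weighted_kmeans_1d(values, sens, n_clusters=16, n_iters=25):
    """Fit a non-uniform codebook with sensitivity-weighted 1-D k-means."""
    values = np.asarray(values, dtype=np.float64)
    sens = np.asarray(sens, dtype=np.float64)
    # Start from evenly spaced quantiles of the weight distribution.
    codebook = np.quantile(values, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iters):
        idx = nearest(values, codebook)
        for k in range(n_clusters):
            mask = idx == k
            total = sens[mask].sum()
            if total > 0:  # leave empty or zero-sensitivity clusters unchanged
                codebook[k] = (sens[mask] * values[mask]).sum() / total
    return codebook, nearest(values, codebook)


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096)    # stand-in for one weight channel
sens = rng.uniform(0.0, 1.0, size=4096) ** 4  # hypothetical sensitivities

codebook, idx = weighted_kmeans_1d(weights, sens)  # 16 entries = 4-bit indices
uniform = np.linspace(weights.min(), weights.max(), 16)

print("weighted error, non-uniform codebook:", weighted_error(weights, sens, codebook))
print("weighted error, uniform grid        :", weighted_error(weights, sens, uniform))
```

With 16 centroids, each weight is stored as a 4-bit index into the codebook. For bell-shaped weight distributions the learned centroids usually cluster where the sensitive weights are, which a uniform grid cannot do.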
SqueezeLLM uses dense-and-sparse quantization to store a small fraction of outlier values at higher precision. This reduces the range that the remaining dense component must represent, which enables accurate low-precision quantization of the dense matrix.
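The decomposition can be sketched in a few lines of NumPy. The example below assumes a simple magnitude threshold for choosing outliers and stores them as coordinate (row, column, value) triplets; the 0.5% outlier fraction, the function names, and the storage layout are illustrative assumptions, not the format used by the SqueezeLLM kernels.

```python
import numpy as np


def dense_sparse_split(W, outlier_fraction=0.005):
    """Split W into a dense part (outliers zeroed) and sparse outlier triplets."""
    W = np.asarray(W, dtype=np.float32)
    n_outliers = max(1, int(round(outlier_fraction * W.size)))
    # Magnitude of the n-th largest |w| serves as the outlier threshold.
    threshold = np.partition(np.abs(W).ravel(), -n_outliers)[-n_outliers]
    mask = np.abs(W) >= threshold
    rows, cols = np.nonzero(mask)
    sparse = (rows, cols, W[mask])                   # kept at full precision
    dense = np.where(mask, 0.0, W).astype(np.float32)  # to be quantized
    return dense, sparse


def reconstruct(dense, sparse):
    """Recombine the dense part with the sparse outliers."""
    rows, cols, vals = sparse
    out = dense.copy()
    out[rows, cols] += vals
    return out


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
W[3, 7], W[40, 12] = 1.5, -2.0  # inject two extreme outliers

dense, sparse = dense_sparse_split(W)
print("range before split:", float(np.abs(W).max()))
print("range of dense part:", float(np.abs(dense).max()))
print("outliers stored sparsely:", len(sparse[2]))
```

Because the dense part covers a much narrower range after the split, a fixed number of quantization levels can be spaced more finely over it.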
The Challenge: Providing Cross-Platform Support for Low-Precision LLM Quantization
The SqueezeLLM method provides a considerable latency reduction compared with baseline FP16 inference, along with efficient and accurate low-precision LLM quantization that reduces memory use during inference. The team's goal was to make these methods available across platforms so that they could also improve LLM inference on systems such as Intel Data Center GPUs.
However, SqueezeLLM depends on handcrafted custom kernel implementations, which use dense-and-sparse quantization to handle the outlier problem in LLM inference and non-uniform quantization to represent the weights accurately with very few bits per parameter.
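To illustrate what such a kernel has to compute, here is a NumPy reference for a 4-bit lookup-table matrix-vector product. It is a sketch under stated assumptions: eight 4-bit indices are packed into each 32-bit word along the input dimension, and each output channel has its own 16-entry codebook. The packing layout and the function names are hypothetical and are not taken from the SqueezeLLM source.

```python
import numpy as np


def pack_4bit(indices):
    """Pack 4-bit indices of shape [in_features, out_features], eight per uint32."""
    idx = np.asarray(indices, dtype=np.uint32)
    assert idx.shape[0] % 8 == 0 and idx.max() < 16
    packed = np.zeros((idx.shape[0] // 8, idx.shape[1]), dtype=np.uint32)
    for j in range(8):
        packed |= idx[j::8] << np.uint32(4 * j)  # nibble j holds input row 8*r + j
    return packed


def lut_matvec(x, packed, codebook):
    """Compute y[o] = sum_i x[i] * codebook[o, index[i, o]] from packed indices."""
    n_in = packed.shape[0] * 8
    cols = np.arange(packed.shape[1])
    y = np.zeros(packed.shape[1], dtype=np.float32)
    for j in range(8):
        # Extract nibble j, then dequantize through the per-channel lookup table.
        idx = ((packed >> np.uint32(4 * j)) & np.uint32(0xF)).astype(np.intp)
        w = codebook[cols[None, :], idx]
        y += x[j:n_in:8] @ w
    return y


rng = np.random.default_rng(0)
indices = rng.integers(0, 16, size=(64, 8))
codebook = rng.normal(size=(8, 16)).astype(np.float32)  # one table per output channel
x = rng.normal(size=64).astype(np.float32)

y = lut_matvec(x, pack_4bit(indices), codebook)
reference = x @ codebook[np.arange(8)[None, :], indices]  # dequantize, then multiply
print(np.allclose(y, reference, atol=1e-4))
```

A GPU kernel performs the same unpack, lookup, and accumulate steps in parallel across output channels, which is why it must be written and tuned per platform.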
Although these kernel implementations are relatively simple, manually converting and optimizing them for each target hardware architecture is still undesirable. Because the kernels were originally written in CUDA, and building, profiling, and optimizing them took weeks, the team expected a large overhead in porting the SqueezeLLM kernels to Intel Data Center GPUs.
To target Intel Data Center GPUs, they therefore needed a way to migrate their CUDA kernels to SYCL quickly and easily. This requires converting the kernels with minimal manual effort and easily adapting the Python-level code to use the custom kernels, so that the rest of the inference pipeline is not disrupted. They also wanted the ported kernels to be as efficient as possible so that Intel customers could benefit fully from SqueezeLLM's efficiency.
SYCLomatic
SYCLomatic offers a way to provide cross-platform support with little additional engineering effort. Its CUDA-to-SYCL code conversion decouples the efficient kernel methods from the target deployment platform, which makes inference possible on multiple target architectures with minimal extra work.
Their performance analysis shows that the SYCLomatic-ported kernels achieve a 2.0x speedup on Intel Data Center GPUs running the Llama 7B model, delivering the efficiency gain immediately and without manual tuning.
CUDA to SYCL
Solution: SYCLomatic-Powered CUDA-to-SYCL Migration for Quantized LLMs on Multiple Platforms
Initial Conversion
The SYCLomatic conversion was carried out in a development environment with the Intel oneAPI Base Toolkit installed. The kernel was migrated to SYCL with the SYCLomatic conversion command dpct quant_cuda_kernel.cu. The team reports that the conversion script automatically produced correct kernel definitions and modified the kernel implementations as needed. SYCL-compatible code was added to the kernel implementation and invocations without manual intervention.
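As a rough sketch of how this migration step could be scripted, the Python helper below builds the dpct command line and derives the expected output file name. The helper names are hypothetical. The sketch assumes dpct's --out-root option, its default dpct_output directory, and its convention of writing migrated .cu files with a .dp.cpp suffix, which matches the quant_cuda_kernel.dp.cpp file name in the modified setup script. The team's actual invocation may have used different options.

```python
import shutil
import subprocess
from pathlib import Path


def build_dpct_command(source, out_root="dpct_output"):
    """Command line for migrating one CUDA source file with dpct."""
    return ["dpct", f"--out-root={out_root}", str(source)]


def migrated_path(source, out_root="dpct_output"):
    """Expected location of the migrated file (assumed .dp.cpp naming)."""
    return Path(out_root) / (Path(source).stem + ".dp.cpp")


def migrate(source, out_root="dpct_output"):
    """Run dpct if it is on PATH and the source exists; otherwise do nothing."""
    if shutil.which("dpct") is None or not Path(source).exists():
        return None
    subprocess.run(build_dpct_command(source, out_root), check=True)
    return migrated_path(source, out_root)


if __name__ == "__main__":
    print(" ".join(build_dpct_command("quant_cuda_kernel.cu")))
    print("expected output:", migrated_path("quant_cuda_kernel.cu"))
    print("migrated file:", migrate("quant_cuda_kernel.cu"))
```

Wrapping the command this way keeps the migration repeatable when the CUDA kernels change, since the SYCL file can be regenerated rather than edited by hand.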
Modifying the Python Bindings to Call the Custom Kernels
To call the kernel from Python code, the bindings were changed to use the PyTorch XPU C++ extension (DPCPPExtension). This allowed the migrated kernels to be installed in the deployment environment with a Python setup script.
Original Bindings: Installing the CUDA Kernel in the Setup Script

    from setuptools import setup
    from torch.utils import cpp_extension

    # Build the CUDA kernel and its C++ binding as the "quant_cuda" module.
    setup(
        name="quant_cuda",
        ext_modules=[
            cpp_extension.CUDAExtension(
                "quant_cuda",
                ["quant_cuda.cpp", "quant_cuda_kernel.cu"],
            )
        ],
        cmdclass={"build_ext": cpp_extension.BuildExtension},
    )

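Modified Bindings: Installing the SYCL Kernel in the Setup Script

    from setuptools import setup
    # Assumed import path for Intel Extension for PyTorch; it may differ by release.
    from intel_extension_for_pytorch.xpu.cpp_extension import (
        DPCPPExtension,
        DpcppBuildExtension,
    )

    # Build the SYCLomatic-migrated kernel (.dp.cpp) as the "quant_sycl" module.
    setup(
        name="quant_sycl",
        ext_modules=[
            DPCPPExtension(
                "quant_sycl",
                ["quant_cuda.cpp", "quant_cuda_kernel.dp.cpp"],
            )
        ],
        cmdclass={"build_ext": DpcppBuildExtension},
    )
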
Once the kernel bindings were installed, the converted SYCL kernels could be called from PyTorch code, which made end-to-end inference with the converted kernels possible. Adapting the existing SqueezeLLM Python code to the SYCL kernels was therefore straightforward and required only small changes to call the migrated kernel bindings.
Performance Analysis of the Converted Kernels
The SqueezeLLM team tested and benchmarked the ported kernel implementations on Intel Data Center GPUs accessed through the Intel Tiber Developer Cloud. As described above, the inference kernels were converted with SYCLomatic, and the SqueezeLLM Python code was then adjusted to call the SYCL code.
The 4-bit kernels were benchmarked on the Intel Data Center GPU Max Series to measure the performance gains from low-precision quantization. This tested whether the conversion process yields kernels efficient enough to make inference practical across platforms.
Table 1 shows the average latency and speedup for the matrix-vector multiplications when generating 128 tokens with the Llama 7B model. These results show that the ported kernels deliver substantial speedups without manual tuning.
The benchmark measures the latency benefit of low-precision quantization that can be obtained on a different hardware back end without modifying the migrated SYCL code. As Table 1 shows, with no manual tuning SqueezeLLM achieves a 2.0x speedup over baseline FP16 inference on Intel Data Center GPUs when running the Llama 7B model.
Kernel | Latency (in seconds) |
---|---|
Baseline: fp16 Matrix-Vector Multiplication | 2.584 |
SqueezeLLM: 4-bit (0% sparsity) | 1.296 |
Speedup | 2.0x |
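The speedup row follows directly from the two latency figures in the table, as this short check shows. The variable names are illustrative.

```python
# Latencies in seconds, taken from the table above.
fp16_latency = 2.584
squeezellm_4bit_latency = 1.296

speedup = fp16_latency / squeezellm_4bit_latency
latency_reduction = 1.0 - squeezellm_4bit_latency / fp16_latency

print(f"speedup: {speedup:.2f}x")
print(f"latency reduction: {latency_reduction:.1%}")
```

The ratio is just under 2, which rounds to the 2.0x reported in the table.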
By comparison, 4-bit inference on the NVIDIA A100 platform achieved a 1.7x speedup over baseline FP16 inference, so the ported kernels deliver a larger relative speedup on Intel Data Center GPUs than the handwritten CUDA kernels deliver on the NVIDIA GPUs they were designed for. These results demonstrate that CUDA-to-SYCL migration with SYCLomatic can achieve comparable speedups across architectures without additional engineering effort or manual kernel tuning after conversion.
In summary
LLM inference is a core workload for emerging applications, and low-precision quantization is a key way to improve inference efficiency. SqueezeLLM enables accurate and efficient generative LLM inference through low-precision quantization. However, its reliance on custom kernel implementations makes cross-platform deployment more difficult. With the SYCLomatic migration tool, the kernel implementations can be ported to other hardware architectures with little effort.
For example, SYCLomatic-migrated 4-bit SqueezeLLM kernels show a 2.0x speedup on Intel Data Center GPUs without manual tuning. SYCL conversion thus democratizes efficient LLM deployment by enabling support for multiple hardware platforms with minimal additional engineering effort.