The open-source project llama.cpp is a lightweight LLM framework that is rapidly gaining popularity. Thanks to its performance and customisability, developers, researchers, and enthusiasts have built a strong community around the project. Since its launch, it has attracted over 600 contributors, 52,000 stars, 1,500 releases, and 7,400 forks on GitHub. Recent code merges have extended llama.cpp to more hardware, including the Intel GPUs found in server and consumer products. Intel GPU support now joins the existing support for CPUs (x86 and ARM) and GPUs from other vendors.
Georgi Gerganov wrote the initial implementation. The project is primarily educational and serves as the main playground for developing new features for the ggml machine-learning tensor library. With its latest releases, Intel is making AI more accessible to a wider range of users by enabling inference on more devices. llama.cpp is written in C and is fast, with a number of other appealing qualities:
- 16-bit float support
- Integer quantisation support (4-bit, 5-bit, 8-bit, etc.; see the sketch after this list)
- No third-party dependencies
- Zero memory allocations during runtime
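As an illustration of the quantisation support, llama.cpp ships a quantize tool that converts a full-precision gguf model into an integer-quantised one. The sketch below assumes the quantize binary built alongside the other llama.cpp examples; the file names are illustrative:

```bash
# Convert an f16 gguf model to 4-bit Q4_0 quantisation.
# File names are illustrative; the quantize binary is built
# together with the other llama.cpp examples.
./quantize models/llama-2-7b.f16.gguf models/llama-2-7b.Q4_0.gguf Q4_0
```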
Intel GPU SYCL Backend
ggml offers a number of backends to support and optimise for different hardware. Intel built the SYCL backend with its direct programming language, SYCL, and its high-performance BLAS library, oneMKL; and since oneAPI supports GPUs from multiple vendors, the backend is not tied to a single vendor's hardware. SYCL is a programming model designed to improve productivity when programming hardware accelerators: a single-source, embedded, domain-specific language based entirely on C++17.
The SYCL backend works with all Intel GPUs. Intel has verified it on:
- Intel Data Centre GPU Max and Flex Series
- Intel Arc discrete GPUs
- The Intel Arc GPU integrated in Intel Core Ultra CPUs
- The iGPU in 11th through 13th Gen Intel Core CPUs
With Intel GPU support in llama.cpp, millions of consumer devices can now run Llama inference. On Intel GPUs, the SYCL backend performs noticeably better than the OpenCL (CLBlast) backend. It also supports a growing set of devices, including CPUs and future processors with AI accelerators. For instructions on using the SYCL backend, refer to the llama.cpp tutorial.
Run an LLM on an Intel GPU with the SYCL Backend
llama.cpp includes a comprehensive guide for SYCL. Any Intel GPU that supports SYCL and oneAPI can run it. Server and cloud users can use Intel Data Centre GPU Max and Flex Series GPUs. Client users can try it on an Intel Arc GPU or on the iGPU in an Intel Core CPU. Intel has tested iGPUs from 11th Gen Core and later; older iGPUs work but perform poorly.
The only restriction is memory. An iGPU uses shared host memory, while a dGPU uses its own dedicated memory. For llama2-7b-Q4 models, Intel advises an iGPU with 80 or more EUs (11th Gen Core and above) and more than 4.5 GB of shared memory; the Q4_0 weights for llama-2-7b alone occupy roughly 3.8 GB before context buffers. That implies 16 GB or more of total host memory, since up to half of host memory can be assigned to the iGPU.
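As a quick sanity check, you can read the EU count and GPU-visible memory from clinfo. The sketch below relies on "Max compute units" and "Global memory size", which are standard fields in clinfo output, though the exact formatting varies by driver version:

```bash
# Print device name, EU count ("Max compute units"), and the memory
# visible to the GPU ("Global memory size"); field formatting varies
# by driver version.
sudo clinfo | grep -E "Device Name|Max compute units|Global memory size"
```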
Install the Intel GPU Driver
Linux and Windows (WSL2) are supported. For Linux, Intel recommends Ubuntu 22.04, which was used for development and testing.
Linux:

```bash
# Add your user to the render and video groups (replace 'username'
# with your user name), install clinfo, and list the visible devices.
sudo usermod -aG render username
sudo usermod -aG video username
sudo apt install clinfo
sudo clinfo -l
```
Output (example):

```
Platform #0: Intel(R) OpenCL Graphics
 -- Device #0: Intel(R) Arc(TM) A770 Graphics
```

or

```
Platform #0: Intel(R) OpenCL HD Graphics
 -- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```
Enable the oneAPI Runtime
First, install the Intel oneAPI Base Toolkit to get the SYCL compiler and oneMKL. Next, enable the oneAPI runtime:
- Linux: source /opt/intel/oneapi/setvars.sh
- Windows: "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
Run sycl-ls to confirm that one or more Level Zero devices are present; make sure at least one GPU entry appears, such as [ext_oneapi_level_zero:gpu:0].
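On a system with an Arc A770, for instance, the relevant line of sycl-ls output looks similar to the following; the version strings are illustrative and will differ per system:

```
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.xxxxx]
```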
Build with one click:
- Linux: ./examples/sycl/build.sh
- Windows: examples\sycl\win-build-sycl.bat
Note that the scripts above include the command to enable the oneAPI runtime.
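If you prefer to build manually, the Linux script amounts to roughly the following sketch. It assumes the LLAMA_SYCL CMake option used at the time of writing (newer releases have renamed it GGML_SYCL):

```bash
# Enable the oneAPI runtime (SYCL compiler and oneMKL) in this shell.
source /opt/intel/oneapi/setvars.sh

# Configure with the SYCL backend and the oneAPI compilers icx/icpx,
# then build; LLAMA_SYCL is the option name assumed here.
mkdir -p build && cd build
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release
```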
Run an Example with One Click
Download llama-2-7b.Q4_0.gguf and save it to the models folder.
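One common source is the TheBloke/Llama-2-7B-GGUF repository on Hugging Face; the URL below is given for illustration, and any copy of llama-2-7b.Q4_0.gguf will do:

```bash
# Fetch the Q4_0 quantised model into the models folder.
# The repository URL is illustrative.
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf -P models/
```

Then run the one-click script: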
- Linux: ./examples/sycl/run-llama2.sh
- Windows: examples\sycl\win-run-llama2.bat
Note that the scripts above include the command to enable the oneAPI runtime. If the ID of your Level Zero GPU is not 0, change the device ID in the script (see the sketch after this list). To list the device IDs:
- Linux: ./build/bin/ls-sycl-device or ./build/bin/main
- Windows: build\bin\ls-sycl-device.exe or build\bin\main.exe
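The run script boils down to something like the following sketch, based on the invocation used at the time of writing; the GGML_SYCL_DEVICE variable and the flag values shown are assumptions that may differ in newer releases:

```bash
# Run llama-2-7b on Level Zero device 0, offloading all 33 layers
# (32 transformer layers plus the output layer) to the GPU.
GGML_SYCL_DEVICE=0 ./build/bin/main \
  -m models/llama-2-7b.Q4_0.gguf \
  -p "Building a website can be done in 10 simple steps:" \
  -n 400 -e -ngl 33
```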
Summary
The SYCL backend in llama.cpp puts all Intel GPUs in the hands of LLM developers and users. Check whether your laptop, gaming PC, or cloud virtual machine has an iGPU, an Intel Arc GPU, or an Intel Data Centre GPU Max or Flex Series GPU. If so, llama.cpp's LLM features on Intel GPUs are yours to enjoy. Intel encourages developers to experiment with the backend and contribute to it, adding new features and SYCL optimisations for Intel GPUs. The oneAPI programming model is also a worthwhile technology to learn for cross-platform development.