Thursday, September 19, 2024

Run LLMs on Intel GPUs Using llama.cpp


The open-source llama.cpp project is a lightweight LLM framework that is steadily growing in popularity. Thanks to its performance and customisability, developers, researchers, and enthusiasts have formed a strong community around it. Since its launch, the project has gathered over 600 contributors, 52,000 stars, 1,500 releases, and 7,400 forks on GitHub. Recent code merges have added support for more hardware, including the Intel GPUs found in server and consumer products. Intel GPU support now joins the existing support for CPUs (x86 and ARM) and GPUs from other vendors.

The original implementation was created by Georgi Gerganov. The project is largely educational and serves as the primary testing ground for new features being developed for the ggml machine learning tensor library. With its latest releases, Intel is making AI more accessible to a wider range of users by enabling inference on more devices. llama.cpp is fast because it is written in C and has several other appealing qualities:

  • 16-bit float support
  • Integer quantisation support (4-bit, 5-bit, 8-bit, and so on; see the example after this list)
  • No third-party dependencies
  • Zero runtime memory allocations
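
As an illustration of the quantisation support, a 16-bit GGUF model can be converted to 4-bit Q4_0 with the quantisation tool that ships with llama.cpp. This is only a sketch: the paths are placeholders, and the binary is named quantize in older builds and llama-quantize in newer ones.

# convert a 16-bit GGUF model to 4-bit Q4_0 (illustrative paths)
./quantize models/llama-2-7b/ggml-model-f16.gguf models/llama-2-7b/ggml-model-Q4_0.gguf Q4_0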

Intel GPU SYCL Backend

ggml provides a number of backends to support and adapt to different hardware. Because oneAPI supports GPUs from multiple vendors, Intel chose to build the SYCL backend on its direct programming language, SYCL, and its high-performance BLAS library, oneMKL. SYCL is a programming model designed to improve productivity on hardware accelerators. It is a domain-specific, embedded, single-source language built entirely on C++17.

The SYCL backend works with all Intel GPUs. Intel has verified it with:

  • Intel Data Centre GPU Max and Flex Series
  • Intel Arc discrete GPUs
  • The Intel Arc GPU built into Intel Core Ultra processors
  • The iGPU in 11th to 13th Gen Intel Core processors

Because llama.cpp now supports Intel GPUs, millions of consumer devices can run Llama inference. The SYCL backend performs noticeably better on Intel GPUs than the OpenCL (CLBlast) backend, and it supports a growing range of devices, including CPUs and future processors with AI accelerators. For details on using the SYCL backend, refer to the llama.cpp SYCL guide.

Run an LLM on an Intel GPU with the SYCL Backend

llama.cpp includes a complete guide for SYCL. It runs on any Intel GPU that supports SYCL and oneAPI. Server and cloud users can use Intel Data Centre GPU Max and Flex Series GPUs. Client users can try it on an Intel Arc GPU or on the iGPU in an Intel Core processor. Intel has tested iGPUs from the 11th Gen Core onwards; older iGPUs work but perform poorly.


Memory is the only restriction. The iGPU uses shared host memory, while a dGPU uses its own dedicated memory. For llama2-7b-Q4 models, Intel advises using an iGPU with 80 or more EUs (11th Gen Core and above) and more than 4.5 GB of shared memory (total host memory of 16 GB or more, since up to half of it can be assigned to the iGPU).
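
As a rough check of whether a given iGPU meets these guidelines, clinfo can report the execution unit count and the memory visible to each OpenCL device. The field labels grepped below are the usual clinfo ones and may vary slightly with driver version.

# show device name, compute units (EUs) and visible memory per OpenCL device
clinfo | grep -iE "Device Name|Max compute units|Global memory size"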

Install the Intel GPU Driver

Linux and Windows (WSL2) are supported. Intel recommends Ubuntu 22.04 for Linux; this version was used for development and testing.

Linux:

# add your user to the render and video groups (replace "username" with your login)
sudo usermod -aG render username
sudo usermod -aG video username
# install clinfo and list the detected OpenCL platforms and devices
sudo apt install clinfo
sudo clinfo -l

Output (example):

Platform #0: Intel(R) OpenCL Graphics -- Device #0: Intel(R) Arc(TM) A770 Graphics

or

Platform #0: Intel(R) OpenCL HD Graphics -- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
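
If no GPU shows up in the clinfo output, the compute runtime itself may still need to be installed. On Ubuntu 22.04 the packages below are the ones commonly installed from Intel's GPU repository; exact package names can differ by distribution and driver release, so treat this as a sketch.

# install the OpenCL and Level Zero compute runtimes (names may vary by release)
sudo apt install intel-opencl-icd intel-level-zero-gpu level-zero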

Enable the oneAPI Runtime

First, install the Intel oneAPI Base Toolkit to get the SYCL compiler and oneMKL. Next, enable the oneAPI runtime:

  • Linux: source /opt/intel/oneapi/setvars.sh
  • Windows: "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64

Run sycl-ls to confirm that one or more Level Zero devices are present. At least one GPU should be listed, such as [ext_oneapi_level_zero:gpu:0].
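
A quick way to filter the listing for GPU entries, assuming a default oneAPI installation under /opt/intel/oneapi:

# enable the oneAPI runtime, then show only Level Zero GPU devices
source /opt/intel/oneapi/setvars.sh
sycl-ls | grep -i "level_zero.*gpu"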

Build with one click:

  • Linux: ./examples/sycl/build.sh
  • Windows: examples\sycl\win-build-sycl.bat

Note that the scripts above include the command to enable the oneAPI runtime.
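
For those who prefer to build manually instead of using the one-click scripts, a minimal Linux sketch is shown below. The CMake option name is an assumption that has changed between llama.cpp releases (older trees use -DLLAMA_SYCL=ON, newer ones -DGGML_SYCL=ON), so check the SYCL guide in your checkout.

# manual Linux build sketch; the SYCL option name may differ by llama.cpp version
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j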

Run an Example with One Click

Download llama-2-7b.Q4_0.gguf and save it to the models folder, then run:

  • Linux: ./examples/sycl/run-llama2.sh
  • Windows: examples\sycl\win-run-llama2.bat

Note that the scripts above include the command to enable the oneAPI runtime. If the ID of your Level Zero GPU is not 0, change the device ID in the script. To list the device IDs:

  • Linux: ./build/bin/ls-sycl-device or ./build/bin/main
  • Windows: build\bin\ls-sycl-device.exe or build\bin\main.exe
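
As an alternative to the one-click run script, the model can also be launched directly. The sketch below assumes a Linux build under ./build and the model file in ./models, and uses the standard flags of the main example (-m model path, -p prompt, -n tokens to generate, -ngl layers to offload to the GPU); the prompt is purely illustrative.

# direct run sketch; -ngl 33 offloads all llama2-7b layers to the GPU
source /opt/intel/oneapi/setvars.sh
./build/bin/main -m models/llama-2-7b.Q4_0.gguf \
  -p "Building a website can be done in 10 simple steps:" -n 400 -ngl 33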

Synopsis

The SYCL backend in llama.cpp makes all Intel GPUs available to LLM developers and users. Check whether your laptop, gaming PC, or cloud virtual machine has an Intel iGPU, an Intel Arc GPU, or an Intel Data Centre GPU Max or Flex Series GPU. If so, you can enjoy llama.cpp's LLM features on Intel GPUs. Intel encourages developers to experiment with and contribute to the backend in order to add new features and optimise SYCL for Intel GPUs. The oneAPI programming model is also a worthwhile approach to learn for cross-platform development.
