Generative AI optimized for edge devices
How generative AI may be integrated into edge devices with constrained resources via pruning, quantization, and knowledge distillation
As of April 2023, one in three Americans reported using generative artificial intelligence (AI).
Are you one of them? A worldwide AI craze was ignited in November 2022 when OpenAI launched ChatGPT.
Most generative AI applications still run in the cloud, where their workloads add hardware and operating costs. As apps like ChatGPT and Midjourney become more widely used, these extra workload demands are prompting a rethink of the best way to deploy AI models.
One of the most promising deployment strategies is to move some or all of the AI workload to edge devices such as smartphones, laptops, and extended reality (XR) headsets, which have substantial on-device AI processing capabilities. To run on-device, AI models must be optimized for edge devices so that they make use of the available AI accelerators.
Generative AI workloads that can run locally include text generation; image and video generation, enhancement, and editing; audio creation and enhancement; and even code generation.
Qualcomm recently announced that it intends to make large language models (LLMs) based on Meta’s Llama available on Snapdragon platforms in 2024. Once these neural networks are optimized, they will need less memory and processing power, making them suitable for mainstream edge devices.
Although the parameter growth of some generative AI applications, like ChatGPT, is likely to outpace the performance improvements in mobile systems-on-chips (SoCs), there are already many generative AI models with fewer than 10 billion parameters that are suitable for on-device processing, and their number will only grow over time.
AI model optimization for on-device use
Neural network models are typically trained in a data center at high accuracy, whereas AI models deployed on edge devices, or even in the cloud, trade some accuracy for computational efficiency. The aim is to make the model as small as possible while keeping accuracy high enough for the results to be useful in the specific use case.
Larger models usually produce more accurate results. However, the gains in accuracy are often small relative to the substantial resource costs they bring. An AI model's size is determined by its number of parameters; a model with fewer parameters will generally produce results faster and with less processing power.
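The link between parameter count and resource cost can be made concrete with some rough arithmetic. The sketch below estimates weight storage only, and the 7-billion-parameter model and the helper name model_size_gb are assumptions chosen for illustration, not figures from any specific product. It ignores activations and runtime overhead, so real memory use would be higher.

```python
def model_size_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical model size, assumed for illustration only.
params = 7_000_000_000

# Compare storage at 32, 8, and 4 bits per parameter.
for bits in (32, 8, 4):
    print(f"{bits:>2} bits per parameter: {model_size_gb(params, bits):.1f} GB")
```

Under these assumptions, the same model needs roughly 28 GB of weight storage at 32 bits per parameter but about 3.5 GB at 4 bits, which is the difference between exceeding and fitting within the memory of a typical edge device.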
Three methods for improving AI models
AI model optimization relies on three main methods: quantization, pruning, and knowledge distillation.
Quantization reduces the bit precision that the AI model uses for the neural network’s weight and activation values. It uses lower-precision data types, such as 4-bit or 8-bit integers (INT4 or INT8), in place of the higher-precision 32-bit floating point (FP32) data type that is typically used when training the model. Quantizing from 32 bits to 8 bits reduces the model size to a quarter of the original.
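To illustrate the idea, here is a minimal sketch of symmetric, per-tensor INT8 quantization in plain Python. This is one common scheme among several, and the function names and sample weights are assumptions made for this example. Production toolchains typically also calibrate activation ranges, which this sketch does not attempt.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weight values, assumed for illustration.
weights = [0.82, -1.27, 0.003, 0.45, -0.6]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight is stored as a one-byte code plus a single shared scale, instead of four bytes per weight. The price is a rounding error of at most half the scale per weight, which is why very small values such as 0.003 collapse to zero here.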
Pruning is the process of identifying and removing unnecessary or redundant parameters. It can improve an AI model's efficiency while largely preserving accuracy. Reported results using both Bayesian compression and spatial SVD, with ResNet18 as the baseline, show a 3x reduction in model size with less than 1% loss in accuracy. Findings also indicate that quantization generally works better than pruning.
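The sketch below shows the general idea of pruning using simple magnitude-based pruning, which zeroes out the weights with the smallest absolute values. This is a deliberately simpler technique than the Bayesian compression and spatial SVD approaches mentioned above, and the function name, sample weights, and 50% sparsity target are assumptions for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Hypothetical weight values, assumed for illustration.
weights = [0.9, -0.02, 0.4, 0.01, -0.7, 0.05]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

Note that zeroed weights only save memory or compute when they are stored in a sparse format or when the hardware can skip them, and a pruned model is usually fine-tuned afterward to recover accuracy.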
Knowledge distillation uses a large, trained AI model (the teacher) to train a smaller model (the student), reducing the size of the model while retaining comparable accuracy. The smaller model is often several times smaller than the original.
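One common way to carry knowledge from the large model to the small one is to train the small model to match the large model's softened output probabilities. The sketch below computes such a loss for a single example in plain Python. The temperature value, the logits, and the function names are assumptions for illustration, and in practice this term is usually combined with a standard loss on the true labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher to the softened student."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Conventional temperature**2 factor keeps gradient magnitudes
    # comparable when the temperature is changed.
    return kl * temperature ** 2


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.2]
close_student = [3.5, 1.2, 0.1]
far_student = [0.1, 3.0, 2.0]

print("close student loss:", distillation_loss(teacher, close_student))
print("far student loss:", distillation_loss(teacher, far_student))
```

A student whose outputs resemble the teacher's gets a small loss, while one that disagrees gets a larger loss, so minimizing this value pushes the smaller model toward the larger model's behavior.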
Moving AI workloads onto the device
With these three optimization techniques, it is easy to see how generative AI, or any AI application, could be moved to an edge device such as a smartphone, XR headset, or desktop PC.
Smartphones have already shown their ability to absorb new functionality quickly by taking advantage of advances in memory, compute, and sensor technologies. In less than ten years, smartphones have replaced portable media players, handheld gaming consoles, point-and-shoot cameras, consumer video cameras, and prosumer digital single-lens reflex (DSLR) cameras. For the last several years, high-end smartphones have been able to capture and process 8K video with ease.
Leading smartphone manufacturers already use on-device AI for a range of purposes, from security and battery life to computational photography and audio enhancement. The same is true of most popular edge device and edge networking platforms. The challenge is shrinking and optimizing generative AI models so that they can run on these edge devices.
In addition to reducing latency, processing AI models on-device addresses two concerns of growing importance: data security and privacy. By removing the interaction with the cloud, both the data and the generative AI's results can stay on the device.
Optimizing generative AI for edge devices is the way of the future
As consumer use of generative AI grows, so will the burden of running its workloads in the cloud. These increased cloud workloads are prompting a rethink of the best way to deploy AI models.
Optimization techniques such as quantization, pruning, and knowledge distillation are being used to shrink AI models so that they are suitable for on-device processing. By moving AI workloads to edge devices, users can benefit from lower latency, improved privacy, customization, and other advantages of on-device AI.