Saturday, July 6, 2024

On-Device AI Magic: Unlock Advanced Strategies

Advanced strategies for rapid, accurate, and efficient on-device generative AI models

With generative AI use rising at historic rates and compute demands growing alongside it, on-device AI processing has become crucial. The first Android phone-based Stable Diffusion demo was shown at MWC 2023, and much has been achieved since then.

At Snapdragon Summit 2023, we exhibited some stunning on-device AI demonstrations on Snapdragon 8 Gen 3 Mobile Platform smartphones and Snapdragon X Elite Platform computers. This article shows how Qualcomm AI Research uses sophisticated methodologies and full-stack on-device AI optimizations to make these generative AI demo experiences fast and efficient.

Effective On-Device AI via knowledge distillation and quantization-aware training

Knowledge distillation and quantization helped us make models smaller, faster, and more accurate.

Knowledge distillation is a transfer-learning strategy that trains a smaller “student” model to replicate a larger “teacher” model with high accuracy. By matching the two models’ logits, the teacher’s learned representation is transferred to the student while the distillation loss is minimized. The result is a smaller model that runs inference faster without repeating the teacher’s full training process.
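To make the idea concrete, here is a minimal sketch of logit-matching distillation in PyTorch. The temperature, loss weighting, and model objects are illustrative assumptions, not the training recipe actually used for these demos.

```python
# Minimal sketch of logit-matching knowledge distillation (illustrative only;
# the temperature and loss weighting are assumptions, not a production recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft logit-matching term with the usual hard-label loss."""
    # Soft targets: the student matches the teacher's softened distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage: the teacher runs without gradients; only the student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, labels)
```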

Quantization reduces the bit precision of weight parameters and activation calculations. We developed methods to reduce quantization error, many of which are available in the AI Model Efficiency Toolkit (AIMET) on the Qualcomm AI Stack and GitHub. Post-training quantization needs no training but may not reach the required accuracy at certain bit precisions. Quantization-aware training simulates quantization during fine-tuning, so the model learns to minimize the resulting loss and retain accuracy.
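As an illustration of how quantization-aware training simulates low-bit arithmetic, the sketch below hand-rolls per-tensor “fake quantization” with a straight-through estimator. This is only a conceptual example; it is not AIMET’s API, and real workflows would use the toolkit rather than this simplified rounding scheme.

```python
# Hand-rolled "fake quantization" used to simulate low-bit inference during
# training. Illustrative only; not AIMET's API or the production scheme.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize to symmetric num_bits integers, then dequantize back to float."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for INT8, 7 for INT4
    scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor symmetric scale
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_dq = x_int * scale
    # Straight-through estimator: forward uses the quantized values,
    # backward treats quantization as identity so gradients still flow.
    return x + (x_dq - x).detach()

# During quantization-aware training, weights (and activations) pass through
# fake_quantize in the forward pass so the fine-tuning loss "sees" the
# quantization error and learns to compensate for it.
w = torch.randn(4, 4, requires_grad=True)
loss = (fake_quantize(w, num_bits=4) ** 2).sum()
loss.backward()  # gradients reach w thanks to the straight-through estimator
```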

Creating the fastest Stable Diffusion on a phone

Our Stable Diffusion implementation ran in under a second at Snapdragon Summit. How did we accomplish it? As its name suggests, Stable Diffusion creates images through reverse diffusion conditioned on the input prompt. To speed it up, the bottleneck must first be found.

The largest component model in the Stable Diffusion architecture is the UNet, which runs repeatedly to denoise the latent image until the output image is formed. Good image quality often requires 20 or more denoising steps, so this UNet repetition is the computational bottleneck. Our improvements focused on reducing the denoising computation.
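The loop structure below (with dummy stand-ins for the UNet, scheduler, and prompt embedding, none of which are the real pipeline API) shows why the step count matters: the UNet runs once per denoising step, so generation time grows roughly linearly with the number of steps.

```python
# Schematic of the Stable Diffusion denoising loop, using dummy tensors in
# place of the real UNet, scheduler, and VAE decoder. The point is structural:
# the UNet runs once per step, so its cost dominates generation time.
import torch

def dummy_unet(latents, t, cond):
    return torch.zeros_like(latents)        # stand-in for noise prediction

def scheduler_step(latents, noise_pred, t):
    return latents - 0.1 * noise_pred       # stand-in for a DDIM/DDPM update

def generate(prompt_embedding, num_steps=20):
    latents = torch.randn(1, 4, 64, 64)     # initial noise in latent space
    for t in reversed(range(num_steps)):
        noise_pred = dummy_unet(latents, t, prompt_embedding)   # the bottleneck
        latents = scheduler_step(latents, noise_pred, t)
    return latents                          # a VAE decoder would produce the image

latents = generate(prompt_embedding=torch.randn(1, 77, 768), num_steps=20)
```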

Our research used knowledge distillation, guidance conditioning, and step distillation to create an efficient UNet. We saved computation by pruning the attention blocks in the first layer of the UNet, with knowledge distillation restoring model accuracy. Knowledge distillation also reduced the number of denoising steps by teaching the student model to match, in a single step, the output the teacher produces over several steps, as sketched below.
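Here is a minimal sketch of the step-distillation objective: the student is trained to reach, in one UNet call, the latents the teacher produces after several denoising steps. The model and scheduler arguments are placeholders, and this is not necessarily the exact loss formulation used in the research described here.

```python
# Minimal sketch of step distillation: the student matches in one UNet call
# what the teacher produces over several. All callables are placeholders.
import torch
import torch.nn.functional as F

def step_distillation_loss(student_unet, teacher_unet, scheduler_step,
                           latents, t, cond, teacher_steps=2):
    # Teacher: run several denoising steps without gradients.
    with torch.no_grad():
        teacher_latents = latents
        for i in range(teacher_steps):
            noise_pred = teacher_unet(teacher_latents, t - i, cond)
            teacher_latents = scheduler_step(teacher_latents, noise_pred, t - i)
    # Student: reach the same point in a single step.
    student_pred = student_unet(latents, t, cond)
    student_latents = scheduler_step(latents, student_pred, t)
    return F.mse_loss(student_latents, teacher_latents)
```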

Results: fast, high-quality image generation

By combining these methods, we created the fastest diffusion-based text-to-image generation on a phone. Running on a Snapdragon 8 Gen 3 phone, we achieved a 9x speedup over the baseline Stable Diffusion and over 25x faster than at MWC 2023.

Image quality was also encouraging, both on perceptual metrics such as CLIP and FID and in qualitative results. Generating high-quality images in under a second improves the user experience and expands the use cases for generative AI applications.

The fastest Llama 2-7B on a phone

To optimize a large language model (LLM) like Llama 2, its bottlenecks must first be found. Since LLMs are autoregressive, each output token depends on previously generated tokens, so processing must be sequential.

LLMs generate one token per inference pass; a token is a linguistic unit such as a word, number, or punctuation mark. Every response token is generated using all of the LLM’s parameters, so Llama 2-7B must read seven billion parameters to create each token, which requires substantial memory bandwidth. In addition, LLMs must generate many tokens for a chatbot response, such as several sentences, within a few seconds.
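A rough back-of-the-envelope calculation shows the scale of the problem, assuming every weight is read once per generated token (a simplification that ignores caching and KV-cache traffic); the numbers are illustrative, not measured.

```python
# Approximate memory traffic per generated token for a 7B-parameter model,
# assuming one full read of the weights per token (illustrative only).
params = 7e9  # Llama 2-7B parameter count

for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb_per_token = params * bytes_per_param / 1e9
    gb_per_sec_at_20_tps = gb_per_token * 20
    print(f"{name}: ~{gb_per_token:.1f} GB read per token, "
          f"~{gb_per_sec_at_20_tps:.0f} GB/s needed for 20 tokens/s")
```

At FP16 that is roughly 14 GB of weight reads per token, which is why smaller weights and fewer full-model passes matter so much on a phone.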

So, LLMs are memory-bandwidth bound rather than compute bound. This motivated us to find the best on-device LLM acceleration methods: knowledge distillation, quantization-aware training, and speculative decoding all reduce memory-bandwidth pressure.

We optimized the full AI stack for fast Llama 2 inference via knowledge distillation, quantization-aware training, and speculative decoding.

Quantization-aware training with knowledge distillation

To compress the model by 4x, we want Llama 2’s parameters in 4-bit integer (INT4) format. The model was trained in 16-bit floating point (FP16), and quantizing it to INT4 can be difficult if post-training quantization is not accurate enough or if the original training pipeline (e.g., data or rewards) is unavailable for quantization-aware training.

We solve these problems by combining quantization-aware training with knowledge distillation to create an accurate, compact INT4 model. Perplexity, a standard measure of LLM generation quality, degraded by less than 1%, and reasoning benchmark accuracy also dropped by less than 1%.
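For a sense of what INT4 weights buy, the sketch below applies simple per-channel symmetric quantization to a single FP16 weight matrix and reports the resulting compression and rounding error. This scheme is a common baseline, not necessarily the exact quantizer used for Llama 2 here.

```python
# Illustration of INT4 weight quantization: ~4x smaller than FP16 at the cost
# of rounding error. Per-channel symmetric quantization is a common baseline,
# not necessarily the scheme used in the work described above.
import torch

w = torch.randn(1024, 1024, dtype=torch.float16)          # one FP16 weight matrix
w32 = w.float()
qmax = 7                                                   # signed 4-bit range [-8, 7]
scale = w32.abs().amax(dim=1, keepdim=True) / qmax         # per-output-channel scale
w_int4 = torch.clamp(torch.round(w32 / scale), -8, 7)
w_deq = w_int4 * scale                                     # dequantized weights

fp16_bytes = w.numel() * 2
int4_bytes = w.numel() // 2 + scale.numel() * 2            # 2 weights per byte + scales
print(f"compression vs FP16: {fp16_bytes / int4_bytes:.1f}x")
print(f"mean abs rounding error: {(w32 - w_deq).abs().mean().item():.5f}")
```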

Speculative decoding

Since LLMs are memory bound, we can use speculative decoding to increase the token rate by trading compute for memory bandwidth, enhancing the user experience. Speculative decoding uses a draft model, much smaller than the high-accuracy target model, to quickly generate speculative tokens in sequence, which the original target model then checks and corrects.

Consider a draft model that generates three speculative tokens in sequence. The target model then processes all three speculative tokens in one model pass (a batch), reading its parameters only once, and determines which tokens to accept. Because LLMs are memory bound, speculative decoding can speed them up significantly.
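The toy sketch below captures this control flow with greedy (exact-match) verification; production implementations use a probabilistic acceptance rule, and all model objects here are placeholders rather than the actual on-device runtime.

```python
# Toy sketch of speculative decoding: a small draft model proposes k tokens,
# the large target model scores them in one batched pass, and a prefix of the
# proposals is accepted. Greedy verification shown for simplicity.
import torch

def speculative_step(target_model, draft_model, context, k=3):
    # 1) Draft model proposes k tokens autoregressively (cheap but sequential).
    draft_tokens = []
    draft_ctx = context
    for _ in range(k):
        logits = draft_model(draft_ctx)[:, -1, :]
        tok = logits.argmax(dim=-1, keepdim=True)
        draft_tokens.append(tok)
        draft_ctx = torch.cat([draft_ctx, tok], dim=1)

    # 2) Target model verifies all k proposals in a single pass, so its
    #    weights are read once for up to k new tokens instead of k times.
    target_logits = target_model(draft_ctx)[:, -(k + 1):-1, :]
    target_tokens = target_logits.argmax(dim=-1)           # shape [1, k]

    # 3) Accept the longest prefix where draft and target agree; on the first
    #    disagreement, keep the target model's correction and stop.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok.item() == target_tokens[0, i].item():
            accepted.append(tok)
        else:
            accepted.append(target_tokens[0, i].view(1, 1))
            break
    return torch.cat([context] + accepted, dim=1)
```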

Our implementation of speculative decoding on a phone using the Llama 2-7B Chat model was a first. Optimizing Llama 2-7B with speculative decoding and our other research methods allowed Snapdragon 8 Gen 3 phones to generate text at 20 tokens per second.

This was the fastest Llama 2-7B on a phone, demonstrating conversation with an AI assistant running entirely on device.

Continuously improving on-device generative AI

AI processing at the edge is increasingly important for scaling generative AI. On-device AI is essential to hybrid and global generative AI: processing will be shared between the cloud and edge devices based on device capabilities, privacy and security needs, performance requirements, and business models.

We keep optimizing our solutions with full-stack on-device AI work and keep pushing edge technology forward. We have greatly enhanced generative AI on edge devices in less than 10 months, and we are excited to push it further and see what the ecosystem builds with it.
