Saturday, July 6, 2024

On-Device AI Magic: Unlock Advanced Strategies

Advanced strategies for rapid, accurate, and efficient on-device generative AI models

With generative AI use rising at historic rates and compute demands growing alongside it, on-device AI processing has become crucial. The first Android phone-based Stable Diffusion demo was shown at MWC 2023, and much has been achieved since then.

At Snapdragon Summit 2023, we exhibited some stunning on-device AI demonstrations on Snapdragon 8 Gen 3 Mobile Platform smartphones and Snapdragon X Elite Platform computers. This article shows how Qualcomm AI Research uses sophisticated methodologies and full-stack on-device AI optimizations to make these generative AI demo experiences fast and efficient.

Effective On-Device AI via knowledge distillation and quantization-aware training

Knowledge distillation and quantization helped us make models smaller, faster, and more accurate.

Knowledge distillation is a transfer-learning strategy that trains a smaller “student” model to replicate a larger “teacher” model with high accuracy. By matching the two models’ logits, the teacher’s learned representation is transferred to the student while the distillation loss is minimized. The result is a smaller model that runs inference faster without repeating the teacher’s full training process.
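To make the idea concrete, here is a minimal sketch of logit-matching distillation in PyTorch. The temperature, loss weighting, and model objects are illustrative assumptions, not the training recipe actually used for these demos.

```python
# Minimal sketch of logit-matching knowledge distillation (illustrative only;
# the temperature and loss weighting are assumptions, not a production recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft logit-matching term with the usual hard-label loss."""
    # Soft targets: the student matches the teacher's softened distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage: the teacher runs without gradients; only the student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, labels)
```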

Quantization reduces the bit precision of weight parameters and activation calculations. We developed methods to reduce quantization error, many of which are available in the AI Model Efficiency Toolkit (AIMET) on the Qualcomm AI Stack and GitHub. Post-training quantization needs no training but may not reach the required accuracy at certain bit precisions. Quantization-aware training simulates quantization during fine-tuning, so the model learns to minimize the resulting loss and retain accuracy.
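As an illustration of how quantization-aware training simulates low-bit arithmetic, the sketch below hand-rolls per-tensor “fake quantization” with a straight-through estimator. This is only a conceptual example; it is not AIMET’s API, and real workflows would use the toolkit rather than this simplified rounding scheme.

```python
# Hand-rolled "fake quantization" used to simulate low-bit inference during
# training. Illustrative only; not AIMET's API or the production scheme.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize to symmetric num_bits integers, then dequantize back to float."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for INT8, 7 for INT4
    scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor symmetric scale
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_dq = x_int * scale
    # Straight-through estimator: forward uses the quantized values,
    # backward treats quantization as identity so gradients still flow.
    return x + (x_dq - x).detach()

# During quantization-aware training, weights (and activations) pass through
# fake_quantize in the forward pass so the fine-tuning loss "sees" the
# quantization error and learns to compensate for it.
w = torch.randn(4, 4, requires_grad=True)
loss = (fake_quantize(w, num_bits=4) ** 2).sum()
loss.backward()  # gradients reach w thanks to the straight-through estimator
```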

Creating the fastest Stable Diffusion on a phone

Our Stable Diffusion implementation ran in under a second at Snapdragon Summit. How did we accomplish it? As its name suggests, Stable Diffusion creates images through reverse diffusion conditioned on the input prompt. To speed it up, the bottleneck must first be found.

The largest component model in the Stable Diffusion architecture is the UNet, which runs repeatedly to denoise the latent image until the output image is formed. Good image quality often requires 20 or more denoising steps, so this UNet repetition is the computational bottleneck. Our improvements focused on reducing the denoising computation.
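The loop structure below (with dummy stand-ins for the UNet, scheduler, and prompt embedding, none of which are the real pipeline API) shows why the step count matters: the UNet runs once per denoising step, so generation time grows roughly linearly with the number of steps.

```python
# Schematic of the Stable Diffusion denoising loop, using dummy tensors in
# place of the real UNet, scheduler, and VAE decoder. The point is structural:
# the UNet runs once per step, so its cost dominates generation time.
import torch

def dummy_unet(latents, t, cond):
    return torch.zeros_like(latents)        # stand-in for noise prediction

def scheduler_step(latents, noise_pred, t):
    return latents - 0.1 * noise_pred       # stand-in for a DDIM/DDPM update

def generate(prompt_embedding, num_steps=20):
    latents = torch.randn(1, 4, 64, 64)     # initial noise in latent space
    for t in reversed(range(num_steps)):
        noise_pred = dummy_unet(latents, t, prompt_embedding)   # the bottleneck
        latents = scheduler_step(latents, noise_pred, t)
    return latents                          # a VAE decoder would produce the image

latents = generate(prompt_embedding=torch.randn(1, 77, 768), num_steps=20)
```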

Our research used knowledge distillation, guidance conditioning, and step distillation to create an efficient UNet. We saved computation by pruning the attention blocks in the first layer of the UNet, with knowledge distillation restoring model accuracy. Knowledge distillation also reduced the number of denoising steps by teaching the student model to match, in a single step, the output the teacher produces over several steps, as sketched below.
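Here is a minimal sketch of the step-distillation objective: the student is trained to reach, in one UNet call, the latents the teacher produces after several denoising steps. The model and scheduler arguments are placeholders, and this is not necessarily the exact loss formulation used in the research described here.

```python
# Minimal sketch of step distillation: the student matches in one UNet call
# what the teacher produces over several. All callables are placeholders.
import torch
import torch.nn.functional as F

def step_distillation_loss(student_unet, teacher_unet, scheduler_step,
                           latents, t, cond, teacher_steps=2):
    # Teacher: run several denoising steps without gradients.
    with torch.no_grad():
        teacher_latents = latents
        for i in range(teacher_steps):
            noise_pred = teacher_unet(teacher_latents, t - i, cond)
            teacher_latents = scheduler_step(teacher_latents, noise_pred, t - i)
    # Student: reach the same point in a single step.
    student_pred = student_unet(latents, t, cond)
    student_latents = scheduler_step(latents, student_pred, t)
    return F.mse_loss(student_latents, teacher_latents)
```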

Results: fast, high-quality image generation

By combining these methods, we created the fastest diffusion-based text-to-image generation on a phone. Running on a Snapdragon 8 Gen 3 phone, we achieved a 9x speedup over the baseline Stable Diffusion and over 25x faster than at MWC 2023.

Image quality was also encouraging, both on perceptual metrics such as CLIP and FID and in qualitative results. Generating high-quality images in under a second improves the user experience and expands the use cases for generative AI applications.

The fastest Llama 2-7B on a phone

To optimize a large language model (LLM) like Llama 2, its bottlenecks must first be found. Since LLMs are autoregressive, each output token depends on previously generated tokens, so processing must be sequential.

LLMs generate one token per inference pass; a token is a linguistic unit such as a word, number, or punctuation mark. Every response token is generated using all of the LLM’s parameters, so Llama 2-7B must read seven billion parameters to create each token, which requires substantial memory bandwidth. In addition, LLMs must generate many tokens for a chatbot response, such as several sentences, within a few seconds.
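A rough back-of-the-envelope calculation shows the scale of the problem, assuming every weight is read once per generated token (a simplification that ignores caching and KV-cache traffic); the numbers are illustrative, not measured.

```python
# Approximate memory traffic per generated token for a 7B-parameter model,
# assuming one full read of the weights per token (illustrative only).
params = 7e9  # Llama 2-7B parameter count

for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb_per_token = params * bytes_per_param / 1e9
    gb_per_sec_at_20_tps = gb_per_token * 20
    print(f"{name}: ~{gb_per_token:.1f} GB read per token, "
          f"~{gb_per_sec_at_20_tps:.0f} GB/s needed for 20 tokens/s")
```

At FP16 that is roughly 14 GB of weight reads per token, which is why smaller weights and fewer full-model passes matter so much on a phone.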

So, LLMs are memory-bandwidth bound rather than compute bound. This motivated us to find the best on-device LLM acceleration methods: knowledge distillation, quantization-aware training, and speculative decoding all reduce memory-bandwidth pressure.

We optimized the full AI stack for fast Llama 2 inference via knowledge distillation, quantization-aware training, and speculative decoding.

Quantization-aware training with knowledge distillation

To compress the model by 4x, we want Llama 2’s parameters in 4-bit integer (INT4) format. The model was trained in 16-bit floating point (FP16), and quantizing it to INT4 can be difficult if post-training quantization is not accurate enough or if the original training pipeline (e.g., data or rewards) is unavailable for quantization-aware training.

We solve these problems by combining quantization-aware training with knowledge distillation to create an accurate, compact INT4 model. Perplexity, a standard measure of LLM generation quality, degraded by less than 1%, and reasoning benchmark accuracy also dropped by less than 1%.
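For a sense of what INT4 weights buy, the sketch below applies simple per-channel symmetric quantization to a single FP16 weight matrix and reports the resulting compression and rounding error. This scheme is a common baseline, not necessarily the exact quantizer used for Llama 2 here.

```python
# Illustration of INT4 weight quantization: ~4x smaller than FP16 at the cost
# of rounding error. Per-channel symmetric quantization is a common baseline,
# not necessarily the scheme used in the work described above.
import torch

w = torch.randn(1024, 1024, dtype=torch.float16)          # one FP16 weight matrix
w32 = w.float()
qmax = 7                                                   # signed 4-bit range [-8, 7]
scale = w32.abs().amax(dim=1, keepdim=True) / qmax         # per-output-channel scale
w_int4 = torch.clamp(torch.round(w32 / scale), -8, 7)
w_deq = w_int4 * scale                                     # dequantized weights

fp16_bytes = w.numel() * 2
int4_bytes = w.numel() // 2 + scale.numel() * 2            # 2 weights per byte + scales
print(f"compression vs FP16: {fp16_bytes / int4_bytes:.1f}x")
print(f"mean abs rounding error: {(w32 - w_deq).abs().mean().item():.5f}")
```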

Speculative decoding

Since LLMs are memory bound, we can use speculative decoding to increase the token rate by trading compute for memory bandwidth, enhancing the user experience. Speculative decoding uses a draft model, much smaller than the high-accuracy target model, to quickly generate speculative tokens in sequence, which the original target model then checks and corrects.

Consider a draft model that generates three speculative tokens in sequence. The target model then processes all three speculative tokens in one model pass (a batch), reading its parameters only once, and determines which tokens to accept. Because LLMs are memory bound, speculative decoding can speed them up significantly.
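The toy sketch below captures this control flow with greedy (exact-match) verification; production implementations use a probabilistic acceptance rule, and all model objects here are placeholders rather than the actual on-device runtime.

```python
# Toy sketch of speculative decoding: a small draft model proposes k tokens,
# the large target model scores them in one batched pass, and a prefix of the
# proposals is accepted. Greedy verification shown for simplicity.
import torch

def speculative_step(target_model, draft_model, context, k=3):
    # 1) Draft model proposes k tokens autoregressively (cheap but sequential).
    draft_tokens = []
    draft_ctx = context
    for _ in range(k):
        logits = draft_model(draft_ctx)[:, -1, :]
        tok = logits.argmax(dim=-1, keepdim=True)
        draft_tokens.append(tok)
        draft_ctx = torch.cat([draft_ctx, tok], dim=1)

    # 2) Target model verifies all k proposals in a single pass, so its
    #    weights are read once for up to k new tokens instead of k times.
    target_logits = target_model(draft_ctx)[:, -(k + 1):-1, :]
    target_tokens = target_logits.argmax(dim=-1)           # shape [1, k]

    # 3) Accept the longest prefix where draft and target agree; on the first
    #    disagreement, keep the target model's correction and stop.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok.item() == target_tokens[0, i].item():
            accepted.append(tok)
        else:
            accepted.append(target_tokens[0, i].view(1, 1))
            break
    return torch.cat([context] + accepted, dim=1)
```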

Our implementation of speculative decoding on a phone using the Llama 2-7B Chat model was a first. Optimizing Llama 2-7B with speculative decoding and our other research methods allowed Snapdragon 8 Gen 3 phones to generate text at 20 tokens per second.

This was the fastest Llama 2-7B on a phone, demonstrating conversation with an AI assistant running entirely on device.

Continuously improving on-device generative AI

AI processing at the edge is increasingly important for scaling generative AI. On-device AI is essential to hybrid and global generative AI: processing will be shared between the cloud and edge devices based on device capabilities, privacy and security needs, performance requirements, and business models.

We keep optimizing our solutions with full-stack on-device AI work and keep pushing edge technology forward. We have greatly enhanced generative AI on edge devices in less than 10 months, and we are excited to push it further and see what the ecosystem builds with it.
