
SmoothQuant: Simple Quantization For Large Language Models


Smaller is better: Intel Xeon processors deliver an effective generative AI experience with the Q8-Chat LLM. SmoothQuant brings effective quantization to LLMs, offering a scalable way to optimize large language models for AI applications.

The field of machine learning is exploding with large language models (LLMs). Thanks to their transformer architecture, LLMs have the remarkable capacity to learn from enormous volumes of unstructured data, such as text, images, video, or audio. They excel at a wide range of tasks, including extractive ones like text classification and generative ones like text summarization and text-to-image generation.


LLMs are big models, as their name suggests, and frequently include more than 10 billion parameters. Some, like the BLOOM model, contain over 100 billion parameters. For low-latency use cases like search or conversational applications, LLMs need a lot of processing power, which is usually only available in high-end GPUs. Unfortunately, many organizations find it impossible to incorporate state-of-the-art LLMs in their applications because of the accompanying expenses, which can be exorbitant.

This post covers optimization strategies that help LLMs run more efficiently on Intel CPUs by reducing their size and inference latency.

An Introduction to Quantization

LLMs are typically trained with 16-bit floating-point parameters (FP16/BF16). Storing the value of a single weight or activation therefore requires two bytes of memory. In addition, floating-point arithmetic is slower and more complex than integer arithmetic, and it demands more processing resources.

Quantization is a model-compression technique that seeks to address both issues by limiting the range of possible values for the model parameters. For example, you can quantize a model to a lower precision, such as 8-bit integers (INT8), to shrink it and replace complex floating-point operations with simpler, faster integer operations.


To put it briefly, quantization shrinks the range of values that model parameters can take. When it works well, it reduces the size of your model by at least 2x without harming its accuracy.
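As a rough illustration of the arithmetic involved, the sketch below maps an FP32 tensor to INT8 with a simple symmetric (absolute-maximum) scheme. The function names and tensors are made up for illustration; real libraries use more elaborate calibration.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: x is approximated by scale * q."""
    scale = x.abs().max() / 127.0          # map the largest magnitude to 127
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale               # recover an FP32 approximation

x = torch.randn(4, 8)                      # stand-in for FP32 weights or activations
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print("max abs error:", (x - x_hat).abs().max().item())
```

Each value now fits in a single byte instead of two (or four), which is where the 2x or larger size reduction comes from.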

The best results are usually obtained by applying quantization during training, a technique known as quantization-aware training (QAT). If you would rather quantize an existing model, post-training quantization (PTQ) is a much faster method that requires very little computing power.

There are several quantization tools available. For instance, PyTorch supports quantization out of the box. The Hugging Face Optimum Intel library, which offers developer-friendly QAT and PTQ APIs, is another option.
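For example, PyTorch's built-in post-training dynamic quantization can convert the linear layers of a model to INT8 in a single call. The toy model below is only a placeholder for a real trained network:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be a trained network.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))

# Post-training dynamic quantization: weights are stored as INT8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)
```

Optimum Intel wraps similar functionality (via Intel Neural Compressor) behind higher-level APIs, but the underlying idea is the same.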

Quantizing LLMs

According to recent studies, however, current quantization methods don't work well with LLMs. Specifically, LLMs exhibit large-magnitude outliers in specific activation channels, across all layers and tokens.

Take the OPT-13B model as an illustration: one of its activation channels has values that are significantly higher than those of every other channel, for every token. This phenomenon appears in every Transformer layer of the model.
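One way to observe this yourself (a minimal sketch, not Intel's or the paper's measurement code) is to hook a linear layer and record the per-channel maximum absolute activation seen during a forward pass; channels whose maxima dwarf the rest are the outliers in question.

```python
import torch
import torch.nn as nn

# Placeholder layer; in practice this would be a projection inside an LLM block.
layer = nn.Linear(512, 512)
channel_max = torch.zeros(512)

def record_channel_max(module, inputs, output):
    # inputs[0] has shape (tokens, channels); track the largest magnitude per channel.
    channel_max.copy_(torch.maximum(channel_max, inputs[0].abs().amax(dim=0)))

layer.register_forward_hook(record_channel_max)
layer(torch.randn(128, 512))               # one batch of token activations

print("largest per-channel activation magnitudes:", channel_max.topk(5).values)
```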

To date, the most effective quantization techniques quantize activations token-wise, which either truncates the outliers or underflows the low-magnitude activations; both options severely harm model quality. And quantization-aware training requires additional training, which is usually impractical because of the data and compute it demands.

A novel quantization technique called SmoothQuant solves this problem. It applies a joint mathematical transformation to weights and activations, reducing the ratio between outlier and non-outlier values for the activations at the cost of increasing it for the weights. This transformation makes the Transformer's layers "quantization-friendly" and enables 8-bit quantization without hurting model quality. As a consequence, SmoothQuant produces models that are smaller, faster, and compatible with Intel CPU platforms.
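The heart of the idea fits in a few lines. The sketch below is a simplified illustration based on the SmoothQuant paper, using a migration strength of alpha = 0.5 and made-up tensors: per-channel smoothing factors divide the activations and multiply the weights, moving the quantization difficulty from activations to weights while leaving the layer's output mathematically unchanged.

```python
import torch

alpha = 0.5                                 # migration strength, as in the SmoothQuant paper
X = torch.randn(64, 512) * torch.rand(512)  # activations (tokens x in_channels), uneven channel scales
W = torch.randn(512, 256)                   # linear-layer weights (in_channels x out_channels)

# Per-input-channel smoothing factors balance activation and weight magnitudes.
s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1 - alpha)
s = s.clamp(min=1e-5)                       # guard against degenerate channels

X_smooth = X / s                            # activations become easier to quantize...
W_smooth = W * s.unsqueeze(1)               # ...and the difficulty is absorbed by the weights

# The smoothed layer computes the same output as the original one.
print("max deviation:", (X @ W - X_smooth @ W_smooth).abs().max().item())
```

Because the smoothing is folded into the weights offline, it adds no overhead at inference time; only the now well-behaved activations and weights are quantized to INT8.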

Quantizing LLMs with SmoothQuant

Intel has quantized a number of LLMs with SmoothQuant-O3, including OPT 2.7B and 6.7B, LLaMA 7B, Alpaca 7B, Vicuna 7B, BloomZ 7.1B, and MPT-7B-chat. They also evaluated the accuracy of the quantized models with the Language Model Evaluation Harness.

Their findings are summarized in the table below. The second column shows the ratio of benchmarks that improved after quantization. The third column shows the mean average degradation; a negative value indicates that the benchmark improved.

OPT models

As you can see, OPT models are excellent candidates for SmoothQuant quantization. The quantized models are almost 2x smaller than their pretrained 16-bit counterparts. Most of the metrics improve, and those that don't are only marginally penalized.

The picture is slightly more contrasted for BloomZ 7.1B and LLaMA 7B. Models are compressed by roughly 2x, and about half of the tasks see metric improvements. The other half is only marginally affected, with a single task showing more than 3% relative degradation.

Working with smaller models has the obvious advantage of significantly lowering inference latency. Here is an example of real-time text generation with the MPT-7B-chat model on a single-socket Intel Sapphire Rapids CPU with 32 cores and a batch size of 1.

The model is prompted as follows: "A conversation between an AI assistant and an inquisitive user. The assistant responds to the user's questions in a kind, thorough, and helpful manner. USER: How can Hugging Face contribute to the democratization of NLP? ASSISTANT:"
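For readers who want to try a comparable (unquantized) run, the sketch below uses the public mosaicml/mpt-7b-chat checkpoint with the Hugging Face transformers API. It does not reproduce the INT8 SmoothQuant pipeline Intel benchmarked, and the sampling parameters are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-7b-chat"          # public FP checkpoint, not the quantized variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

prompt = (
    "A conversation between an AI assistant and an inquisitive user. "
    "The assistant responds to the user's questions in a kind, thorough, "
    "and helpful manner.\n"
    "USER: How can Hugging Face contribute to the democratization of NLP?\nASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```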

The example demonstrates the combined benefits of 8-bit quantization and 4th Gen Intel Xeon CPUs, which lead to very low per-token generation latency. This level of performance definitely makes it possible to run LLMs on CPU platforms, giving customers more IT flexibility and better cost-effectiveness than ever before.

A Chat Experience on Intel Xeon Processors

Thanks to the advent of comparatively smaller models like Alpaca, BloomZ, and Vicuna, businesses now have a new opportunity to reduce the cost of fine-tuning and inference in production. As shown above, high-quality quantization enables high-quality chat experiences on Intel CPU platforms, without the need for massive LLMs and complex AI accelerators.

Q8-Chat (pronounced "cute chat") is an exciting new demo running in Hugging Face Spaces, built in collaboration with Intel. Q8-Chat offers a ChatGPT-like conversation experience while running on a single-socket Intel Sapphire Rapids CPU with 32 cores and a batch size of 1.
