Intel Neural Compressor
Increase AI Inference Speed without Losing Accuracy.
Deploy More Efficient Deep Learning Models
Intel Neural Compressor optimizes models to reduce their size and speed up deep learning inference for deployment on CPUs, GPUs, or Intel Gaudi AI accelerators. This open source Python library automates popular model optimization techniques, including quantization, pruning, and knowledge distillation, across multiple deep learning frameworks.
With this library, you can:
- Converge rapidly on quantized models with automated, accuracy-driven tuning strategies (a minimal sketch follows this list).
- Prune the least significant parameters from large models.
- Distill knowledge from a larger model to improve the accuracy of a smaller model for deployment.
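For example, the accuracy-driven tuning loop can be driven from a few lines of Python. The following is a minimal sketch assuming the Intel Neural Compressor 2.x API; the toy model, random calibration data, and constant `eval_func` are stand-ins for a real workload.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from neural_compressor import PostTrainingQuantConfig
from neural_compressor.config import AccuracyCriterion, TuningCriterion
from neural_compressor.quantization import fit

# Toy FP32 model and random calibration data (replace with your own).
fp32_model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
calib_loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.zeros(64, dtype=torch.long)),
                          batch_size=8)

def eval_func(model):
    # Placeholder metric: return your validation accuracy here.
    return 1.0

conf = PostTrainingQuantConfig(
    approach="static",                                           # post-training static quantization
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),   # accept <= 1% relative accuracy drop
    tuning_criterion=TuningCriterion(max_trials=100),            # cap the tuning search
)

# fit() tries quantization configurations until eval_func stays within the
# tolerable accuracy loss, then returns the quantized model.
q_model = fit(model=fp32_model, conf=conf, calib_dataloader=calib_loader, eval_func=eval_func)
q_model.save("./quantized_model")
```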
Intel Neural Compressor is part of the full set of Intel AI and machine learning development tools and resources.
Get the AI Tools here
The AI Tools Selector provides accelerated machine learning and data analytics pipelines with optimized deep learning frameworks and high-performing Python libraries such as Intel Neural Compressor.
Get the Stand-Alone Version here
Intel Neural Compressor is also available as a stand-alone download. Choose your preferred repository or get binaries from Intel.
Features

Model Optimization Techniques
- Quantize activations and weights to int8, FP8, or a combination of FP32, FP16, FP8, bfloat16, and int8 to reduce model size and speed up inference while minimizing precision loss. Quantize dynamically, during training, or after training, depending on the runtime data range.
- Reduce model size by pruning parameters that have little effect on accuracy (see the pruning sketch after this list). Configure the pruning patterns, criteria, and schedule.
- Automatically tune quantization and pruning to reach target accuracy.
- Distill knowledge from a larger model ("teacher") into a smaller model ("student") to improve the accuracy of the compressed model.
- Customize quantization for low-bit inference with advanced techniques such as weight-only quantization (WOQ), layer-wise quantization, and SmoothQuant.
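As a concrete illustration of the pruning configuration, here is a minimal sketch assuming the Intel Neural Compressor 2.x training API; the toy model, synthetic data, and training loop are placeholders, and the pattern, criterion, and schedule values shown are just one possible choice.

```python
import torch

from neural_compressor.config import WeightPruningConfig
from neural_compressor.training import prepare_compression

# Toy model, data, and optimizer (replace with your own training setup).
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
train_data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(50)]

pruning_conf = WeightPruningConfig(pruning_configs=[{
    "pruning_type": "snip_momentum",   # pruning criterion
    "pattern": "4x1",                  # structured 4x1 block pattern
    "target_sparsity": 0.8,            # prune 80% of the targeted weights
    "start_step": 0,                   # pruning schedule
    "end_step": 100,
}])

compression_manager = prepare_compression(model, pruning_conf)
compression_manager.callbacks.on_train_begin()
for epoch in range(2):
    for step, (inputs, labels) in enumerate(train_data):
        compression_manager.callbacks.on_step_begin(step)
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        compression_manager.callbacks.on_before_optimizer_step()
        optimizer.step()
        compression_manager.callbacks.on_after_optimizer_step()
        optimizer.zero_grad()
        compression_manager.callbacks.on_step_end()
compression_manager.callbacks.on_train_end()
```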
Automation
- Use built-in strategies to automatically apply quantization techniques to operations and meet expected accuracy targets.
- Orchestrate multiple model optimization techniques in a single one-shot optimization pass (a sketch follows this list).
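A rough sketch of the one-shot idea, assuming the Intel Neural Compressor 2.x training API and placeholder teacher/student models: several optimization configs are passed to a single compression manager, which then drives one training loop (as in the pruning sketch above).

```python
from neural_compressor.config import (DistillationConfig, KnowledgeDistillationLossConfig,
                                      WeightPruningConfig)
from neural_compressor.training import prepare_compression

# teacher_model and student_model are placeholders for your own models.
distillation_conf = DistillationConfig(teacher_model=teacher_model,
                                       criterion=KnowledgeDistillationLossConfig())
pruning_conf = WeightPruningConfig(target_sparsity=0.8, pattern="4x1")

# One compression manager orchestrates both techniques in a single training run.
compression_manager = prepare_compression(student_model, [distillation_conf, pruning_conf])
# Drive your normal training loop through compression_manager.callbacks,
# exactly as in the pruning sketch above.
```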
Interoperability
- Optimize and export PyTorch and TensorFlow models.
- Optimize and export Open Neural Network Exchange (ONNX) Runtime models with Intel Neural Compressor 2.x (a sketch follows this list). As of version 3.x, Intel Neural Compressor is upstreamed into the open source ONNX project for integrated cross-platform deployment.
- Configure and tune model compression through familiar PyTorch, TensorFlow, or Hugging Face Transformers-style APIs.
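A minimal sketch of the ONNX Runtime path, assuming the Intel Neural Compressor 2.x API; `model.onnx` is a placeholder path, and dynamic post-training quantization is used here because it needs no calibration data.

```python
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# "model.onnx" is a placeholder; point this at your exported ONNX model.
conf = PostTrainingQuantConfig(approach="dynamic")   # dynamic PTQ: no calibration dataloader needed
q_model = fit(model="model.onnx", conf=conf)
q_model.save("model_int8.onnx")
```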
Case Studies
Palo Alto Networks Achieves 6x Lower Inference Latency
Palo Alto Networks quantized its models to int8, using advanced instruction sets and accelerators to deliver the response times required across a variety of cybersecurity models.
Sustainable AI with Intel-Optimized Hardware and Software
In a series of studies, HPE Services cut energy use by at least 68% by combining Intel AI software with int8 post-training static quantization.
Delphai Accelerates Natural Language Processing Models for Its Search Engine
By quantizing its models to int8, Delphai deployed less expensive CPU-based cloud instances and increased inference speed without compromising accuracy.
Demonstrations
Microscaling (MX) Quantization
Balance accuracy and memory use when quantizing Microsoft Floating Point (MSFP) data types to 8-, 6-, or 4-bit MX data types.
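To make the idea concrete, here is a purely conceptual sketch in plain PyTorch (not the library's API): elements are grouped into small blocks that share one power-of-two scale, and each element is stored at low precision. Real MX formats use low-bit floating-point element types rather than the integers used here.

```python
import torch

def mx_style_quantize(x: torch.Tensor, block_size: int = 32, elem_bits: int = 8):
    """Fake-quantize a tensor with per-block shared power-of-two scales."""
    pad = (-x.numel()) % block_size
    blocks = torch.nn.functional.pad(x.flatten(), (0, pad)).reshape(-1, block_size)
    # Shared scale per block: a power of two derived from the block's max magnitude.
    max_abs = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    qmax = 2 ** (elem_bits - 1) - 1
    scale = 2.0 ** torch.ceil(torch.log2(max_abs / qmax))
    # Quantize elements to signed integers, then dequantize for inspection.
    q = torch.clamp(torch.round(blocks / scale), -qmax - 1, qmax)
    deq = (q * scale).flatten()[: x.numel()].reshape(x.shape)
    return q, scale, deq

x = torch.randn(1000)
_, _, x_deq = mx_style_quantize(x, block_size=32, elem_bits=8)
print("max abs error:", (x - x_deq).abs().max().item())
```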
The AutoRound Quantization Algorithm
Achieve near-lossless weight-only quantization (WOQ) compression for popular large language models (LLMs).
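AutoRound is developed by Intel and also ships as the standalone auto-round package that Intel Neural Compressor integrates with. A minimal sketch assuming that package and an illustrative small model; argument names may differ by version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"   # illustrative small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune the rounding of 4-bit, group-wise quantized weights with AutoRound.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./opt-125m-autoround-w4")
```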
Quantize LLMs with SmoothQuant
LLMs commonly exhibit large-magnitude outliers in certain activation channels. Learn how to quantize a Hugging Face Transformers model to 8-bit with the SmoothQuant approach, which handles these outliers.
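A minimal sketch of that recipe, assuming the Intel Neural Compressor 2.x API with the Intel Extension for PyTorch backend; the model name, calibration dataloader, and alpha value are illustrative, and the exact recipe keys may vary by version.

```python
from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")   # illustrative model

conf = PostTrainingQuantConfig(
    backend="ipex",                                  # assumes Intel Extension for PyTorch is installed
    recipes={"smooth_quant": True,
             "smooth_quant_args": {"alpha": 0.5}},   # how much outlier scale to shift onto the weights
)

# calib_loader is a placeholder: a dataloader of representative text batches.
q_model = fit(model=model, conf=conf, calib_dataloader=calib_loader)
q_model.save("./opt-125m-int8-smoothquant")
```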
Quantize Large Language Models with Only a Few Lines of Code
Quantizing LLMs to int4 reduces model size by up to 8x and accelerates inference. Learn how to get started with weight-only quantization (WOQ) and see its impact on the accuracy of common LLMs.
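A minimal sketch of int4 weight-only quantization with the round-to-nearest (RTN) algorithm, assuming the Intel Neural Compressor 3.x PyTorch API; the model and settings are illustrative.

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")   # illustrative model

quant_config = RTNConfig(bits=4, group_size=32)   # 4-bit weights with per-group scales
model = prepare(model, quant_config)              # wrap the layers selected for WOQ
model = convert(model)                            # replace FP32 weights with packed int4 weights
```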
Distill and Quantize BERT for Text Classification
Quantize the BERT base model and perform knowledge distillation using the Stanford Sentiment Treebank 2 (SST-2) dataset. The resulting BERT-Mini model runs inference up to 16 times faster.
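A minimal sketch of the distillation step, assuming the Intel Neural Compressor 2.x training API; the teacher/student models, dataloader, optimizer, epoch count, and loss weights are placeholders for a BERT-style setup.

```python
import torch

from neural_compressor.config import DistillationConfig, KnowledgeDistillationLossConfig
from neural_compressor.training import prepare_compression

# teacher_model, student_model, train_loader, and optimizer are placeholders.
distil_loss = KnowledgeDistillationLossConfig(temperature=2.0,
                                              loss_types=["CE", "KL"],
                                              loss_weights=[0.5, 0.5])
conf = DistillationConfig(teacher_model=teacher_model, criterion=distil_loss)

compression_manager = prepare_compression(student_model, conf)
compression_manager.callbacks.on_train_begin()
loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(3):
    for inputs, labels in train_loader:
        outputs = student_model(inputs)
        loss = loss_fn(outputs, labels)
        # Blend the task loss with the distillation loss against the teacher.
        loss = compression_manager.callbacks.on_after_compute_loss(inputs, outputs, loss)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
compression_manager.callbacks.on_train_end()
```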
Fine-Grained FX-Based PyTorch Quantization
Convert an imperative model to a graph model, then perform post-training static quantization, post-training dynamic quantization, or quantization-aware training.
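A minimal sketch of the quantization-aware training variant, assuming the Intel Neural Compressor 2.x API (the PyTorch path converts the imperative model to a graph module behind the scenes); the toy model and the elided fine-tuning loop are placeholders.

```python
import torch

from neural_compressor import QuantizationAwareTrainingConfig
from neural_compressor.training import prepare_compression

# Toy FP32 model as a placeholder for your own network.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))

conf = QuantizationAwareTrainingConfig()
compression_manager = prepare_compression(model, conf)
compression_manager.callbacks.on_train_begin()   # inserts fake-quantization modules
model = compression_manager.model
# ... run your usual fine-tuning loop on `model` here ...
compression_manager.callbacks.on_train_end()     # finalize the int8 conversion
compression_manager.save("./qat_int8_model")     # illustrative output path
```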
Intel Neural Compressor Specs
| Specifications | Details |
|---|---|
| Processor | Intel Xeon processor, Intel Xeon CPU Max Series, Intel Core Ultra processor, Intel Gaudi AI accelerator, Intel Data Center GPU Max Series |
| Operating Systems | Linux, Windows |
| Language | Python |