Intel Neural Compressor
Increase AI Inference Speed without Losing Accuracy.
Deploy More Efficient Deep Learning Models
Intel Neural Compressor optimizes models to reduce their size and speed up deep learning inference for deployment on CPUs, GPUs, or Intel Gaudi AI accelerators. This open source Python library automates popular model optimization techniques, including quantization, pruning, and knowledge distillation, across multiple deep learning frameworks.
With this library, you can:
- Converge rapidly on quantized models with automated, accuracy-driven tuning strategies (a minimal sketch follows this list).
- Prune the least significant parameters from large models.
- Distill knowledge from a larger model to improve the accuracy of a smaller model for deployment.
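For example, the accuracy-driven tuning loop can be driven from a few lines of Python. The following is a minimal sketch assuming the Intel Neural Compressor 2.x API; the toy model, random calibration data, and constant `eval_func` are stand-ins for a real workload.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from neural_compressor import PostTrainingQuantConfig
from neural_compressor.config import AccuracyCriterion, TuningCriterion
from neural_compressor.quantization import fit

# Toy FP32 model and random calibration data (replace with your own).
fp32_model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
calib_loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.zeros(64, dtype=torch.long)),
                          batch_size=8)

def eval_func(model):
    # Placeholder metric: return your validation accuracy here.
    return 1.0

conf = PostTrainingQuantConfig(
    approach="static",                                           # post-training static quantization
    accuracy_criterion=AccuracyCriterion(tolerable_loss=0.01),   # accept <= 1% relative accuracy drop
    tuning_criterion=TuningCriterion(max_trials=100),            # cap the tuning search
)

# fit() tries quantization configurations until eval_func stays within the
# tolerable accuracy loss, then returns the quantized model.
q_model = fit(model=fp32_model, conf=conf, calib_dataloader=calib_loader, eval_func=eval_func)
q_model.save("./quantized_model")
```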
Intel Neural Compressor is part of the full set of Intel AI and machine learning development tools and resources.
Get the AI Tools here
The AI Tools Selector provides accelerated machine learning and data analytics pipelines with optimized deep learning frameworks and high-performing Python libraries such as Intel Neural Compressor.
Get the Stand-Alone Version here
Intel Neural Compressor is also available as a stand-alone download. Choose your preferred repository or get binaries from Intel.
Features

Model Optimization Techniques
- Quantize activations and weights to int8, FP8, or a combination of FP32, FP16, FP8, bfloat16, and int8 to reduce model size and speed up inference while minimizing precision loss. Quantize dynamically, during training, or after training, depending on the runtime data range.
- Reduce model size by pruning parameters that have little effect on accuracy (see the pruning sketch after this list). Configure the pruning patterns, criteria, and schedule.
- Automatically tune quantization and pruning to reach target accuracy.
- Distill knowledge from a larger model ("teacher") into a smaller model ("student") to improve the accuracy of the compressed model.
- Customize quantization for low-bit inference with advanced techniques such as weight-only quantization (WOQ), layer-wise quantization, and SmoothQuant.
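As a concrete illustration of the pruning configuration, here is a minimal sketch assuming the Intel Neural Compressor 2.x training API; the toy model, synthetic data, and training loop are placeholders, and the pattern, criterion, and schedule values shown are just one possible choice.

```python
import torch

from neural_compressor.config import WeightPruningConfig
from neural_compressor.training import prepare_compression

# Toy model, data, and optimizer (replace with your own training setup).
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
train_data = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(50)]

pruning_conf = WeightPruningConfig(pruning_configs=[{
    "pruning_type": "snip_momentum",   # pruning criterion
    "pattern": "4x1",                  # structured 4x1 block pattern
    "target_sparsity": 0.8,            # prune 80% of the targeted weights
    "start_step": 0,                   # pruning schedule
    "end_step": 100,
}])

compression_manager = prepare_compression(model, pruning_conf)
compression_manager.callbacks.on_train_begin()
for epoch in range(2):
    for step, (inputs, labels) in enumerate(train_data):
        compression_manager.callbacks.on_step_begin(step)
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        compression_manager.callbacks.on_before_optimizer_step()
        optimizer.step()
        compression_manager.callbacks.on_after_optimizer_step()
        optimizer.zero_grad()
        compression_manager.callbacks.on_step_end()
compression_manager.callbacks.on_train_end()
```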
Automation
- Use built-in strategies to automatically apply quantization techniques to operations and meet expected accuracy targets.
- Orchestrate multiple model optimization techniques in a single one-shot optimization pass (a sketch follows this list).
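A rough sketch of the one-shot idea, assuming the Intel Neural Compressor 2.x training API and placeholder teacher/student models: several optimization configs are passed to a single compression manager, which then drives one training loop (as in the pruning sketch above).

```python
from neural_compressor.config import (DistillationConfig, KnowledgeDistillationLossConfig,
                                      WeightPruningConfig)
from neural_compressor.training import prepare_compression

# teacher_model and student_model are placeholders for your own models.
distillation_conf = DistillationConfig(teacher_model=teacher_model,
                                       criterion=KnowledgeDistillationLossConfig())
pruning_conf = WeightPruningConfig(target_sparsity=0.8, pattern="4x1")

# One compression manager orchestrates both techniques in a single training run.
compression_manager = prepare_compression(student_model, [distillation_conf, pruning_conf])
# Drive your normal training loop through compression_manager.callbacks,
# exactly as in the pruning sketch above.
```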
Interoperability
- Optimize and export PyTorch and TensorFlow models.
- Optimize and export Open Neural Network Exchange (ONNX) Runtime models with Intel Neural Compressor 2.x (a sketch follows this list). As of version 3.x, Intel Neural Compressor is upstreamed into the open source ONNX project for integrated cross-platform deployment.
- Configure and tune model compression through familiar PyTorch, TensorFlow, or Hugging Face Transformers-style APIs.
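A minimal sketch of the ONNX Runtime path, assuming the Intel Neural Compressor 2.x API; `model.onnx` is a placeholder path, and dynamic post-training quantization is used here because it needs no calibration data.

```python
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# "model.onnx" is a placeholder; point this at your exported ONNX model.
conf = PostTrainingQuantConfig(approach="dynamic")   # dynamic PTQ: no calibration dataloader needed
q_model = fit(model="model.onnx", conf=conf)
q_model.save("model_int8.onnx")
```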
Case Studies
Palo Alto Networks Achieves 6x Lower Inference Latency
Palo Alto Networks quantized its models to int8, using advanced instruction sets and accelerators to deliver the response times required across a variety of cybersecurity models.
Sustainable AI with Intel-Optimized Hardware and Software
In a series of studies, HPE Services cut energy use by at least 68% by combining Intel AI software with int8 post-training static quantization.
Delphai Accelerates Natural Language Processing Models for Its Search Engine
By quantizing its models to int8, Delphai deployed less expensive CPU-based cloud instances and increased inference speed without compromising accuracy.
Demonstrations
Microscaling (MX) Quantization
Balance accuracy and memory use when quantizing Microsoft Floating Point (MSFP) data types to 8-, 6-, or 4-bit MX data types.
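To make the idea concrete, here is a purely conceptual sketch in plain PyTorch (not the library's API): elements are grouped into small blocks that share one power-of-two scale, and each element is stored at low precision. Real MX formats use low-bit floating-point element types rather than the integers used here.

```python
import torch

def mx_style_quantize(x: torch.Tensor, block_size: int = 32, elem_bits: int = 8):
    """Fake-quantize a tensor with per-block shared power-of-two scales."""
    pad = (-x.numel()) % block_size
    blocks = torch.nn.functional.pad(x.flatten(), (0, pad)).reshape(-1, block_size)
    # Shared scale per block: a power of two derived from the block's max magnitude.
    max_abs = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    qmax = 2 ** (elem_bits - 1) - 1
    scale = 2.0 ** torch.ceil(torch.log2(max_abs / qmax))
    # Quantize elements to signed integers, then dequantize for inspection.
    q = torch.clamp(torch.round(blocks / scale), -qmax - 1, qmax)
    deq = (q * scale).flatten()[: x.numel()].reshape(x.shape)
    return q, scale, deq

x = torch.randn(1000)
_, _, x_deq = mx_style_quantize(x, block_size=32, elem_bits=8)
print("max abs error:", (x - x_deq).abs().max().item())
```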
The AutoRound Quantization Algorithm
Achieve near-lossless weight-only quantization (WOQ) compression for popular large language models (LLMs).
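AutoRound is developed by Intel and also ships as the standalone auto-round package that Intel Neural Compressor integrates with. A minimal sketch assuming that package and an illustrative small model; argument names may differ by version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"   # illustrative small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune the rounding of 4-bit, group-wise quantized weights with AutoRound.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()
autoround.save_quantized("./opt-125m-autoround-w4")
```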
Quantize LLMs with SmoothQuant
LLMs commonly exhibit large-magnitude outliers in certain activation channels. Learn how to quantize a Hugging Face Transformers model to 8-bit with the SmoothQuant approach, which handles these outliers.
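A minimal sketch of that recipe, assuming the Intel Neural Compressor 2.x API with the Intel Extension for PyTorch backend; the model name, calibration dataloader, and alpha value are illustrative, and the exact recipe keys may vary by version.

```python
from transformers import AutoModelForCausalLM
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")   # illustrative model

conf = PostTrainingQuantConfig(
    backend="ipex",                                  # assumes Intel Extension for PyTorch is installed
    recipes={"smooth_quant": True,
             "smooth_quant_args": {"alpha": 0.5}},   # how much outlier scale to shift onto the weights
)

# calib_loader is a placeholder: a dataloader of representative text batches.
q_model = fit(model=model, conf=conf, calib_dataloader=calib_loader)
q_model.save("./opt-125m-int8-smoothquant")
```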
Quantize Large Language Models with Only a Few Lines of Code
Quantizing LLMs to int4 reduces model size by up to 8x and accelerates inference. Learn how to get started with weight-only quantization (WOQ) and see its impact on the accuracy of common LLMs.
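A minimal sketch of int4 weight-only quantization with the round-to-nearest (RTN) algorithm, assuming the Intel Neural Compressor 3.x PyTorch API; the model and settings are illustrative.

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")   # illustrative model

quant_config = RTNConfig(bits=4, group_size=32)   # 4-bit weights with per-group scales
model = prepare(model, quant_config)              # wrap the layers selected for WOQ
model = convert(model)                            # replace FP32 weights with packed int4 weights
```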
Distill and Quantize BERT for Text Classification
Quantize the BERT base model and perform knowledge distillation using the Stanford Sentiment Treebank 2 (SST-2) dataset. The resulting BERT-Mini model runs inference up to 16 times faster.
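A minimal sketch of the distillation step, assuming the Intel Neural Compressor 2.x training API; the teacher/student models, dataloader, optimizer, epoch count, and loss weights are placeholders for a BERT-style setup.

```python
import torch

from neural_compressor.config import DistillationConfig, KnowledgeDistillationLossConfig
from neural_compressor.training import prepare_compression

# teacher_model, student_model, train_loader, and optimizer are placeholders.
distil_loss = KnowledgeDistillationLossConfig(temperature=2.0,
                                              loss_types=["CE", "KL"],
                                              loss_weights=[0.5, 0.5])
conf = DistillationConfig(teacher_model=teacher_model, criterion=distil_loss)

compression_manager = prepare_compression(student_model, conf)
compression_manager.callbacks.on_train_begin()
loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(3):
    for inputs, labels in train_loader:
        outputs = student_model(inputs)
        loss = loss_fn(outputs, labels)
        # Blend the task loss with the distillation loss against the teacher.
        loss = compression_manager.callbacks.on_after_compute_loss(inputs, outputs, loss)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
compression_manager.callbacks.on_train_end()
```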
Fine-Grained FX-Based PyTorch Quantization
Convert an imperative model to a graph model, then perform post-training static quantization, post-training dynamic quantization, or quantization-aware training.
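A minimal sketch of the quantization-aware training variant, assuming the Intel Neural Compressor 2.x API (the PyTorch path converts the imperative model to a graph module behind the scenes); the toy model and the elided fine-tuning loop are placeholders.

```python
import torch

from neural_compressor import QuantizationAwareTrainingConfig
from neural_compressor.training import prepare_compression

# Toy FP32 model as a placeholder for your own network.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))

conf = QuantizationAwareTrainingConfig()
compression_manager = prepare_compression(model, conf)
compression_manager.callbacks.on_train_begin()   # inserts fake-quantization modules
model = compression_manager.model
# ... run your usual fine-tuning loop on `model` here ...
compression_manager.callbacks.on_train_end()     # finalize the int8 conversion
compression_manager.save("./qat_int8_model")     # illustrative output path
```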
Intel Neural Compressor Specs
| Specifications | Details |
|---|---|
| Processor | Intel Xeon processor, Intel Xeon CPU Max Series, Intel Core Ultra processor, Intel Gaudi AI accelerator, Intel Data Center GPU Max Series |
| Operating Systems | Linux, Windows |
| Language | Python |