Using the Intel Extension for PyTorch to Speed Up PyTorch. Enhance PyTorch Performance with an Open-Source Extension.
Intel Extension for PyTorch
Intel engineers collaborate with the open-source PyTorch community to improve deep learning (DL) training and inference performance. The Intel Extension for PyTorch is an open-source extension that boosts DL performance on Intel CPUs. Many of its optimizations will eventually be incorporated into future PyTorch core releases, but the extension gives PyTorch users earlier access to the latest features and optimizations. In addition to CPUs, the Intel Extension for PyTorch will soon support Intel GPUs.
The Intel Extension for PyTorch optimizes both imperative mode and graph mode, covering PyTorch operators, the graph, and the runtime. Optimized operators and kernels are registered through the PyTorch dispatching mechanism. During execution, the extension replaces a selection of ATen operators with optimized counterparts and provides an additional set of custom operators and optimizers for common use cases.
In graph mode, additional graph optimization passes are applied to further improve performance. Runtime optimizations are encapsulated in the runtime extension module, which offers a few PyTorch frontend APIs that give users finer-grained control over the thread runtime.
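As a rough illustration of that runtime frontend, the sketch below pins inference to an explicit set of CPU cores. The `CPUPool`/`pin` names follow the experimental runtime extension documentation and may differ between releases, and the model is only a placeholder, so treat the details as assumptions rather than a definitive recipe.

```python
import torch
import intel_extension_for_pytorch as ipex

# A small stand-in model; any eval-mode nn.Module works here.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3),
    torch.nn.ReLU(),
).eval()
x = torch.randn(1, 3, 224, 224)

# Pin the computation to a fixed set of cores via the runtime extension.
# CPUPool/pin are experimental APIs; names may vary by release.
cpu_pool = ipex.cpu.runtime.CPUPool(core_ids=[0, 1, 2, 3])
with ipex.cpu.runtime.pin(cpu_pool):
    with torch.no_grad():
        y = model(x)
```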

A Look at the Improvements
Memory layout is a fundamental optimization for vision-related operators. Using the right memory format for input tensors can significantly improve the performance of PyTorch models. In general, the channels-last memory format is advantageous on several hardware backends; see the resources below and the short sketch that follows them:
- (Beta) Channels Last Memory Format in PyTorch
- Efficient PyTorch: Tensor Memory Format Matters
- Understanding Memory Formats
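As a minimal sketch of the idea, the snippet below switches a convolutional model and its input to the channels-last (NHWC) format using stock PyTorch APIs; the single conv layer is just a placeholder for a real vision model.

```python
import torch

# Any convolutional model benefits; a single conv layer keeps the sketch short.
model = torch.nn.Conv2d(3, 64, kernel_size=3).eval()
x = torch.randn(1, 3, 224, 224)

# Move both the weights and the input to the channels-last (NHWC) memory format.
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)
```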
To improve vectorization and cache reuse, the oneAPI Deep Neural Network Library (oneDNN) uses a blocked memory layout for weights. To avoid runtime conversion, weights are converted to the predetermined optimal block format before oneDNN operators execute. This technique, known as weight prepacking, is enabled for both training and inference when users invoke the extension’s ipex.optimize frontend API.
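A minimal inference sketch of that frontend call is shown below; the tiny Sequential model is a placeholder for a real workload, and ipex.optimize applies weight prepacking along with its other inference optimizations.

```python
import torch
import intel_extension_for_pytorch as ipex

# Placeholder model standing in for a real inference workload.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3),
    torch.nn.ReLU(),
).eval()

# ipex.optimize applies weight prepacking (among other optimizations),
# converting weights to oneDNN's blocked format ahead of execution.
model = ipex.optimize(model)

x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
with torch.no_grad():
    y = model(x)
```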
The Intel Extension for PyTorch also provides several customized operators to accelerate common topologies, such as fused interaction and merged embedding bag for recommendation models like DLRM, and ROIAlign and FrozenBatchNorm for object detection workloads.
Because optimizers are crucial to training performance, the Intel Extension for PyTorch includes finely tuned fused and split optimizers. The fused kernels for Lamb, Adagrad, and SGD are exposed through the ipex.optimize frontend, so users do not need to modify their model code. These kernels fuse the chain of memory-bound operators that act on the model parameters and their gradients in the weight update step, so the data stays in cache rather than being reloaded from memory. More fused optimizers are planned for upcoming extension releases.
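The sketch below shows how a training loop might pick up the fused SGD kernel through ipex.optimize; the model, loss, and data are placeholders, and the only change from a stock PyTorch loop is the single ipex.optimize call.

```python
import torch
import intel_extension_for_pytorch as ipex

# Placeholder model and optimizer; SGD has a fused counterpart in the extension.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

model.train()
# Returns both the optimized model and an optimizer whose weight-update
# step uses the fused kernel when one is available.
model, optimizer = ipex.optimize(model, optimizer=optimizer)

x = torch.randn(64, 1024)
target = torch.randn(64, 1024)

optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
optimizer.step()
```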
BF16 mixed precision training delivers a notable performance boost through faster computation, lower memory bandwidth pressure, and reduced memory usage. However, as training progresses, weight updates can become too small to accumulate in BF16. The common remedy is to keep a master copy of the weights in FP32, which doubles the memory requirement. Because that extra memory strains workloads with large weight tensors, such as recommendation models, the extension instead uses a “split” optimization for BF16 training: each FP32 parameter is split into a top half and a bottom half. The top half, the first 16 bits, is exactly a BF16 number, while the bottom half, the last 16 bits, is kept to preserve accuracy. The top half is used in forward and backward propagation, taking advantage of native BF16 support on Intel CPUs. During the parameter update, the top and bottom halves are concatenated to restore the parameters to FP32, which prevents accuracy loss.
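A minimal BF16 training sketch is shown below, assuming a placeholder model and dataset. Passing dtype=torch.bfloat16 to ipex.optimize enables the extension's BF16 master-weight handling; whether the split scheme described above is applied depends on the optimizer and the installed release.

```python
import torch
import intel_extension_for_pytorch as ipex

# Placeholder classifier; real workloads swap in their own model and data.
model = torch.nn.Linear(1024, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

model.train()
# dtype=torch.bfloat16 requests BF16 training; the extension manages the
# FP32 master-weight bookkeeping internally.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x = torch.randn(64, 1024)
target = torch.randint(0, 10, (64,))

optimizer.zero_grad()
# Run the forward pass under BF16 autocast on the CPU.
with torch.cpu.amp.autocast():
    output = model(x)
    loss = criterion(output, target)
loss.backward()
optimizer.step()
```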
Deep learning practitioners have shown that reduced numerical precision can still be beneficial: using 16-bit multipliers with 32-bit accumulators improves training and inference performance while maintaining accuracy, and for some inference workloads even 8-bit multipliers with 32-bit accumulators work well. Lower precision improves performance in two ways: the higher multiply-accumulate throughput speeds up compute-bound operations, and the smaller footprint reduces memory transactions in the memory hierarchy, which speeds up memory bandwidth-bound operations.
Intel added native BF16 support to its 3rd generation Intel Xeon Scalable processors with BF16→FP32 fused multiply-add (FMA) and FP32→BF16 conversion Intel Advanced Vector Extensions-512 (Intel AVX-512) instructions, which double the theoretical compute throughput over FP32 FMAs. BF16 will be further accelerated by the Intel Advanced Matrix Extensions (Intel AMX) instruction set in the next generation of Intel Xeon Scalable processors.
In deep networks, quantization compresses information by reducing the numerical precision of the weights and/or activations. Converting parameters from FP32 to INT8 makes the model smaller and significantly reduces memory and compute requirements. Intel introduced the AVX-512 VNNI instruction set extension in 2nd generation Intel Xeon Scalable processors, which computes INT8 data faster and increases throughput. PyTorch provides several methods for quantizing models; one of them is sketched below.
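As one of those built-in paths, the sketch below applies PyTorch's post-training dynamic quantization to the Linear layers of a placeholder model; static quantization follows a similar prepare/calibrate/convert flow.

```python
import torch

# Float model with Linear layers, which dynamic quantization targets.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).eval()

# Post-training dynamic quantization: weights are stored as INT8 and
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.no_grad():
    y = quantized_model(x)
```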
Graph optimizations such as operator fusion maximize the performance of the underlying kernel implementations by optimizing overall computation and memory bandwidth. The Intel Extension for PyTorch applies operator fusion passes based on the TorchScript IR, backed by the fusion capability in oneDNN and the specialized fused kernels in the extension. These optimizations happen transparently for users. Constant folding is a compile-time graph optimization that replaces operators whose inputs are all constant with precomputed constant nodes.
Folding Convolution+BatchNorm for inference yields non-negligible performance gains on many models, and this benefit is provided to users through the ipex.optimize frontend API. To get the best of both worlds, Intel is collaborating with the PyTorch community to improve the fusion capability with PyTorch NNC (Neural Network Compiler). A sketch of the TorchScript workflow that enables these fusion passes follows.
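A minimal sketch of that workflow is shown below, assuming a placeholder Conv+BatchNorm model: ipex.optimize prepares the eval-mode model, and tracing plus freezing produces a TorchScript graph on which the fusion and constant-folding passes can run.

```python
import torch
import intel_extension_for_pytorch as ipex

# Placeholder model containing a Conv+BatchNorm pair that can be folded.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=3),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
).eval()
x = torch.randn(1, 3, 224, 224)

# Apply the extension's inference optimizations, then trace and freeze so
# the TorchScript-based graph passes (fusion, constant folding) can run.
model = ipex.optimize(model)
with torch.no_grad():
    traced = torch.jit.trace(model, x)
    traced = torch.jit.freeze(traced)
    y = traced(x)
```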