oneDNN Graph API: Simplifying High-Performance AI Fusion

oneDNN Graph API extends oneDNN with a flexible graph API to maximise the opportunity for generating efficient code on AI hardware. It automatically identifies the graph partitions that should be accelerated through fusion. For both inference and training use cases, the fusion patterns focus on combining compute-intensive operations such as convolution and matmul with their neighbouring operations.

From PyTorch 2.0 onwards, oneDNN Graph can speed up inference on x86-64 CPUs (primarily Intel Xeon processor-based machines) with the Float32 and BFloat16 datatypes, the latter through PyTorch’s Automatic Mixed Precision support. With BFloat16, the speedup is limited to machines that support the AVX512_BF16 ISA (Instruction Set Architecture), including machines that also support the AMX_BF16 ISA.
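
Whether a given Linux machine reports these ISA extensions can be checked before trying BFloat16. The sketch below is only an illustration: the helper names `parse_cpu_flags` and `bf16_support` are hypothetical, and it assumes the flag names `avx512_bf16` and `amx_bf16` as they appear in `/proc/cpuinfo` on Linux.

```python
from pathlib import Path

# Flag names as the Linux kernel reports them in /proc/cpuinfo.
BF16_FLAGS = ("avx512_bf16", "amx_bf16")


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line, e.g. on a non-x86 system


def bf16_support(cpuinfo_text: str) -> dict[str, bool]:
    """Map each BF16-related flag to whether the CPU reports it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in BF16_FLAGS}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(bf16_support(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo found; this check only applies to Linux.")
```

If neither flag is reported, Float32 remains the datatype to try.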

oneDNN Graph Usage

From the user’s point of view, usage is straightforward and intuitive: the only code change required is a single API invocation. To use oneDNN Graph with JIT-tracing, a model is profiled with an example input, as shown in Figure 1 below.

Figure 1: oneDNN Graph with JIT-tracing.
Image credit to Intel
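
For readers who cannot see the figure, the Float32 flow can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact code from the figure: `TinyConvNet` is a hypothetical stand-in model, and the input size and the number of warm-up runs are assumptions. The one oneDNN Graph-specific call is `torch.jit.enable_onednn_fusion(True)`.

```python
import torch
import torch.nn as nn


class TinyConvNet(nn.Module):
    """Hypothetical stand-in model: convolution followed by ReLU."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.conv(x))


model = TinyConvNet().eval()
example_input = torch.rand(1, 3, 32, 32)

# The single extra API call: enable oneDNN Graph fusion for TorchScript.
torch.jit.enable_onednn_fusion(True)

with torch.no_grad():
    expected = model(example_input)  # eager result, kept as a sanity check
    traced = torch.jit.trace(model, example_input)
    traced = torch.jit.freeze(traced)
    # Warm-up runs give the JIT a chance to profile and optimise the graph.
    traced(example_input)
    traced(example_input)
    # Later calls with the same input shape run the optimised graph.
    output = traced(example_input)

print(output.shape)
```

Comparing `output` with `expected` is a cheap way to confirm that the optimised model still matches eager execution.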

Once it has the model’s graph, oneDNN Graph identifies candidate operators for fusion based on the shape of the example input. Only static shapes are supported at this time. This means that any other input shape would neither be supported nor receive any performance benefit.
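
One way to live with the static-shape restriction is to send only inputs of the traced shape to the optimised model and everything else to the original one. The wrapper below is a hypothetical sketch rather than part of PyTorch or oneDNN Graph: `ShapeGuard` and `FakeTensor` are made-up names, and any object with a `.shape` attribute is assumed to work as input.

```python
from typing import Any, Callable, Tuple


class ShapeGuard:
    """Route inputs with the traced shape to the optimised callable, others to a fallback."""

    def __init__(
        self,
        optimised: Callable[[Any], Any],
        fallback: Callable[[Any], Any],
        traced_shape: Tuple[int, ...],
    ) -> None:
        self.optimised = optimised
        self.fallback = fallback
        self.traced_shape = tuple(traced_shape)

    def __call__(self, x: Any) -> Any:
        # Only the exact shape used for tracing takes the optimised path.
        if tuple(x.shape) == self.traced_shape:
            return self.optimised(x)
        return self.fallback(x)


class FakeTensor:
    """Tiny stand-in for a tensor; only the shape matters here."""

    def __init__(self, shape: Tuple[int, ...]) -> None:
        self.shape = shape


guard = ShapeGuard(lambda x: "optimised", lambda x: "fallback", (1, 3, 224, 224))
print(guard(FakeTensor((1, 3, 224, 224))))  # optimised
print(guard(FakeTensor((8, 3, 224, 224))))  # fallback
```

In real use, the two callables would be the traced model and the original eager model.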

Measurements

Intel measured the inference speedup of several vision models on an AWS m7i.16xlarge instance, which uses 4th Gen Intel Xeon Scalable processors, using a fork of TorchBench to ensure the results are reproducible.

The baseline for comparison was torch.jit.optimize_for_inference, which only supports the Float32 datatype. The batch size for each model was the one TorchBench uses for that model.

Figure 2 shows how employing oneDNN Graph speeds up inference compared to PyTorch alone. For the Float32 datatype, the geomean speedup with oneDNN Graph was 1.24x, and for the BFloat16 datatype, it was 3.31x.

Figure 2: Inference speedup with oneDNN Graph (geomean of 1.24x for Float32 and 3.31x for BFloat16).
Image credit to Intel
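
A geomean speedup is the geometric mean of the per-model speedups, which keeps a single outlier model from dominating the summary. The sketch below shows the calculation only; the model names and latencies are made up for illustration and are not Intel’s measurements.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    if not values:
        raise ValueError("geomean needs at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up latencies in milliseconds: (baseline, optimised). Not real results.
latencies_ms = {
    "model_a": (10.0, 8.0),
    "model_b": (20.0, 10.0),
    "model_c": (30.0, 30.0),
}

# Speedup per model is baseline latency divided by optimised latency.
speedups = [baseline / optimised for baseline, optimised in latencies_ms.values()]
print(round(geomean(speedups), 3))  # 1.357
```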

Future work

Currently, oneDNN Graph is supported in PyTorch through TorchScript, but Intel is already working on integrating it with the Inductor-CPU backend as a prototype feature in a future PyTorch release. Dynamo makes it easier to support dynamic shapes in PyTorch, and Intel would like to introduce dynamic shape support with Inductor-CPU. Intel also intends to add support for int8 quantisation.

Acknowledgements 

The results shown in this blog are a joint effort between the Intel PyTorch team and Meta. The team would also like to thank Elias Ellison from Meta for taking the time to review the PRs thoroughly and provide helpful feedback.

What is oneDNN Graph?

oneDNN Graph is a high-level abstraction that optimises a full computational graph, which represents the series of steps needed in deep learning inference. Rather than optimising only individual operations (such as convolutions or matrix multiplications), oneDNN Graph optimises a neural network model’s whole structure, including intricate sequences of operations.

Thanks to this graph-based optimisation technique, oneDNN Graph can carry out more complex transformations, including:

  • Fusion of operations: combining several operations into one efficient operation to cut down on overhead.
  • Layout optimisations: rearranging memory layouts and data formats to improve cache use and data locality.
  • Scheduling optimisations: choosing the best order of operations to reduce latency and increase parallelism.

By applying these optimisations, oneDNN Graph can significantly speed up inference on x86-64 machines.
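
The idea behind operator fusion can be shown with a toy example in plain Python. This is a conceptual sketch only and says nothing about how oneDNN Graph implements fusion internally: the unfused version builds a full intermediate list between two operations, while the fused version computes the same result in a single pass.

```python
def bias_add(values, bias):
    """First operation: add a bias to every element."""
    return [v + bias for v in values]


def relu(values):
    """Second operation: clamp negative elements to zero."""
    return [max(v, 0.0) for v in values]


def unfused(values, bias):
    # Two passes over the data, with a full intermediate list in between.
    intermediate = bias_add(values, bias)
    return relu(intermediate)


def fused(values, bias):
    # One pass: each element is biased and clamped before moving on,
    # so no intermediate list is ever written.
    return [max(v + bias, 0.0) for v in values]


data = [-2.0, -0.5, 1.0, 3.0]
print(unfused(data, 1.0))  # [0.0, 0.5, 2.0, 4.0]
print(fused(data, 1.0))    # [0.0, 0.5, 2.0, 4.0]
```

On real hardware the saving comes from avoiding the round trip of the intermediate result through memory.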

Benefits of Using oneDNN Graph for Inference on x86-64

Effective Use of Hardware

x86-64 processors, particularly those manufactured by Intel, provide features such as large caches, many cores, and SIMD (Single Instruction, Multiple Data) instructions like AVX-512. oneDNN Graph’s optimisations are designed to make full use of these hardware characteristics, which enables better parallelisation and faster computation.

Graph-Level Optimisations

OneDNN Graph is capable of optimising complete computational graphs, in contrast to conventional deep learning libraries that concentrate on optimising individual operations. This results in faster execution and less memory utilisation, particularly for intricate models with numerous layers or components.

Hardware-agnostic Performance

oneDNN Graph is designed to run on Intel processors with varying capabilities (such as AVX2 and AVX-512). This makes it possible to scale inference performance across several Intel CPU generations without re-tuning the program.

Improved Memory Management

In deep learning inference, memory bandwidth and latency are frequently major obstacles. By utilising efficient data transfers and optimised memory layouts, oneDNN Graph improves memory bandwidth utilisation and minimises needless memory copies.

Multi-core Parallelism

oneDNN Graph makes it simple to use the many CPU cores commonly found in x86-64 platforms. On multi-core systems, it greatly increases throughput by distributing the workload effectively among the available cores.

Deep Integration with Current Ecosystems

TensorFlow, PyTorch, and Apache MXNet are just a few of the well-known deep learning frameworks that oneDNN is compatible with. These frameworks can automatically benefit from oneDNN’s optimisations by using oneDNN Graph, which makes it easier for users to design and deploy applications.

In conclusion

oneDNN Graph significantly improves inference speed on Intel x86-64 CPUs, especially when using the BFloat16 datatype on hardware that supports it. For developers running AI workloads on Intel platforms, this functionality matters because it is easy to integrate into PyTorch and may deliver even larger speedups through future work. It may not suit every use case, though: it currently supports only static shapes, so there is no speed gain for any input shape other than the one used for tracing.

Drakshi
Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She is a postgraduate in business administration and an Artificial Intelligence enthusiast.