PyTorch 2.7 On Intel GPUs: New Performance For AI Workflows

PyTorch 2.7 is now available, bringing substantial functionality and performance improvements across multiple platforms, including significant gains for Intel GPU architectures. The release comprises 3262 commits from 457 contributors since PyTorch 2.6 and aims to make AI research and development accessible on a broader range of hardware while streamlining AI workflows.

Enhancing Intel GPU Performance

Building on consistent advancements since PyTorch 2.4, PyTorch 2.7 places a strong emphasis on improving Intel GPU acceleration. The objective is to offer a consistent user experience and a single GPU programming paradigm across operating systems, including Windows, Linux, and Windows Subsystem for Linux (WSL2).

PyTorch 2.7 provides confirmed support for Intel GPUs in both eager mode and graph mode (torch.compile) on Windows and Linux. Many users may already have access to one of these Intel GPU products; the expanded support now includes:

  • Intel Arc A-Series and Intel Arc B-Series Graphics
  • Intel Core Ultra Processors with Intel Arc Graphics
  • Intel Core Ultra Mobile Processors (Series 2) with Intel Arc Graphics
  • Intel Data Center GPU Max Series
  • Intel Core Ultra Desktop Processors (Series 2) with Intel Arc Graphics

Installation has become easier thanks to torch-xpu PIP wheels and a simplified setup. Expanded eager mode support comes from high ATen operator coverage implemented with SYCL and oneDNN, ensuring better performance and functionality.
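As a quick sanity check after installation, the short snippet below verifies that PyTorch can see the XPU (Intel GPU) backend. The index URL in the comment is an assumption based on the published XPU wheels; check pytorch.org for the exact command for your platform.

```python
# Install command (assumed index URL; confirm on pytorch.org):
#   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
import torch

# Verify that the XPU (Intel GPU) backend is visible to this build of PyTorch.
print(torch.__version__)
print(torch.xpu.is_available())   # True when an Intel GPU and its drivers are detected
print(torch.xpu.device_count())   # number of Intel GPUs PyTorch can use
```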

The optimization of scaled dot-product attention (SDPA) inference performance with bfloat16 and float16 data types is one of the noteworthy performance enhancements in PyTorch 2.7 for Intel GPUs. This is specifically meant to speed up attention-based models. In eager mode, the new SDPA optimization for Stable Diffusion float16 inference achieves up to a 3x improvement over the PyTorch 2.6 release on Intel Arc B580 Graphics and on the Intel Core Ultra 7 258V with Intel Arc Graphics 140V.
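For context, the snippet below is a minimal sketch of the operation being optimized: eager-mode SDPA with float16 tensors on the XPU device. The tensor shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

device, dtype = "xpu", torch.float16  # bfloat16 is optimized as well
# Illustrative (batch, heads, sequence, head_dim) attention inputs.
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)

# Eager-mode SDPA; the fused kernels behind this call are what PyTorch 2.7 speeds up on Intel GPUs.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)
```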

Another important breakthrough is torch.compile for Intel GPUs on Windows 11. Intel GPUs are now the first accelerator to support torch.compile on Windows, a significant milestone. With this innovation, the Windows platform can now benefit from the performance of graph mode compilation, which was previously only accessible on Linux. Notable gains with torch.compile over eager mode are seen in both inference and training on Windows, as shown by performance results obtained on Intel Arc B580 Graphics with the PyTorch Dynamo Benchmarking Suite.
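As a sketch of the workflow described above, the following compiles a small model for the XPU device; the model and shapes are illustrative, and the same code runs on Windows or Linux.

```python
import torch
import torch.nn as nn

# Illustrative model; any nn.Module moved to "xpu" can be compiled the same way.
model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 10)).to("xpu").eval()
compiled = torch.compile(model)  # graph mode compilation for the Intel GPU backend

x = torch.randn(32, 512, device="xpu")
with torch.no_grad():
    y = compiled(x)  # the first call triggers compilation; later calls reuse the compiled graph
print(y.shape)
```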

Additional improvements for Intel GPUs include:

  • PyTorch 2 Export Post Training Quantization (PT2E) performance optimization, which offers full graph mode quantization pipelines with increased computational efficiency.
  • AOTInductor and torch.export are enabled on Linux to streamline deployment procedures.
  • Additional ATen operators to boost eager mode performance and improve operator execution continuity.
  • Profiler support on Linux and Windows to help developers analyze model performance (see the sketch after this list).
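As a sketch of the profiler support in the last bullet, the following profiles a single forward pass on an Intel GPU. ProfilerActivity.XPU is assumed to be available in builds with Intel GPU support; the model is illustrative.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).to("xpu")
x = torch.randn(64, 1024, device="xpu")

# Record both CPU-side and XPU-side activity for one forward pass.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    model(x)

# Print a summary table; XPU operator times appear alongside CPU times.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```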

Future work on Intel GPU support aims to achieve state-of-the-art PyTorch-native performance, particularly improving GEMM computational efficiency with torch.compile and enhancing LLM model performance with FlexAttention and lower-precision data types.

Additionally, efforts will concentrate on improving accelerator support across key components of the PyTorch ecosystem, such as torchao, torchtune, and torchtitan, and on enabling distributed XCCL backend support for the Intel Data Center GPU Max Series. Developers can monitor progress on GitHub and the PyTorch Dev Discussion forum.

For the Intel Core Ultra 7 258V and Intel Core Ultra 5 245KF with Intel Arc graphics, comprehensive setup specifications are given, including CPU information, GPU RAM, operating system versions, and driver versions.

Important Features of PyTorch 2.7

In addition to Intel GPU acceleration, PyTorch 2.7 adds a number of noteworthy features:

Support for the Blackwell GPU architecture from NVIDIA

PyTorch 2.7 supports NVIDIA’s new Blackwell architecture and ships pre-built wheels for CUDA 12.8 on Linux x86 and arm64. This required updating key components such as cuDNN, NCCL, and CUTLASS for compatibility.

Triton 3.3, which provides support for the Blackwell architecture with compatibility for torch.compile, is also included in the release. Users can install the CUDA 12.8 builds with a dedicated pip command.
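The snippet below sketches that install step and a quick check of the resulting build; the cu128 index URL in the comment is an assumption, so confirm the exact command on pytorch.org.

```python
# Install command (assumed index URL; confirm on pytorch.org):
#   pip install torch --index-url https://download.pytorch.org/whl/cu128
import torch

print(torch.version.cuda)        # expected to report 12.8 for these wheels
print(torch.cuda.is_available()) # True when a supported NVIDIA GPU and driver are present
```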

torch.compile support for Torch Function Modes

This feature has a beta designation. It lets users override any torch.* operation to implement custom user-defined behaviour, such as rewriting operations to accommodate a particular backend. FlexAttention uses this to rewrite indexing operations.
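A minimal sketch of the idea, assuming the public torch.overrides.TorchFunctionMode API: the mode below rewrites torch.add calls, and the rewrite also applies inside a torch.compile'd function. The mode and function names are illustrative.

```python
import torch
from torch.overrides import TorchFunctionMode

# Illustrative mode that rewrites torch.add into torch.sub, standing in for a
# backend-specific rewrite such as the indexing rewrites used by FlexAttention.
class RewriteAdd(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.add:
            return torch.sub(*args, **kwargs)
        return func(*args, **kwargs)

@torch.compile
def f(x, y):
    return torch.add(x, y)

with RewriteAdd():
    out = f(torch.ones(3), torch.ones(3))  # add is rewritten to sub, so the result is zeros
print(out)
```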

Mega Cache

Mega Cache, also a beta feature, offers end-to-end portable caching for torch.compile. After compiling and running a model, users can save the compiler artefacts and load them later, possibly on a different machine, to pre-populate torch.compile caches and speed up subsequent compilations.
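A minimal sketch, assuming the torch.compiler.save_cache_artifacts and torch.compiler.load_cache_artifacts entry points associated with Mega Cache; the model, file name, and workflow are illustrative.

```python
import torch

@torch.compile
def f(x):
    return (x * x).relu()

f(torch.randn(8, 8))  # compile and run once so the caches are populated

# Serialize the compiler artefacts (assumed API; returns None if nothing was cached).
artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    blob, cache_info = artifacts
    with open("compile_cache.bin", "wb") as fh:
        fh.write(blob)

# Later, possibly on a different machine: pre-populate the torch.compile caches.
with open("compile_cache.bin", "rb") as fh:
    torch.compiler.load_cache_artifacts(fh.read())
```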

Native Context Parallel in PyTorch

The PyTorch Context Parallel API, first released as a prototype feature, enables users to create a Python context in which every call to torch.nn.functional.scaled_dot_product_attention() runs with context parallelism. It currently supports the cuDNN attention, Efficient attention, and Flash attention backends. TorchTitan’s Context Parallel approach for LLM training makes use of this functionality.
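A minimal sketch of the API, assuming a distributed job launched with torchrun on CUDA devices; the mesh size, tensor shapes, and dtype are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel

torch.distributed.init_process_group(backend="nccl")
rank = torch.distributed.get_rank()
torch.cuda.set_device(rank)
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))

# Illustrative (batch, heads, sequence, head_dim) attention inputs.
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Inside this context, SDPA runs with context parallelism: the sequence dimension
# (dim 2) of the listed buffers is sharded across the ranks in the mesh.
with context_parallel(mesh, buffers=[q, k, v], buffer_seq_dims=[2, 2, 2]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```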

FlexAttention enhancements

In PyTorch 2.7, FlexAttention, which was first included in PyTorch 2.5.0 to allow researchers to modify attention kernels without writing kernel code, sees a number of enhancements. These consist of:

LLM first token processing on x86 CPUs

This release builds on PyTorch 2.6’s support for x86 CPUs by adding the attention variants that are essential for the first token processing phase of LLM inference. It provides a more seamless experience when using FlexAttention on x86 CPUs, replacing specific scaled_dot_product_attention operators with a single FlexAttention API while delivering strong performance with torch.compile.
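A minimal sketch of FlexAttention with a causal mask on an x86 CPU, the typical pattern for prefill (first token) processing; the shapes and mask function are illustrative.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    # Standard causal masking: a query position may only attend to earlier positions.
    return q_idx >= kv_idx

B, H, S, D = 1, 8, 1024, 64
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

block_mask = create_block_mask(causal, B, H, S, S, device="cpu")
compiled_flex = torch.compile(flex_attention)  # torch.compile provides the performance
out = compiled_flex(q, k, v, block_mask=block_mask)
print(out.shape)
```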

LLM throughput mode optimisation for x86 CPUs

A new C++ micro-GEMM template capability addresses PyTorch 2.6’s bottlenecks for large batch sizes, improving performance for LLM inference throughput scenarios on x86 CPUs. Users get better performance and a more seamless experience when using the FlexAttention API with torch.compile for LLM throughput on x86 CPUs.

FlexAttention for inference

This release adds a decoding backend that supports PagedAttention and GQA and is specifically optimized for inference. Updated features are also included, such as support for trainable biases, performance tuning guidance, and nested jagged tensors.

Foreach Map

Foreach Map is a prototype feature that uses torch.compile to let users apply pointwise or user-defined functions (such as torch.add) to lists of tensors, similar to the existing torch._foreach_* operations.

  • Its advantages include the ability to lift user-defined Python functions and to handle lists of tensors or a mix of scalars as arguments. For best performance, torch.compile automatically generates a horizontally fused kernel, as the sketch below illustrates.
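The sketch below illustrates the idea rather than the exact prototype entry point: a user-defined pointwise function applied across lists of tensors under torch.compile, which Inductor can fuse into a horizontal kernel. The function names are illustrative.

```python
import torch

def scaled_add(a, b, alpha=0.1):
    # User-defined pointwise function applied element-wise to each pair of tensors.
    return a + alpha * b

@torch.compile
def apply_to_lists(xs, ys):
    # Applying the same pointwise function across lists is the pattern Foreach Map
    # generalizes; under torch.compile the per-tensor work can be fused horizontally.
    return [scaled_add(x, y) for x, y in zip(xs, ys)]

xs = [torch.randn(128, 128) for _ in range(4)]
ys = [torch.randn(128, 128) for _ in range(4)]
outs = apply_to_lists(xs, ys)
print(len(outs), outs[0].shape)
```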

Inductor Prologue Fusion Support

Prologue fusion is another prototype feature that optimizes matrix multiplication (matmul) by fusing pre-matmul operations into the matmul kernel itself. This reduces global memory traffic, which enhances performance.
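A minimal sketch of the kind of pattern prologue fusion targets, assuming Inductor's max-autotune compile mode; whether the prologue is actually fused into the matmul kernel depends on the backend and its heuristics.

```python
import torch

@torch.compile(mode="max-autotune")
def cast_then_matmul(a, b):
    # The cast and scale form a prologue feeding the matmul; prologue fusion can
    # move this work into the generated matmul kernel instead of a separate pass.
    return (a.to(torch.bfloat16) * 0.5) @ b

a = torch.randn(512, 512)
b = torch.randn(512, 512, dtype=torch.bfloat16)
out = cast_then_matmul(a, b)
print(out.dtype, out.shape)
```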

In Conclusion

The release overview emphasizes how far upstream efforts on Intel GPUs have come since PyTorch 2.4 and how new features in PyTorch 2.7 have accelerated the performance of AI workloads on a variety of Intel GPUs. The performance benefits of torch.compile on Windows and the notable speed improvements seen for Stable Diffusion inference on various Intel Arc and Core Ultra setups using SDPA optimization are highlighted in particular.
