Saturday, March 15, 2025

Intel VTune Profiler For Data Parallel Python Applications

Intel VTune Profiler tutorial

This brief tutorial will show you how to use Intel VTune Profiler to profile the performance of a Python application using the NumPy and Numba example applications.

Analysing Performance in Applications and Systems

For HPC, cloud, IoT, media, storage, and other applications, Intel VTune Profiler optimises system performance, application performance, and system configuration.

  • Optimise the performance of the entire application, not just the accelerated parts, across the CPU, GPU, and FPGA.
  • Profile SYCL, C, C++, C#, Fortran, OpenCL, Python, Google Go, Java, .NET, assembly, or any combination of these languages.
  • Application or system: obtain detailed results mapped to source code, or coarse-grained system data over a longer time period.
  • Power: Maximise efficiency without resorting to thermal or power-related throttling.

VTune platform profiler

It has the following features.

Optimisation of Algorithms

  • Find your code’s “hot spots,” or the sections that take the longest.
  • Use Flame Graph to see hot code paths and the amount of time spent in each function and its callees.

Bottlenecks in Microarchitecture and Memory

  • Use microarchitecture exploration analysis to pinpoint the major hardware problems affecting your application’s performance.
  • Identify memory-access-related concerns, such as cache misses and high-bandwidth bottlenecks.

Accelerators and XPUs

  • Improve data transfers and GPU offload schema for SYCL, OpenCL, Microsoft DirectX, or OpenMP offload code. Identify the longest-running GPU kernels for further optimisation.
  • Examine GPU-bound programs for inefficient kernel algorithms or microarchitectural restrictions that may be causing performance problems.
  • Examine FPGA utilisation and the interactions between CPU and FPGA.
  • Technical summary: Determine the most time-consuming operations that are executing on the neural processing unit (NPU) and learn how much data is exchanged between the NPU and DDR memory.

Parallelism

  • Check the threading efficiency of the code. Determine which threading problems are affecting performance.
  • Examine compute-intensive or throughput HPC programs to determine how well they utilise memory, vectorisation, and the CPU.

I/O and Platform

  • Find the points in I/O-intensive applications where performance is stalled. Examine the hardware’s ability to handle I/O traffic produced by integrated accelerators or external PCIe devices.
  • Use System Overview to get a detailed overview of short-term workloads.

Multiple Nodes

  • Describe the performance characteristics of large-scale Message Passing Interface (MPI) and OpenMP workloads.
  • Determine any scalability problems and receive suggestions for a thorough investigation.

Intel VTune Profiler

  • To improve Python performance on Intel systems, install and use the Intel Distribution for Python and the Data Parallel Extensions for Python with your applications.
  • Configure your VTune Profiler setup for Python.
  • To find performance issues and areas for improvement, profile three distinct implementations of a Python application. This article uses a NumPy example of the pairwise distance calculation, an algorithm commonly used in machine learning and data analytics.

The following packages are used by the three distinct implementations.

  • Intel Optimised NumPy
  • Data Parallel Extension for NumPy (dpnp)
  • Data Parallel Extension for Numba (numba-dpex) on GPU

NumPy and Data Parallel Extension for Python

By providing optimised heterogeneous computing, the Intel Distribution for Python and the Intel Data Parallel Extension for Python offer a straightforward way to develop high-performance machine learning (ML) and scientific applications.

The Intel Distribution for Python adds:

  • Scalability across laptops, PCs, and powerful servers, utilising every available CPU core.
  • Support for the most recent Intel CPU instruction sets.
  • Near-native performance by accelerating core numerical and machine learning packages with libraries such as the Intel oneAPI Math Kernel Library (oneMKL) and the Intel oneAPI Data Analytics Library (oneDAL); a quick verification sketch follows this list.
  • Productivity tools for compiling Python code into optimised instructions.
  • Essential Python bindings that make it easier to integrate Intel native tools into your Python project.
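
As a quick sanity check that the accelerated math libraries are in use, you can inspect the build configuration of the installed NumPy. This is a minimal sketch, assuming an MKL-backed NumPy from the Intel Distribution for Python; the exact output depends on your build.

   import numpy as np

   # An MKL-backed NumPy build typically reports the mkl libraries here.
   np.show_config()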

Three core packages make up the Data Parallel Extensions for Python:

  • The Data Parallel Extension for NumPy (dpnp)
  • The Data Parallel Extension for Numba (numba_dpex)
  • The Data Parallel Control library (dpctl), which provides tensor data structure support, device selection, data allocation on devices, and user-defined data parallel extensions for Python. A short sketch of how these packages fit together follows this list.
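
As a small illustration, the sketch below uses dpctl to report the default SYCL device and dpnp to allocate and compute an array on it. It assumes dpnp, dpctl, and a supported device driver are installed; treat it as a sketch rather than this article's exact code.

   import dpctl
   import dpnp

   # Report the SYCL device that dpnp arrays are placed on by default.
   device = dpctl.select_default_device()
   print("Default device:", device.name)

   # dpnp mirrors the NumPy API, but the arrays live on the selected device.
   x = dpnp.arange(10.0)
   y = dpnp.sqrt(x * x + 1.0)
   print(y)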

To promptly identify and resolve unanticipated performance problems in machine learning (ML), artificial intelligence (AI), and other scientific workloads, it is best to obtain insights into compute and memory bottlenecks through comprehensive source-level analysis. Intel VTune Profiler provides this for Python-based ML and AI programs as well as for C/C++ code. The methods for profiling these kinds of Python applications are the main topic of this article.

Intel VTune Profiler is a sophisticated tool that helps developers find the source lines causing performance loss and replace them with calls into the highly optimised Intel Optimised NumPy and Data Parallel Extension for Python libraries.

Setting up and Installing

1. Install Intel Distribution for Python

2. Create a Python Virtual Environment

   python -m venv pyenv

   source pyenv/bin/activate      # Linux (the reference configuration below); on Windows: pyenv\Scripts\activate

3. Install Python packages

   pip install numpy

   pip install dpnp

   pip install numba

   pip install numba-dpex

   pip install pyitt
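
A quick import check such as the sketch below confirms that the packages installed correctly and reports their versions (module names as used in this article; your versions may differ from the reference configuration listed next).

   import numpy, dpnp, numba, numba_dpex, pyitt

   # Print the version of each core package used in this tutorial.
   for module in (numpy, dpnp, numba, numba_dpex):
       print(module.__name__, module.__version__)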

Reference Configuration

The hardware and software components used for the reference example code are:

Software Components:

  • dpnp 0.14.0+189.gfcddad2474
  • mkl-fft 1.3.8
  • mkl-random 1.2.4
  • mkl-service 2.4.0
  • mkl-umath 0.1.1
  • numba 0.59.0
  • numba-dpex 0.21.4
  • numpy 1.26.4
  • pyitt 1.1.0

Operating System:

  • Linux, Ubuntu 22.04.3 LTS

CPU:

  • Intel Xeon Platinum 8480+

GPU:

  • Intel Data Center GPU Max 1550

The NumPy Example Application

This article demonstrates, step by step, how to use Intel VTune Profiler and its Instrumentation and Tracing Technology (ITT) API to optimise a NumPy application. The example is the pairwise distance application, a popular algorithm in fields including biology, high performance computing (HPC), machine learning, and geographic data analytics.
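
To make the example concrete, here is one common vectorised NumPy formulation of pairwise Euclidean distances. It is a sketch of the kind of code profiled in this article, not necessarily the exact implementation; the function name and array sizes are illustrative.

   import numpy as np

   def pairwise_distance(data):
       # data: (n_samples, n_features) array.
       # Broadcasting yields an (n, n, n_features) array of row differences,
       # which reduces to an (n, n) matrix of Euclidean distances.
       diff = data[:, np.newaxis, :] - data[np.newaxis, :, :]
       return np.sqrt((diff ** 2).sum(axis=-1))

   data = np.random.rand(4096, 3)
   distances = pairwise_distance(data)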

Summary

The three stages of optimisation that we will discuss in this post are summarised as follows:

Step 1: Examining the Intel Optimised NumPy pairwise distance implementation: here we attempt to understand the obstacles limiting the NumPy implementation's performance.

Step 2: Profiling the Data Parallel Extension for NumPy (dpnp) pairwise distance implementation: we examine this implementation and check whether there is a performance disparity.
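
Because dpnp mirrors the NumPy API, the Step 2 variant can be as simple as swapping the import so that the same array expressions execute on the default SYCL device. This is a minimal sketch under that assumption; the article's actual implementation may differ.

   import dpnp as np  # drop-in replacement for NumPy; arrays live on the SYCL device

   def pairwise_distance(data):
       # Same broadcasting formulation as the NumPy version, now executed via dpnp.
       diff = data[:, np.newaxis, :] - data[np.newaxis, :, :]
       return np.sqrt((diff ** 2).sum(axis=-1))

   data = np.random.rand(4096, 3)
   distances = pairwise_distance(data)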

Step 3: Profiling the Data Parallel Extension for Numba (numba-dpex) pairwise distance implementation on the GPU: analysing the numba-dpex implementation's GPU performance.
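
For Step 3, a numba-dpex kernel expresses the same computation with one work-item per pair of rows. The sketch below follows the kernel style of the numba-dpex 0.21 release listed in the reference configuration; decorator and launch syntax vary between numba-dpex versions, so treat the API details as assumptions rather than the article's exact code.

   import math
   import dpnp
   import numba_dpex as dpex

   @dpex.kernel
   def pairwise_distance_kernel(data, distances):
       # One work-item per (i, j) pair of rows in `data`.
       i = dpex.get_global_id(0)
       j = dpex.get_global_id(1)
       total = 0.0
       for k in range(data.shape[1]):
           d = data[i, k] - data[j, k]
           total += d * d
       distances[i, j] = math.sqrt(total)

   n = 1024
   data = dpnp.random.rand(n, 3)
   distances = dpnp.zeros((n, n))
   # Launch over an n x n global range (numba-dpex 0.21-style launch syntax).
   pairwise_distance_kernel[dpex.Range(n, n)](data, distances)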

Boost Your Python NumPy Application

This article has shown how to quickly discover compute and memory bottlenecks in a Python application using Intel VTune Profiler.

  • Intel VTune Profiler helps identify the root causes of bottlenecks and strategies for enhancing application performance.
  • It can map the main bottleneck tasks to the source code/assembly level and display the related CPU/GPU time.
  • Even more comprehensive, developer-friendly profiling results can be obtained by using the Instrumentation and Tracing Technology (ITT) APIs, as shown in the sketch below.
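
For instance, the pyitt package installed earlier exposes the ITT API to Python, so regions of interest can be labelled and then appear as named tasks on the VTune timeline. The sketch below follows pyitt's documented decorator usage; treat the exact API surface as an assumption if your pyitt version differs.

   import numpy as np
   import pyitt  # Python bindings for the Intel ITT API

   @pyitt.task  # marks every call to this function as a named ITT task in VTune
   def pairwise_distance(data):
       diff = data[:, np.newaxis, :] - data[np.newaxis, :, :]
       return np.sqrt((diff ** 2).sum(axis=-1))

   data = np.random.rand(2048, 3)
   distances = pairwise_distance(data)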