Improve AI/ML Application Performance with Intel VTune Profiler.
Find out how to profile Data Parallel Python and OpenVINO workloads using the oneAPI-powered tool. The use of AI and ML is rising in healthcare and life sciences, marketing and finance, manufacturing, robotics, driverless automobiles, smart cities, and more. ML workloads in real-world fields employ deep learning frameworks like PyTorch, TensorFlow, Keras, and others.
Through the “write once, deploy anywhere” approach, other developer tools, such as the OpenVINO Toolkit, also help to expedite AI research on the newest hardware architectures in fields like computer vision and generative artificial intelligence (GenAI). The goal of the open source OpenVINO Toolkit, which was first released in 2018, has been to speed up AI inference with reduced latency and increased throughput while preserving accuracy, minimizing model footprint, and maximizing hardware utilization.
It is challenging to locate and examine performance bottlenecks in the underlying source code because to the intricate structure of deep learning models, which include numerous layers and non-linear functions. ML frameworks like PyTorch and TensorFlow provide native tools and profiling APIs for tracking and evaluating performance metrics at various phases of model construction.
These approaches, however, are only applicable to software functionality. This problem is addressed by the Intel VTune Profiler, which is driven by the oneAPI and offers comprehensive insights into hardware-level memory and compute bottlenecks. By doing this, performance problems are resolved and AI applications’ performance is optimized and scaled across hardware systems with different computational envelopes.
The scope of optimization for AI/ML workloads may be expanded by using Intel VTune Profiler to profile data in concurrent Python and OpenVINO programs, as you will discover in this article.
Use Intel VTune Profiler to Boost Python Application Performance
VTune Profiler may assist in profiling a Python program, as shown in a recently published recipe in the Intel VTune Profiler Cookbook.
The following are the recipe’s fundamental software requirements:
- Data Parallel Extensions for Python
- Vtune Profiler (version 2022 or later)
- Intel Distribution for Python
- Compiler for Intel OneAPI DPC++/C++
The NumPy implementation covered in the recipe divides the calculations into logical jobs using the Intel Instrumentation and Tracing Technology (ITT) APIs and performs distance computations using the Intel oneAPI Math Kernel Library (oneMKL) routines. You may then determine which areas of the code need attention for necessary changes to get additional performance by using the VTune Profiler tool to examine the execution time and memory consumption of each logical job.
Details on the most CPU-intensive code segments are included in the output analysis report when Hotspots analysis is performed on the NumPy implementation. Additionally, it offers recommendations for investigating the profiler tool’s other performance analysis features, such Threading analysis for enhanced parallelism and Microarchitecture Exploration analysis for effective use of the underlying hardware.
Use the Data Parallel Extension for NumPy and Numba to Address Performance Bottlenecks
According to the Hotspots analysis report, NumPy operations and underlying oneMKL functions account for a significant amount of the execution time in the simple NumPy implementation of the pairwise distance computation example. By making little code modifications, NumPy may be swapped out for the Data Parallel Extension for NumPy, which will eliminate these bottlenecks. To evaluate the speed gains over the simple NumPy code and find any areas that might need further optimization, run the Hotspots analysis once again.
Additionally, the VTune Profiler makes recommendations such as using the Data Parallel Extension for Numba with your platform’s GPU to bring offload accelerator parallelism to the application. The Numba JIT compiler for NumPy operations has an open-source extension called Numba. It offers Python kernel programming APIs that resemble SYCL. The GPU Compute/Media Hotspots analysis preview function of VTune Profiler may then be used to examine the Numba implementation’s execution on a GPU.
Use Intel VTune Profiler to Examine OpenVINO Applications’ Performance
Using the VTune Profiler to profile OpenVINO-based AI applications is covered in another new recipe in the VTune Profiler cookbook. It discusses how to use the profiler tool to analyze performance bottlenecks in the CPU, GPU, and Neural Processing Unit (NPU).
If your OpenVINO application makes use of the Intel oneAPI Data Analytics Library (oneDAL) and/or the Intel oneAPI Deep Neural Network (oneDNN) Intel Distribution for Python Intel VTune Profiler (v2024.1 or later), you can access the Benchmark Tool application as part of the OpenVINO Toolkit Intel oneAPI Base Toolkit.
The recipe offers detailed steps for configuring OpenVINO with the ITT APIs for performance analysis, building the OpenVINO source, and setting it up. It profiles the AI application and analyzes performance and latency using a reference benchmark application.
Depending on the computational architecture, you may use the VTune Profiler‘s numerous performance analysis features to find hotspots and look at how much hardware is being utilized by specific code sections.
For example,
- To examine CPU bottlenecks that is, the sections of code that take up the most CPU execution time use the Hotspots Analysis tool.
- Use the GPU Compute/Media Hotspots Analysis preview function to profile GPU hotspots. Examine inefficient kernel methods, examine the frequency of GPU instructions for various kinds of instructions, and more to get an understanding of GPU use.
- The AI PCs’ Neural Processing Units (NPUs) are made especially to boost AI/ML applications’ performance. With the Intel Distribution of OpenVINO Toolkit, you may transfer compute-intensive AI/ML tasks to Intel NPUs. You may examine the NPU performance using a number of hardware measures, including workload size, execution time, sampling interval, and more, with the use of the VTune Profiler‘s NPU Exploration Analysis preview function.
Intel VTune Profiler Download
Use one of the following methods to install Intel VTune Profiler on your computer:
- Get the standalone bundle here.
- As part of the Intel oneAPI Base Toolkit, download Intel VTune Profiler.
Know the Process
In the standard software performance analysis process,
Use one of the following methods to launch Intel VTune Profiler:
- Using Microsoft Visual Studio’s GUI From the Command Line
- Configure parameters and choose a profiling analysis for your application.
- Create a profile of the target system (remote collection) or application on the host (local collection).
- View the host system’s findings.
- Identify bottlenecks and address them over a number of cycles until you reach a desirable level of performance.
FAQs
What does Intel VTune do?
Examine CPU usage, OpenMP efficiency, memory access, and vectorization to assess application performance. Measure metrics to find memory access concerns.