Offload Standard Parallel C++ Code to SYCL Devices Using the Intel oneAPI DPC++ Library (oneDPL)
Enhance C++ Parallel STL algorithms with cross-platform parallel computing capabilities.
The Parallel Standard Template Library, often called Parallel STL or pSTL, lets C++ standard algorithms be executed in parallel and vectorized.
By offloading Parallel STL algorithms to devices (CPUs or GPUs) that support the SYCL programming framework, you can leverage SYCL's cross-platform parallelism and the computational power of heterogeneous architectures to improve application performance. The Intel oneAPI DPC++ Library (oneDPL) makes this possible by letting you offload Parallel STL code to SYCL devices, enabling accelerated, multiarchitecture parallel programming across heterogeneous hardware.
The code example discussed in this article shows how to offload C++ Parallel STL code to a SYCL device using the oneDPL pSTL_offload preview feature.
Parallel API
The Parallel API in the Intel oneAPI DPC++ Library (oneDPL) implements the C++ standard algorithms with execution policies, as specified in ISO/IEC 14882:2017 (commonly known as C++17) and C++20. It provides threaded and SIMD execution of these algorithms on Intel processors, built on top of oneTBB and OpenMP, as well as data-parallel execution on accelerators supported by SYCL in the Intel oneAPI DPC++/C++ Compiler.
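As a concrete illustration, the minimal sketch below (not taken from the article's code sample) calls a oneDPL algorithm with the predefined device execution policy oneapi::dpl::execution::dpcpp_default so that it runs as a SYCL kernel on the default device; it assumes a working SYCL device and oneDPL installation, and omits error handling.

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/numeric>
#include <vector>
#include <numeric>
#include <iostream>

int main() {
    std::vector<int> data(1'000'000);
    std::iota(data.begin(), data.end(), 0);   // 0, 1, 2, ...

    // dpcpp_default runs the algorithm as a SYCL kernel on the default device;
    // host data passed via ordinary iterators is transferred automatically.
    long long sum = oneapi::dpl::reduce(oneapi::dpl::execution::dpcpp_default,
                                        data.begin(), data.end(), 0LL);

    std::cout << sum << '\n';
    return 0;
}
```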
The Parallel API also offers parallel range algorithms that accept an execution policy, extending the capabilities of the range algorithms introduced in C++20.
Furthermore, oneDPL offers specialized variations of several algorithms, such as the ones below (a short example of segmented reduction follows the list):
- Segmented reduction
- Segmented scan
- Vectorized search algorithms
- Key-value pair sorting
- Conditional transformation
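For instance, segmented reduction sums consecutive values that share the same key. The sketch below uses oneapi::dpl::reduce_by_segment with a host execution policy; it is an illustrative example rather than code from the sample.

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> keys   {1, 1, 2, 2, 2, 3};   // segment labels
    std::vector<int> values {10, 20, 1, 2, 3, 7}; // values to reduce per segment
    std::vector<int> out_keys(keys.size());
    std::vector<int> out_sums(values.size());

    // Sums consecutive values with equal keys; returns iterators past the
    // last written key and sum.
    auto ends = oneapi::dpl::reduce_by_segment(
        oneapi::dpl::execution::par_unseq,
        keys.begin(), keys.end(), values.begin(),
        out_keys.begin(), out_sums.begin());

    // Expected result: keys {1, 2, 3} with sums {30, 6, 7}
    for (auto it = out_sums.begin(); it != ends.second; ++it)
        std::cout << *it << ' ';
    std::cout << '\n';
    return 0;
}
```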
The utility API includes iterators and function object classes. The iterators include counting and discard iterators, as well as permutation, zip, and transform iterators that operate on other iterators. The function object classes provide identity, minimum, and maximum operations that can be passed to reduction or transform algorithms.
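The short sketch below (an illustration, not part of the sample) combines a counting iterator with the maximum function object; the <oneapi/dpl/...> headers shown are assumed to match a current oneDPL installation.

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/iterator>
#include <oneapi/dpl/functional>
#include <iostream>

int main() {
    // Indices 0..9 generated on the fly, no container needed.
    oneapi::dpl::counting_iterator<int> first(0), last(10);

    // Largest squared index, using oneDPL's maximum function object
    // as the reduction operation.
    int max_sq = oneapi::dpl::transform_reduce(
        oneapi::dpl::execution::par_unseq,
        first, last, 0,
        oneapi::dpl::maximum<int>{},
        [](int i) { return i * i; });

    std::cout << max_sq << '\n';   // prints 81
    return 0;
}
```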
An experimental implementation of asynchronous algorithms is also included in oneDPL.
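A rough sketch of how the experimental asynchronous API can be used is shown below; the oneapi::dpl::experimental::reduce_async name, the <oneapi/dpl/async> header, and the returned future-like object reflect the experimental interface and may change between oneDPL releases.

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/async>
#include <functional>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> data(1'000, 1);

    // Launch the reduction without blocking the calling thread
    // (experimental API; subject to change).
    auto fut = oneapi::dpl::experimental::reduce_async(
        oneapi::dpl::execution::dpcpp_default,
        data.begin(), data.end(), 0, std::plus<int>{});

    // ... other host work could overlap with the device computation here ...

    std::cout << fut.get() << '\n';   // waits for completion; prints 1000
    return 0;
}
```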
Intel oneAPI DPC++ Library (oneDPL): An Overview
When used with the Intel oneAPI DPC++/C++ Compiler, oneDPL accelerates SYCL kernels for parallel programming on a variety of hardware accelerators and architectures. Its Parallel API, which offers range-based algorithms, execution policies, and parallel extensions of the C++ STL algorithms, lets C++ STL-style programs execute efficiently in parallel on multicore CPUs and be offloaded to GPUs.
It supports parallel computing libraries that developers are already familiar with, such as Parallel STL and Boost.Compute. Its SYCL-specific API helps accelerate SYCL kernels on GPUs. In addition, you can use oneDPL's Device Selection API to dynamically assign available computing resources to your workload according to predefined device execution policies.
The library also integrates easily with the Intel DPC++ Compatibility Tool and its open source counterpart, SYCLomatic, for straightforward, automated CUDA-to-SYCL code migration and multiarchitecture programming free from vendor lock-in.
About the Code Sample
With just a few code modifications, the pSTL offload code example demonstrates how to offload common C++ parallel algorithms to SYCL devices (CPUs and GPUs). It exercises an experimental oneDPL feature enabled through the -fsycl-pstl-offload option of the Intel oneAPI DPC++/C++ Compiler.
To perform data parallel computations on heterogeneous devices, the oneDPL Parallel API supports the following standard execution policies (illustrated in the sketch after this list):
- unseq for vectorized (SIMD) execution
- par for parallel (multithreaded) execution
- par_unseq, which combines the effects of the par and unseq policies
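The snippet below is a minimal sketch, written in the spirit of the ParSTLTests sub-sample rather than copied from it, showing standard STL algorithms invoked with each of the three policies (note that std::execution::unseq requires C++20).

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<float> v(1 << 20, 1.0f);

    // unseq: vectorized (SIMD) execution on the calling thread.
    std::for_each(std::execution::unseq, v.begin(), v.end(),
                  [](float& x) { x *= 2.0f; });

    // par: parallel execution across threads.
    std::sort(std::execution::par, v.begin(), v.end());

    // par_unseq: parallel and vectorized; with pSTL offload enabled, this is
    // the policy that oneDPL can redirect to a SYCL device.
    float sum = std::reduce(std::execution::par_unseq,
                            v.begin(), v.end(), 0.0f);

    return sum > 0.0f ? 0 : 1;
}
```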
The following three programs/sub-samples make up the code sample:
- FileWordCount counts the words in a file using C++17 parallel algorithms,
- WordCount counts how many words are generated, also using C++17 parallel algorithms, and
- ParSTLTests implements various STL algorithms with the execution policies mentioned above (unseq, par, and par_unseq).
The code example shows how to use the -fsycl-pstl-offload compiler option and standard header inclusion in the existing code to automatically offload STL algorithms invoked with the std::execution::par_unseq policy to a selected SYCL device, as in the sketch below.
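As a hedged illustration of what such code might look like (not the sample's actual source), the following program uses only standard C++ with std::execution::par_unseq. The compile line and environment variable in the comments are assumptions based on the option and variable named in this article; their exact spelling or accepted values may differ by toolchain version.

```cpp
// Assumed compile line (exact option values may vary by compiler release):
//   icpx -fsycl -fsycl-pstl-offload=gpu offload_sample.cpp -o offload_sample
// Assumed run line restricting execution to a Level Zero GPU device:
//   ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./offload_sample
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(10'000'000);
    std::iota(v.begin(), v.end(), 0);

    // With pSTL offload enabled, this par_unseq call can be redirected to the
    // selected SYCL device instead of running on host threads.
    long long sum = std::transform_reduce(
        std::execution::par_unseq, v.begin(), v.end(), 0LL,
        std::plus<long long>{},
        [](int x) { return static_cast<long long>(x); });

    return sum > 0 ? 0 : 1;
}
```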
You can offload your SYCL or OpenMP code to a specialized computing resource or accelerator (such as a CPU, GPU, or FPGA) by using the device selection environment variables offered by the oneAPI programming model. One such environment variable is ONEAPI_DEVICE_SELECTOR, which restricts the devices chosen from among all the compute resources available to run SYCL- and OpenMP-based applications. The variable also allows sub-devices to be selected as separate execution devices.
The code example demonstrates how to use the ONEAPI_DEVICE_SELECTOR variable to offload the code to a selected target device; the offloaded code is then implemented with oneDPL. If no target device is specified with the pSTL offload compiler option, the code is offloaded to the default SYCL device.
The example demonstrates offloading STL code to an Intel Xeon CPU and an Intel Data Center GPU Max Series device; however, C++ STL code can be offloaded to any SYCL device in the same way.
What Comes Next?
Get started with oneDPL and explore the oneDPL code samples today to speed up SYCL kernels on the latest CPUs, GPUs, and other accelerators!
We also encourage you to explore other AI and HPC tools built on the unified oneAPI programming model for accelerated, multiarchitecture, high-performance parallel computing.