Intel FPGAs speed up databases with oneAPI and SIMD instructions

December 14, 2023


Single Instruction, Multiple Data (SIMD) is a state-of-the-art technique for improving single-threaded CPU performance.

FPGAs are known for high-performance computing through circuits customized for a specific algorithm. This tailored, optimized hardware accelerates demanding computations.

SIMD and FPGAs may seem unrelated, but this article shows how well they fit together. Because FPGAs support data-parallel processing, they can execute SIMD-style workloads with high throughput. For many computationally intensive tasks, the combination of FPGA flexibility and SIMD efficiency is appealing.

High-performance SIMDified programming

SIMD is a form of parallel processing that applies a single instruction to multiple data elements. Special hardware extensions execute the same instruction on several data elements simultaneously.

SIMDified processing exploits data independence to improve application performance: the application code is rewritten to use SIMD instructions extensively.

Key advantages of SIMDified processing include:

Increased performance: SIMDified processing significantly speeds up computationally intensive software applications.

Integrability: Intrinsics and dedicated data types make SIMDified processing easy to integrate into application code.

Availability: SIMD extensions are available on many current processors, making SIMDified processing a viable option for improving computational speed.

Despite its benefits, SIMDified processing is not suitable for every application. Applications with little data parallelism will not benefit from it. For data-intensive software applications, however, it is a compelling way to improve performance.

SIMD Portability Supports Heterogeneity

A SIMD instruction set consists of SIMD registers and the instructions that operate on them. For low-level programming, SIMD intrinsics in C/C++ are the usual route to the best performance.

Low-level programming in heterogeneous settings with different hardware platforms, operating systems, architectures, and technologies is difficult because hardware capabilities, degrees of data parallelism, and naming conventions differ from platform to platform.

Specialized implementations limit portability between platforms, so SIMD abstraction libraries provide a common SIMD interface that abstracts the underlying SIMD functions. These libraries use C++ template metaprogramming and function template specializations to map each call to the matching SIMD intrinsics, and to workarounds that must be implemented where a platform lacks a function.

These C/C++ libraries let developers write SIMD-hardware-oblivious application code and SIMD-extension-specific code separately, with minimal overhead. Separating the two with a SIMD abstraction library simplifies both sides.
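The separation described above can be sketched in a few lines of portable C++. This is a minimal illustration, not the interface of Highway, xsimd, or TSL: the names `scalar_backend`, `wide_backend`, `simd`, and `add_arrays` are hypothetical, the "wide" backend only emulates an N-lane register with `std::array`, and class template specializations stand in for the dispatch to real intrinsics.

```cpp
#include <array>
#include <cstddef>

// Hypothetical backend tags. A real library would define one per
// instruction set (for example SSE, AVX-512, NEON).
struct scalar_backend {};                              // one lane, plain C++
template <std::size_t Lanes> struct wide_backend {};   // emulated N-lane register

// Primary template: every backend supplies a register type and primitives.
template <typename Backend, typename T> struct simd;

// Specialization for the scalar backend.
template <typename T> struct simd<scalar_backend, T> {
    using reg = T;
    static constexpr std::size_t lanes = 1;
    static reg load(const T* p) { return *p; }
    static void store(T* p, reg r) { *p = r; }
    static reg add(reg a, reg b) { return a + b; }
};

// Partial specialization for the emulated wide backend. A real backend
// would call the platform's intrinsics here instead of looping.
template <std::size_t Lanes, typename T> struct simd<wide_backend<Lanes>, T> {
    using reg = std::array<T, Lanes>;
    static constexpr std::size_t lanes = Lanes;
    static reg load(const T* p) {
        reg r{};
        for (std::size_t i = 0; i < Lanes; ++i) r[i] = p[i];
        return r;
    }
    static void store(T* p, const reg& r) {
        for (std::size_t i = 0; i < Lanes; ++i) p[i] = r[i];
    }
    static reg add(const reg& a, const reg& b) {
        reg r{};
        for (std::size_t i = 0; i < Lanes; ++i) r[i] = a[i] + b[i];
        return r;
    }
};

// Hardware-oblivious application code: written once against the common
// interface, specialized by choosing the Backend template argument.
template <typename Backend, typename T>
void add_arrays(const T* a, const T* b, T* out, std::size_t n) {
    using S = simd<Backend, T>;
    std::size_t i = 0;
    for (; i + S::lanes <= n; i += S::lanes) {
        S::store(out + i, S::add(S::load(a + i), S::load(b + i)));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail for leftovers
}
```

Calling `add_arrays<scalar_backend>(a, b, out, n)` and `add_arrays<wide_backend<4>>(a, b, out, n)` runs the same application code on two different backends, which is the property that later allows an FPGA backend to be slotted in.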

This approach has given rise to many SIMD libraries and abstraction layers:

Examples of SIMD libraries
Google Highway (open-source)
xsimd (C++ wrappers for SIMD intrinsics)

With such libraries, SIMDified code is written once and the abstraction library specializes it for the target SIMD platform. This makes SIMD instructions and their abstraction a good fit for heterogeneous development environments.

Accelerating with FPGAs

FPGAs accelerate software at low cost and low power. Traditionally, programming FPGAs required a strong understanding of digital design concepts and hardware description languages such as VHDL or Verilog. Because of this programming complexity and limited code portability, FPGA-based solutions have been harder to access and more specialized than CPU- or GPU-based computing platforms. Intel oneAPI changes this.

Intel oneAPI is a software development kit that unifies CPU, GPU, and FPGA programming. It supports C++, Fortran, Python, and Data Parallel C++ (DPC++) for heterogeneous computing, with the goal of improving performance and productivity and reducing development time.


Since Intel oneAPI can target FPGAs from SYCL/C++, software developers are increasingly interested in using them for data processing. Adding an FPGA as a backend to the SIMD abstraction library lets existing SIMDified applications run on FPGAs.

SIMD and FPGAs go together

Annotations let the Intel DPC++ compiler synthesize C++ code into circuits and auto-vectorize data-parallel processing. Annotating arrays in the code so that they are implemented as registers on the FPGA removes data access constraints and allows parallel processing from source to sink. This makes SIMD acceleration on FPGAs straightforward and configurable.

SIMD abstraction libraries are a logical choice for SIMD processing on FPGAs. These libraries already support the SIMD instruction set extensions from Intel and ARM. The TSL abstraction library simplifies implementing SIMD instructions on an FPGA. In a generic element-wise addition, for instance, scalar code describes loading the registers, and an unroll pragma tells the DPC++ compiler to implement all lanes in parallel.
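A rough sketch of such an element-wise addition follows. It is an illustration under stated assumptions rather than TSL code: `fpga_register`, `add_elementwise`, and the `UNROLL_LOOP` macro are hypothetical names, the register is modeled as a `std::array`, and the macro expands to `#pragma unroll` only on Clang-based compilers (which includes the DPC++ compiler) so that the file still builds elsewhere.

```cpp
#include <array>
#include <cstddef>

// Expands to "#pragma unroll" on Clang-based compilers and to nothing on
// others, so the sketch stays portable. (Macro name is hypothetical.)
#if defined(__clang__)
#define UNROLL_LOOP _Pragma("unroll")
#else
#define UNROLL_LOOP
#endif

// Hypothetical register type: N lanes of T. On an FPGA target the compiler
// can map a small fixed-size array like this onto registers.
template <typename T, std::size_t N>
using fpga_register = std::array<T, N>;

// Element-wise addition written as plain scalar code. N is a compile-time
// constant, so a full unroll yields N independent adders, one per lane.
template <typename T, std::size_t N>
fpga_register<T, N> add_elementwise(const fpga_register<T, N>& a,
                                    const fpga_register<T, N>& b) {
    fpga_register<T, N> result{};
    UNROLL_LOOP
    for (std::size_t i = 0; i < N; ++i) {
        result[i] = a[i] + b[i];  // no lane depends on any other lane
    }
    return result;
}
```

On a CPU this compiles to an ordinary loop. The point of the sketch is that the register width `N` is just a template parameter, so widening the emulated SIMD register needs no change to the function body.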

Element-wise addition has no dependencies between lanes, and comparable implementations work for SIMD instructions such as scatter, gather, and store. More complex instructions can also be accelerated with some optimization.

A horizontal reduction requires a compile-time adder tree of depth log2(N), where N is the number of entries. Unroll pragmas with compile-time constants can implement adder trees in a scalable manner.

Software that already calls a SIMD abstraction library can accelerate its SIMD instructions on Intel FPGAs once implementations like these are added to the library as a backend.
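One way to picture such an adder tree is the sketch below. It is an assumption about the general shape, not the implementation from the paper or from TSL: `horizontal_add`, `tree_depth`, and `UNROLL_LOOP` are hypothetical names, and the lane count is restricted to powers of two to keep the tree regular.

```cpp
#include <array>
#include <cstddef>

// Expands to "#pragma unroll" on Clang-based compilers and to nothing on
// others, so the sketch stays portable. (Macro name is hypothetical.)
#if defined(__clang__)
#define UNROLL_LOOP _Pragma("unroll")
#else
#define UNROLL_LOOP
#endif

// Number of tree levels for n lanes: log2(n) for a power of two.
constexpr std::size_t tree_depth(std::size_t n) {
    return n <= 1 ? 0 : 1 + tree_depth(n / 2);
}

// Horizontal reduction as an adder tree. Each pass of the outer loop is
// one tree level and halves the number of live partial sums; all additions
// inside one level are independent of each other.
template <typename T, std::size_t N>
T horizontal_add(std::array<T, N> lanes) {
    static_assert(N > 0 && (N & (N - 1)) == 0,
                  "this sketch assumes a power-of-two lane count");
    UNROLL_LOOP
    for (std::size_t stride = N / 2; stride > 0; stride /= 2) {
        UNROLL_LOOP
        for (std::size_t i = 0; i < stride; ++i) {
            lanes[i] += lanes[i + stride];
        }
    }
    return lanes[0];
}
```

For eight lanes the outer loop runs three times (strides 4, 2, 1), matching the log2(N) depth. Because both loop bounds are compile-time constants, a compiler that honors the unroll hint can flatten the whole tree.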

The Intel FPGA Board Support Package (BSP) adds system-level benefits. The BSP describes the hardware interfaces of the board and provides a shell for the kernels.

The BSP enables SYCL Unified Shared Memory (USM), which lets the accelerator exchange data directly with host memory and frees the CPU from managing data transfers. The FPGA can then act as a coprocessor.

Because the BSP is pre-compiled, only the kernel logic has to be generated, which reduces compilation time.

Intel FPGAs are well suited to SIMD and streaming applications such as modern composable databases because of their C++/SYCL compatibility, the offloading of data transfers from the CPU, and pre-compiled BSPs.

SIMD/FPGA simplicity

At SiMoD @ SIGMOD 2023 in Seattle, USA, Dirk Habich, Alexander Krause, Johannes Pietrzyk, and Wolfgang Lehner of TU Dresden presented their paper “Simplicity done right for SIMDified query processing on CPU and FPGA” on using FPGAs to accelerate SIMD instructions. The work, supported by Intel’s Christian Färber, shows that developing a SIMDified kernel for an FPGA is practical and efficient while still achieving top performance.

The paper evaluated FPGA acceleration of SIMD instructions on a dual-socket 3rd-generation Intel Xeon Scalable processor (code-named “Ice Lake”) with 36 cores and a base frequency of 2.2 GHz, paired with a BittWare IA-840f acceleration card carrying an Intel Agilex 7 AGF027 FPGA and 4x 16 GB of DDR4 memory.

First, they gradually increased the register width of the SIMD instance to see how it affected the maximum achievable bandwidth. The first case, a simple aggregation, showed that the FPGA accelerator’s bandwidth grows with each doubling of the data width until the global memory bandwidth saturates, which is the ideal acceleration case.

The second case, a filter-count kernel with a data dependency in the last stage of the adder tree, showed similar behavior but saturated earlier, at the PCIe link width. Both cases demonstrate the considerable speedups of natively parallel instructions on a highly parallel architecture and suggest that wide memory accesses can sustain those gains.

Finally, the team compared the FPGA with the CPU. Both ran the same multi-threaded, AVX-512-based filter-count kernel. As expected, per-core CPU bandwidth decreased as the thread count and the number of CPU cores in use grew. The FPGA sustained its peak performance across all workloads.

Building on this work, the TU Dresden and Intel team investigated how TSL can be used to turn an FPGA into a custom SIMD processor.
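A filter-count kernel of this kind can be pictured as a per-lane comparison followed by a reduction of the match flags. The sketch below is a guess at that general shape, not the kernel evaluated in the paper: the less-than predicate, the names `filter_count_register` and `filter_count`, and the `UNROLL_LOOP` macro are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>

// Expands to "#pragma unroll" on Clang-based compilers and to nothing on
// others, so the sketch stays portable. (Macro name is hypothetical.)
#if defined(__clang__)
#define UNROLL_LOOP _Pragma("unroll")
#else
#define UNROLL_LOOP
#endif

// Counts the lanes of one emulated register that satisfy value < threshold.
template <typename T, std::size_t N>
std::size_t filter_count_register(const std::array<T, N>& lanes, T threshold) {
    static_assert(N > 0 && (N & (N - 1)) == 0,
                  "this sketch assumes a power-of-two lane count");
    std::array<std::size_t, N> hits{};
    UNROLL_LOOP
    for (std::size_t i = 0; i < N; ++i) {
        hits[i] = lanes[i] < threshold ? 1 : 0;  // independent per lane
    }
    // Adder tree over the match flags, log2(N) levels deep.
    UNROLL_LOOP
    for (std::size_t stride = N / 2; stride > 0; stride /= 2) {
        UNROLL_LOOP
        for (std::size_t i = 0; i < stride; ++i) {
            hits[i] += hits[i + stride];
        }
    }
    return hits[0];
}

// Streams over the input one register at a time. Adding each register's
// count to `total` is the only step that depends on the previous iteration.
template <std::size_t N, typename T>
std::size_t filter_count(const T* data, std::size_t n, T threshold) {
    std::size_t total = 0;
    std::size_t i = 0;
    for (; i + N <= n; i += N) {
        std::array<T, N> reg{};
        for (std::size_t lane = 0; lane < N; ++lane) reg[lane] = data[i + lane];
        total += filter_count_register(reg, threshold);
    }
    for (; i < n; ++i) total += data[i] < threshold ? 1 : 0;  // scalar tail
    return total;
}
```

A call such as `filter_count<8>(data, n, threshold)` fixes the emulated register width at compile time. Doubling that template argument is the software analogue of the width sweep described above.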
