Even with a high-level development methodology, an HPC workload achieved its best performance on an Intel Agilex 7 FPGA.
CPUs and GPUs are usually the primary targets for high performance computing (HPC) workloads, but this is beginning to change with the availability of powerful Intel FPGA accelerators. As this case study illustrates, an HPC workload can attain its best performance on an FPGA even with a high-level development workflow. The case study describes the application, the tool flow, and the migration to the most recent Intel Agilex 7 FPGA technology. Other researchers may find this useful for their own HPC applications.
Introduction
Molecular modeling has become an increasingly important technique in contemporary scientific research. With computational methods, scientists can build intricate models of molecules and investigate their properties and behavior. Building these models requires billions of calculations, so researchers must overcome significant computational obstacles to reach their goals.
Molecular Modeling Difficulties
The main goal of the Institute for Advanced Chemistry of Catalonia (CSIC) is to create novel methods that advance the study of molecular structures. In biomedicine, knowledge of molecular structure can be applied to develop novel drugs and therapies. The process for creating computational molecular models and confirming their structure is well known and understood.
Confirming a molecule's structure is an iterative process with the following steps:
- Build a computational model of the molecule.
- Compute the model's spectrum.
- Compare the computed spectrum with the known analytical spectrum of the compound.
- If the spectra match, the molecule's structure is confirmed.
- If they don't match, refine the computational model and compute a new spectrum.
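The steps above can be sketched as a simple refinement loop. This is only an illustration of the control flow: the source does not show the CSIC code, so `compute_spectrum`, `refine_model`, the tolerance `tol`, and the iteration cap `max_iters` are all hypothetical placeholders supplied by the caller.

```python
def spectra_match(computed, reference, tol):
    """True when the spectra have equal length and agree point by point within tol."""
    return len(computed) == len(reference) and all(
        abs(c - r) <= tol for c, r in zip(computed, reference)
    )


def confirm_structure(model, compute_spectrum, refine_model, reference,
                      tol=1e-3, max_iters=10):
    """Iterate model -> spectrum -> compare -> refine.

    compute_spectrum and refine_model are hypothetical callables; in the
    case study, computing the spectrum is the expensive step.
    Returns (confirmed_model, iterations) or (None, max_iters) on failure.
    """
    for iteration in range(1, max_iters + 1):
        computed = compute_spectrum(model)              # expensive step
        if spectra_match(computed, reference, tol):
            return model, iteration                     # structure confirmed
        model = refine_model(model, computed, reference)
    return None, max_iters


if __name__ == "__main__":
    # Toy stand-in: the "model" is a single scale factor applied to a base spectrum.
    base = [1.0, 2.0, 3.0]
    reference = [2.0, 4.0, 6.0]
    result = confirm_structure(
        1.0,
        lambda m: [m * x for x in base],
        lambda m, computed, ref: m * (ref[-1] / computed[-1]),
        reference,
    )
    print(result)
```

Every pass through the loop pays the full cost of `compute_spectrum`, which is why the run time of that one step determines how many refinements fit into a day.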
This methodology’s computing burden during the verification stage is its main drawback.
Analyzing a simple molecule with two million data points takes 9644.544 seconds, or roughly two hours and forty-one minutes, on an Intel Xeon Gold processor (3.2 GHz, one core, one thread). This drastically slows development, permitting only three or four iterations per working day.
Given that this was an obvious high performance computing problem, the CSIC scientists sought assistance from the Barcelona Supercomputing Center (BSC).
HPC: Optimal for FPGAs
BSC is well known for its world-class computing resources, especially its heterogeneous data center design, which includes CPUs, GPUs, and FPGAs. The BSC team knew a CPU would be too slow, so they investigated whether an FPGA or a GPU would be better suited to the task.
The BSC team analyzed the algorithm, which was implemented as an OpenCL kernel.
Key Algorithm Properties:
- The algorithm uses single-precision and double-precision floats, which both GPUs and FPGAs support efficiently.
- The algorithm contains nested for-loops whose upper bound, or potential upper bound, is N (two million in this case).
- Fully unrolling all of the for-loops on a GPU would require a very large device, a great deal of hardware, and substantial power.
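To see why nested loops bounded by N are so demanding, consider a generic pairwise accumulation. The actual kernel is not shown in the source, so the loop body here (`abs(a - b)`) and the parameters `tile` and `unroll` are assumptions chosen purely for illustration. If the loops are two deep, N = 2,000,000 implies on the order of N² = 4 × 10¹² inner iterations. The sketch shows how tiling and partial unrolling restructure such a loop without changing its result: each unrolled group is work that hardware could perform side by side.

```python
def pairwise_naive(points):
    """Reference version: two nested loops, each bounded by N = len(points)."""
    total = 0
    for a in points:
        for b in points:
            total += abs(a - b)
    return total


def pairwise_tiled(points, tile, unroll):
    """Same computation, restructured with a tiled and partially unrolled inner loop.

    tile   -- number of inner-loop elements handled per tile (hypothetical parameter)
    unroll -- number of inner-loop elements handled per step (hypothetical parameter)
    Returns (total, steps), where steps counts unrolled groups executed.
    """
    total = 0
    steps = 0
    for start in range(0, len(points), tile):       # walk the inner index tile by tile
        block = points[start:start + tile]          # a tile small enough for local memory
        for a in points:
            for k in range(0, len(block), unroll):
                # One unrolled group: these operations are independent of each
                # other, so parallel hardware could evaluate them in the same step.
                total += sum(abs(a - b) for b in block[k:k + unroll])
                steps += 1
    return total, steps


if __name__ == "__main__":
    pts = list(range(8))
    print(pairwise_naive(pts))
    print(pairwise_tiled(pts, tile=4, unroll=2))
```

With an unroll factor of 2, the restructured loop takes half as many steps as the naive version takes iterations. Unrolling everything would remove the loop entirely, at the cost of replicating the loop body N times in hardware.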
Could an FPGA implement the algorithm more efficiently? And would developing the workload for an FPGA be difficult?
HPC Benefits of FPGA
FPGAs can handle workloads for which GPUs are not optimized. Although FPGAs implement algorithms in custom hardware, software developers can now access FPGA performance through high-level tool flows such as oneAPI.
FPGAs offer the following advantages:
- Performance: The highly flexible architecture of an FPGA can be shaped to fit the algorithm. This means the algorithm does not have to be adapted to a fixed architecture, as it would on a CPU or GPU. Incoming data can also be processed directly, without passing through the CPU.
- Programmability: FPGA workloads can be updated dynamically to reflect the latest advances in algorithms. With their abundant I/O resources, FPGAs can distribute heavy workloads across several devices and run applications in parallel.
- Productivity: Open FPGA Stack (OFS) boosts productivity by making it simple to install and configure OFS-based cards in existing servers.
- Power: Power consumption is a key concern. FPGAs can run at lower clock rates and complete work in fewer clock cycles than CPUs and GPUs, which lowers both power consumption and cost.
- Cost: FPGA acceleration cards come in a variety of options. Installing FPGA cards to increase performance can be more affordable than upgrading current solutions.
The BSC HPC team concluded that an FPGA would be the most suitable device for this algorithm.
Earlier Accelerators
The BSC HPC acceleration team already had access to two Intel Programmable Acceleration Card (Intel PAC) solutions:
- Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA
- Intel FPGA PAC D5005
Using the existing OpenCL code, they computed the spectra on the Intel PACs.
The table below shows the results, which can be summarized as follows:
- In the first results, produced on the Intel PAC with the Intel Arria 10 GX FPGA, the kernel execution time dropped from roughly 10,000 seconds to 540.457 seconds, a 17.8x speedup over the CPU.
- BSC then took advantage of the FPGA's flexible architecture and converted the 64-bit double-precision floating-point accumulator into a 40-bit integer data type, with an acceptable loss of accuracy. Replacing floating-point operations with arbitrary-precision data types, which only an FPGA can implement natively, lowered the processing time further to 274.02 seconds.
- Repeating this procedure on the Intel FPGA PAC D5005, which is based on an Intel Stratix 10 device, lowered the processing time further to about 81 seconds.
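The accumulator change can be illustrated with a small fixed-point emulation. The source gives only the accumulator width (40 bits), so the number of fractional bits (`FRAC_BITS = 20`) and the rounding mode below are assumptions, not the format BSC used. The sketch quantizes each sample to an integer, adds it into a register that wraps at 40 bits as hardware would, and converts the sum back to a real value.

```python
ACC_BITS = 40    # accumulator width stated in the case study
FRAC_BITS = 20   # assumed split between integer and fractional bits


def to_fixed(x, frac_bits=FRAC_BITS):
    """Quantize a real value to an integer count of 2**-frac_bits units."""
    return round(x * (1 << frac_bits))


def wrap_signed(value, bits=ACC_BITS):
    """Wrap an integer to a signed two's-complement register of the given width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def accumulate_fixed(samples, frac_bits=FRAC_BITS, bits=ACC_BITS):
    """Sum samples using integer additions only, then convert back to a float."""
    acc = 0
    for x in samples:
        acc = wrap_signed(acc + to_fixed(x, frac_bits), bits)
    return acc / (1 << frac_bits)


if __name__ == "__main__":
    samples = [0.1] * 1000
    fixed = accumulate_fixed(samples)
    print(fixed, abs(fixed - 100.0))   # small, bounded quantization error
```

Each sample contributes at most half a quantization step of error, so the total error for M samples is bounded by M × 2^-(FRAC_BITS + 1), provided the sum stays within the 40-bit range. That is the kind of trade-off the phrase "acceptable accuracy loss" refers to: integer adders are much cheaper in FPGA logic than double-precision floating-point adders.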
Platform | Tile Size | Unroll Factor | Target Fmax (MHz) | Actual Fmax (MHz) | Inner Loop Latency | Kernel Execution Time in Seconds (N=2,000,000) | Gain |
---|---|---|---|---|---|---|---|
CPU¹ | | | | 3200 | | 9644.544 | 1.0x |
Arria 10 PAC | 256 | 16 | 200 | 232 | 14 | 540.457 | 17.8x |
Arria 10 PAC² | 4096 | 32 | 200 | 244 | 5 | 274.020 | 35.2x |
D5005 | 1024 | 64 | 180 | 241 | 16 | 130.012 | 74.2x |
D5005² | 8192 | 128 | 270 | 255 | 8 | 81.573 | 118.2x |
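The gain figures can be checked by dividing the CPU run time by each kernel execution time. The short script below does that with the times quoted in this case study; the dictionary labels are descriptive names chosen here, not identifiers from the source.

```python
CPU_TIME_S = 9644.544  # single-core CPU baseline from the case study

# Kernel execution times in seconds, as quoted in the text and table.
KERNEL_TIMES_S = {
    "Arria 10 PAC": 540.457,
    "Arria 10 PAC, 40-bit accumulator": 274.02,
    "D5005": 130.012,
    "D5005, 40-bit accumulator": 81.573,
}


def gain(cpu_time_s, kernel_time_s):
    """Speedup relative to the CPU baseline."""
    return cpu_time_s / kernel_time_s


if __name__ == "__main__":
    for platform, seconds in KERNEL_TIMES_S.items():
        print(f"{platform}: {gain(CPU_TIME_S, seconds):.1f}x")
```

Rounded to one decimal place, the results are 17.8x, 35.2x, 74.2x, and 118.2x, matching the reported gains.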
The FPGA results are outstanding. However, they were produced with OpenCL, which was first released sixteen years ago, and with two older generations of FPGA technology. The BSC team wondered whether the newest hardware and software might help them perform even better.
An Adaptable Composable Approach for the Contemporary Data Center
Integrated Agilex FPGAs in Modern Silicon
Intel Agilex 7 F-Series
The Agilex 7 is the latest Intel FPGA. The Intel Agilex 7 F-Series targets high-performance computing across a range of applications. Its advanced architecture, which combines the advantages of an FPGA and a CPU, delivers high throughput and low latency.
Power efficiency is another advantage of the Intel Agilex 7 F-Series. Power management features built into the architecture reduce power consumption without sacrificing performance. The FPGA fabric is built on a 10 nm manufacturing process, which enables high-performance computing at low power. This efficiency is essential for applications such as data centers, edge computing, and autonomous vehicles, which need high-performance computation at the lowest possible power consumption.
Security is also a major concern in any contemporary data center, and the Intel Agilex 7 F-Series offers enhanced security features to address it. An embedded security subsystem built into the design provides secure boot and runtime security, and an integrated hardware root of trust verifies the authenticity of the system. These features are crucial for organizations that handle sensitive data, such as financial institutions, government agencies, and healthcare providers.
Current Toolchain: Intel oneAPI Base Toolkit
The Intel oneAPI Base Toolkit (Base Kit) simplifies the development of high-performance, cross-architecture applications. It targets CPUs, GPUs, FPGAs, and AI accelerators, and it is based on SYCL.
SYCL, pronounced "sickle," is an open standard maintained by The Khronos Group. It is a royalty-free, cross-platform abstraction layer that lets developers use ISO C++ to write programs for heterogeneous processors. Host and kernel code can coexist in a single source file.
One of oneAPI's primary advantages is that it simplifies development. With the Intel oneAPI Base Toolkit, developers can build applications that run on several architectures without having to learn a new programming language. Writing code once and running it on multiple processors saves programmers a great deal of time.
oneAPI's cross-architecture compatibility also helps developers future-proof their applications by supporting many processors. This matters especially to developers who need their applications to work across varied hardware configurations, which makes oneAPI a good fit for a facility like BSC.