Intel MPI Library on Google Cloud
High performance computing (HPC) fuels innovation across many industries. To name a few benefits, HPC uses simulation to shorten product design cycles, improve product safety, deliver accurate real-time weather forecasts, train AI foundation models, and drive scientific breakthroughs in a variety of fields. HPC tackles these computationally intensive problems by running large numbers of servers, virtual machines, or compute units in close coordination, communicating via the Message Passing Interface (MPI). In this blog post, we show how the Intel MPI Library was used to improve HPC performance on Google Cloud.
For demanding workloads, Google Cloud provides a variety of virtual machine (VM) families, such as compute-optimized H3 VMs, which are well suited to HPC applications. These VMs combine the latest advances in compute, networking, and storage onto a single platform, are further optimized with Intel software tools, and include Google’s Titanium technology, which enables sophisticated network offloads and other capabilities.
In third-generation VMs such as H3, C3, C3D, and A3, the Intel Infrastructure Processing Unit (IPU) E2000 securely delivers low-latency 200G Ethernet by offloading networking from the CPU onto a dedicated device. With integrated support for Titanium in the Intel MPI Library, these network-offload benefits extend to HPC workloads such as computational geoscience, molecular dynamics, weather forecasting, front-end and back-end electronic design automation (EDA), computer-aided engineering (CAE), and computational fluid dynamics (CFD). The latest version of the Intel MPI Library is available in the Google Cloud HPC VM Image.
Optimized MPI Library for Titanium and Third Generation VMs
The Intel MPI Library is a multi-fabric message-passing library that implements the MPI API standard. Based on the open-source MPICH project, this commercial-grade MPI implementation handles fabric-specific communication through the OpenFabrics Interfaces (OFI, also known as libfabric). Numerous libfabric providers are available, each tailored to a particular combination of fabrics and protocols.
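For readers less familiar with MPI itself, the sketch below shows what a minimal message-passing program looks like. It is a generic MPI example rather than code from the Intel MPI Library or from any of the applications discussed here; the same source builds against any MPI implementation, for example with the mpicc wrapper shipped with Intel MPI.

```c
/* hello_mpi.c -- minimal MPI point-to-point sketch (generic example).
 * Build: mpicc hello_mpi.c -o hello_mpi
 * Run:   mpirun -n 2 ./hello_mpi
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "Run with at least 2 ranks.\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        int payload = 42;
        /* Rank 0 sends one integer to rank 1 over whichever fabric
         * the underlying libfabric provider supplies. */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload = 0;
        /* Rank 1 receives the integer from rank 0. */
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Rank 1 of %d received %d from rank 0\n", size, payload);
    }

    MPI_Finalize();
    return 0;
}
```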
In particular, version 2021.11 of the Intel MPI Library enhances the PSM3 provider and adds tunings for the PSM3 and OFI/TCP providers for the Google Cloud environment, including the Intel IPU E2000. Intel MPI Library 2021.11 also takes advantage of the high core counts and advanced features of 4th Generation Intel Xeon Scalable Processors, and supports newer Linux OS distributions and library and application versions. Together, these enhancements give third-generation VMs with Titanium access to new performance and application capabilities.
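The libfabric provider that Intel MPI uses can be selected through environment variables such as I_MPI_OFI_PROVIDER. The snippet below is a hedged sketch: it requests the PSM3 provider from inside the program before MPI_Init and prints the library version string to confirm which Intel MPI build is in use. Exporting the variable in the job script before mpirun is the more usual approach, and whether a setenv call at this point is honored can depend on how the job is launched.

```c
/* provider_check.c -- sketch: request the PSM3 OFI provider and report
 * the MPI library version string (which, with Intel MPI, identifies the
 * library release, e.g. 2021.11).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Assumption: the launcher has not already fixed the provider; the
     * common alternative is to export I_MPI_OFI_PROVIDER=psm3 before mpirun. */
    setenv("I_MPI_OFI_PROVIDER", "psm3", 1);

    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        char version[MPI_MAX_LIBRARY_VERSION_STRING];
        int len = 0;
        /* Standard MPI-3 call available in any recent MPI library. */
        MPI_Get_library_version(version, &len);
        printf("%s\n", version);
    }

    MPI_Finalize();
    return 0;
}
```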
Improving HPC application efficiency
Applications such as Siemens Simcenter STAR-CCM+ software use parallel computing to reduce time-to-solution. For instance, if the same problem can be solved in half the time with twice the computational resources, parallel scaling is 100% efficient and the speedup is 2x relative to using half the resources. In practice, a 2x speedup per doubling may not be achievable for a number of reasons, including inter-node communication overhead or insufficient exposed parallelism. Improving the communication library directly addresses the former.
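In formula form (the notation here is ours, not from the benchmark report): with T(n) the runtime on n VMs and n_ref the smallest VM count measured, speedup and parallel efficiency are

```latex
S(n) = \frac{T(n_{\mathrm{ref}})}{T(n)}, \qquad
E(n) = \frac{S(n)}{n / n_{\mathrm{ref}}}
```

Doubling from n_ref to 2 n_ref VMs while halving the runtime gives S = 2 and E = 100%, matching the example above.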
Intel MPI Benchmarks
To demonstrate the performance gains of the new Intel MPI Library, Google and Intel evaluated Simcenter STAR-CCM+ on H3 instances using a number of common benchmarks. The figure shows five of these benchmarks scaling up to 32 VMs (2,816 cores). All tested cases achieve good speedups; the only benchmark that stops scaling beyond 16 nodes is LeMans_Poly_17M, the smallest case, whose limited problem size exposes too little parallelism for a communication library to compensate for. Several benchmarks (LeMans_100M_Coupled and AeroSuvSteadyCoupled106M) even show superlinear scaling at certain VM counts, most likely because of the larger aggregate cache available.
To demonstrate the improvements of Intel MPI 2021.11 over Intel MPI 2021.7, Google Cloud computed the ratio of their runtimes for each run. This speedup ratio, shown in the table below, is the parallel runtime of the slower version (2021.7) divided by the parallel runtime of the faster version (2021.11).
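Expressed as a formula (again in our own notation): with T_2021.7(N) and T_2021.11(N) the parallel runtimes of the two library versions on N VMs,

```latex
\mathrm{ratio}(N) = \frac{T_{2021.7}(N)}{T_{2021.11}(N)}
```

so a value greater than 1 means the newer library is faster; for example, a ratio of 2 means the 2021.11 run finished in half the time of the 2021.7 run at the same VM count.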
The table shows that the optimized Intel MPI 2021.11 delivers greater parallel scalability and absolute performance for nearly all workloads and node counts. The benefit is already noticeable at just two VMs (up to a 1.06x improvement) and grows substantially at higher VM counts (between 2.42x and 5.27x at 32 VMs). This efficiency gain translates into shorter time-to-solution and lower costs. The remarkable 11.53x gain for the smallest test (LeMans_Poly_17M) at 16 VMs shows that, unlike the previous version, the improved MPI library enables good scaling up to 16 VMs for this case.
These results show that the optimized Intel MPI Library makes Simcenter STAR-CCM+ more scalable on Google Cloud, enabling end users to solve problems faster and make better use of their cloud resources.
The benchmarks were run with Intel MPI 2021.7 using its TCP provider and with Intel MPI 2021.11 using the PSM3 libfabric provider. Simcenter STAR-CCM+ version 2306 (18.06.006) was evaluated on Google Cloud H3 instances running CentOS Linux release 7.9.2009, with 88 MPI processes per node and 200 Gbps networking.