Wednesday, July 17, 2024

Intel Gaudi AI Accelerator Dominates GPT-3 in Speed

Intel Gaudi AI Accelerator

MLCommons released the results of the industry-standard MLPerf Training v3.1 benchmark for training AI models, for which Intel submitted results for Intel Gaudi2 accelerators and 4th Gen Intel Xeon Scalable processors with Intel Advanced Matrix Extensions (Intel AMX). Using the FP8 data type on the v3.1 GPT-3 training benchmark, Intel Gaudi2 showed a 2x performance increase. The submissions reaffirm Intel’s stated goal of providing competitive AI solutions and bringing AI everywhere.

The latest MLCommons MLPerf results build on Intel’s AI performance in the June MLPerf training results. Only three accelerator solutions have MLPerf results, and only two of them are commercially available; Intel Gaudi2 is one of those two. The Intel Xeon processor remains the only CPU with reported MLPerf results.

Intel Gaudi2 and 4th Gen Xeon processors deliver strong AI training performance across a range of hardware configurations, addressing a wide range of customer AI computing requirements.

About the Intel Gaudi2 Results: Gaudi2 is positioned as the only viable alternative to NVIDIA’s H100 for AI compute needs, with a notable price-performance ratio. The MLPerf results for Gaudi2 demonstrated the accelerator’s improving training performance:

When the FP8 data type was implemented on the v3.1 training GPT-3 benchmark, Gaudi2 showed a 2x increase in performance. In comparison to the June MLPerf benchmark, the training time was halved, taking 153.58 minutes on 384 Intel Gaudi2 accelerators. Both E5M2 and E4M3 formats for FP8 are supported by the Gaudi2 accelerator, with the option for delayed scaling as needed.

Using BF16, Intel Gaudi2 trained the multi-modal Stable Diffusion model on 64 accelerators in 20.2 minutes.

Although FP8 was limited to GPT-3 in this MLPerf training submission and GPT-J in the prior inference submission, Intel is now supporting more models for both training and inference with its Gaudi2 software and tools.

Benchmark times for BERT and ResNet-50 using BF16 were 13.27 and 15.92 minutes, respectively, on eight Intel Gaudi2 accelerators.
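
The FP8 item above mentions the E5M2 and E4M3 formats and optional delayed scaling. As a rough illustration of the idea only, the Python sketch below keeps a short history of absolute maxima and derives the scale for the current step from earlier steps. It assumes the maximum finite magnitudes of the OCP FP8 convention (448 for E4M3, 57344 for E5M2); the ranges Gaudi2 actually uses and Intel’s software implementation may differ. The names DelayedScaler and scale_and_clip are hypothetical and do not come from any Intel library.

```python
from collections import deque

# Assumed maximum finite magnitudes (OCP FP8 convention); real hardware may differ.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


class DelayedScaler:
    """Tracks recent absolute maxima and derives a scale from past steps only."""

    def __init__(self, fmt="E4M3", history_len=4):
        self.fp8_max = FP8_MAX[fmt]
        self.amax_history = deque(maxlen=history_len)

    def scale(self):
        # Before any history exists, fall back to a neutral scale of 1.0.
        if not self.amax_history:
            return 1.0
        amax = max(self.amax_history)
        return self.fp8_max / amax if amax > 0 else 1.0

    def update(self, values):
        # Record this step's absolute maximum for use in later steps.
        self.amax_history.append(max(abs(v) for v in values))


def scale_and_clip(values, scaler):
    """Scale values with the delayed factor and clip them to the FP8 range.

    Rounding to actual 8-bit codes is omitted; only range handling is shown.
    """
    s = scaler.scale()
    limit = scaler.fp8_max
    scaled = [min(max(v * s, -limit), limit) for v in values]
    scaler.update(values)
    return scaled, s


if __name__ == "__main__":
    scaler = DelayedScaler("E4M3", history_len=2)
    for step_values in ([1.0, -2.0], [4.0], [0.5, 3.0]):
        scaled, s = scale_and_clip(step_values, scaler)
        print(f"scale={s}, scaled={scaled}")
```

On the first call there is no history, so the scale is 1.0. Later calls use maxima recorded in earlier steps, which is what makes the scaling "delayed": a value larger than anything seen so far can exceed the range and is clipped until the history catches up.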

About the 4th Gen Xeon Results: Intel remains the only CPU vendor submitting MLPerf results. The MLPerf results for 4th Gen Xeon highlighted its strong performance:

Intel submitted results for ResNet-50, RetinaNet, BERT, and DLRM dcnv2. The results for ResNet-50, RetinaNet, and BERT on 4th Gen Intel Xeon Scalable processors were similar to the strong out-of-box performance results submitted for the June 2023 MLPerf benchmark.

DLRM dcnv2 is a new model since the June submission, with the CPU posting a time-to-train of 227 minutes using just four nodes.

Performance of 4th generation Xeon processors shows that many enterprise organizations can train small to mid-sized deep learning models on their current enterprise IT infrastructure using general-purpose CPUs in an economical and sustainable manner, particularly for use cases where training is an intermittent workload.

What’s Next: With software updates and optimizations, Intel expects further advances in AI performance results in upcoming MLPerf benchmarks. Intel’s AI products give customers additional options for AI solutions that match changing needs for usability, performance, and efficiency.

Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She holds a postgraduate degree in business administration and is an Artificial Intelligence enthusiast.

