Friday, November 8, 2024

MLPerf Inference v4.1 For AMD Instinct MI300X Accelerators


Engineering Insights: Introducing AMD Instinct MI300X Accelerators’ MLPerf Results. AMD Instinct MI300X GPUs, powered by one of the latest releases of the open-source ROCm software stack, delivered strong results in the MLPerf Inference v4.1 round, demonstrating the capability of the full-stack AMD inference platform.

LLaMA2-70B

The first submission focused on the widely used LLaMA2-70B model, which is known for its high performance and versatility. By delivering Gen AI inference performance competitive with the NVIDIA H100, it set a strong benchmark for what AMD Instinct MI300X accelerators can do.


MLPerf Inference

Comprehending MLPerf and Its Relevance to the Industry

Efficient and economical performance is becoming more and more important for inference and training as large language models (LLMs) continue to grow in size and complexity. Robust parallel processing and an optimal software stack are necessary to achieve high-performance LLMs.

This is where MLPerf, a widely recognized benchmarking suite, comes into play. MLPerf Inference is a set of open-source AI benchmarks developed by MLCommons, a cross-industry consortium of which AMD is a founding member. The suite covers Gen AI, LLMs, and other models, and provides rigorous, peer-reviewed metrics. Businesses can use these benchmarks to assess the effectiveness of AI hardware and software.

A strong showing in MLPerf Inference v4.1 is a major accomplishment for AMD and demonstrates its commitment to transparency and to providing standardized data that helps businesses make informed decisions.

An Extensive Analysis of the LLaMA2-70B Benchmark

AMD used the LLaMA2-70B model for its first MLPerf Inference submission. A major development in LLMs, the LLaMA2-70B model is important for practical uses such as large-scale inference and natural language processing. The MLPerf benchmark test consisted of a Q&A scenario using 24,576 samples from the OpenORCA dataset, each with up to 1,024 input and output tokens. The benchmark evaluated inference performance in two scenarios:

  • Offline Scenario: Queries are processed in batches to maximize throughput in tokens per second.
  • Server Scenario: This scenario simulates real-time queries with strict latency limits (TTFT < 2s, TPOT ≤ 200ms, where TTFT is time to first token and TPOT is time per output token), testing the hardware’s ability to deliver fast, responsive performance for low-latency workloads.
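
As a rough illustration of the server-scenario limits, the sketch below computes TTFT and TPOT for a single query from token arrival timestamps and checks them against the 2 s and 200 ms thresholds quoted above. The function names and the timestamp format are assumptions made for this example; the MLPerf load generator measures and aggregates latency by its own rules, which are not reproduced here.

```python
from typing import Sequence, Tuple

TTFT_LIMIT_S = 2.0    # time to first token must be under 2 s (from the article)
TPOT_LIMIT_S = 0.200  # time per output token must be at most 200 ms


def latency_metrics(request_time: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds for one query.

    request_time: when the query was issued.
    token_times:  arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_time
    if len(token_times) == 1:
        return ttft, 0.0  # no inter-token gaps to average
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def meets_server_limits(request_time: float, token_times: Sequence[float]) -> bool:
    """Check one query against the hypothetical per-query limits above."""
    ttft, tpot = latency_metrics(request_time, token_times)
    return ttft < TTFT_LIMIT_S and tpot <= TPOT_LIMIT_S
```

A query whose first token arrives after 1 s and whose later tokens arrive every 100 ms would pass this check, while one with a 2.5 s first-token delay would not.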

Performance of AMD Instinct MI300X in MLPerf

With four key entries for the LLaMA2-70B model, the AMD Instinct MI300X delivered strong performance in its first MLPerf Inference submission, using the Supermicro AS-8125GS-TNMR2 system. These results are especially noteworthy because they are reproducible, vetted by peer review, grounded in industry-relevant use cases, and allow an apples-to-apples comparison with competing AI accelerators.

Combined CPU and GPU Performance

Submission ID 4.1-0002: Two AMD EPYC 9374F (Genoa) CPUs paired with eight AMD Instinct MI300X accelerators in the Available category.

This setup demonstrated the strong synergy between 4th Gen EPYC CPUs (previously codenamed “Genoa”) and AMD Instinct MI300X GPU accelerators for AI workloads, delivering performance within 2-3% of an NVIDIA DGX H100 with 4th Gen Intel Xeon CPUs in both the server and offline scenarios at FP8 precision.

Previewing Next-Generation CPU Performance

Submission ID 4.1-0070: Two AMD EPYC “Turin” CPUs and eight AMD Instinct MI300X accelerators in the Preview category.

This submission showcased the performance gains from the upcoming 5th Gen AMD EPYC “Turin” CPUs when paired with AMD Instinct MI300X GPU accelerators. In the server scenario, it outperformed the NVIDIA DGX H100 with Intel Xeon by a small margin, and it maintained comparable performance in the offline scenario at FP8 precision.

LLaMA2-70B GPU

Efficiency of a Single GPU

Submission ID 4.1-0001: One AMD Instinct MI300X accelerator with 4th Gen AMD EPYC 9374F (Genoa) CPUs in the Available category.

This submission highlighted the AMD Instinct MI300X’s large 192 GB memory, which allowed a single GPU to run the whole LLaMA2-70B model efficiently at FP8 precision, without the network overhead that comes with splitting the model across multiple GPUs.

Built on the AMD CDNA 3 architecture, the AMD Instinct MI300X has 192 GB of HBM3 memory and a peak memory bandwidth of 5.3 TB/s. This large capacity lets the AMD Instinct MI300X comfortably host and run a whole 70 billion parameter model, such as LLaMA2-70B, on a single GPU.
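
The capacity claim can be checked with simple arithmetic. The sketch below estimates the memory needed for the model weights alone, assuming a nominal 70 billion parameters, decimal gigabytes, 1 byte per parameter at FP8 and 2 bytes at FP16. It ignores activations and runtime overhead, so the headroom figure is only an upper bound on what is left for the KV cache.

```python
GB = 1e9  # decimal gigabytes, an assumption for this estimate


def weight_footprint_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB."""
    return num_params * bytes_per_param / GB


HBM_CAPACITY_GB = 192     # MI300X memory capacity (from the article)
LLAMA2_70B_PARAMS = 70e9  # nominal parameter count

fp8_weights_gb = weight_footprint_gb(LLAMA2_70B_PARAMS, 1)   # 70 GB
fp16_weights_gb = weight_footprint_gb(LLAMA2_70B_PARAMS, 2)  # 140 GB

# Upper bound on memory left for the KV cache when weights are stored in FP8.
fp8_headroom_gb = HBM_CAPACITY_GB - fp8_weights_gb           # 122 GB

print(f"FP8 weights:  {fp8_weights_gb:.0f} GB, headroom {fp8_headroom_gb:.0f} GB")
print(f"FP16 weights: {fp16_weights_gb:.0f} GB")
```

Under these assumptions the FP8 weights take roughly 70 GB, which leaves well over half of the 192 GB for the KV cache and other runtime data.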

With the ROCm software stack, scaling efficiency is nearly linear from 1x AMD Instinct MI300X (TP1) to 8x AMD Instinct MI300X (8x TP1), indicating that the AMD Instinct MI300X can handle the largest MLPerf inference model to date.
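
Scaling efficiency is commonly expressed as the multi-GPU throughput divided by the single-GPU throughput times the number of GPUs. The helper below is a minimal sketch of that definition; the function name is hypothetical, and no measured throughput figures are included because the article does not give them.

```python
def scaling_efficiency(single_gpu_tps: float, multi_gpu_tps: float, num_gpus: int) -> float:
    """Ratio of achieved throughput to ideal linear scaling.

    single_gpu_tps: tokens per second on one GPU (TP1).
    multi_gpu_tps:  tokens per second on num_gpus GPUs, each running its own TP1 replica.
    A value of 1.0 means perfectly linear scaling.
    """
    if single_gpu_tps <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    return multi_gpu_tps / (single_gpu_tps * num_gpus)
```

A result close to 1.0 for eight GPUs is what "nearly linear" means in this context.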

Strong Dell Server Results Using AMD Instinct MI300X Accelerators

Submission ID 4.1-0022: Two Intel Xeon Platinum 8460Y+ processors and eight AMD Instinct MI300X accelerators in the Available category.

In addition to AMD’s own submissions, Dell submitted results using its PowerEdge XE9680 server and LLaMA2-70B, validating the platform-level performance of AMD Instinct accelerators in an 8x AMD Instinct MI300X configuration. This submission reflects the collaboration between the two companies and underscores the strength of the AMD ecosystem, making these accelerators a strong option for both data center and edge inference deployments. Further information on these results is available here.

Engineering Insights Behind the Performance

The AMD Instinct MI300X accelerators deliver highly competitive performance thanks to their high compute power, large memory capacity with high bandwidth, and optimized ROCm software stack, which enables efficient processing of large AI models such as LLaMA2-70B. A few key factors played a pivotal role:

Big GPU Memory Capacity

The AMD Instinct MI300X has the largest GPU memory currently available on the market, which lets the whole LLaMA2-70B model fit in memory while leaving room for the KV cache. Because the model does not need to be split across GPUs, inference throughput is maximized and network overhead is avoided.

Batch Sizes: AMD set the max_num_seqs parameter to 2048 in the offline scenario to maximize throughput, and to 768 in the server scenario to meet the latency requirements. These values are much higher than the vLLM default of 256.
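
The two batch-size settings can be captured in a small per-scenario lookup. In the sketch below, only the max_num_seqs values come from the article; the helper name and the idea of returning a keyword-argument dictionary are assumptions for illustration. In a real deployment these arguments would typically be handed to the vLLM engine, but the full launch configuration AMD used is not given here.

```python
# max_num_seqs values reported in the article for each MLPerf scenario.
MAX_NUM_SEQS = {
    "offline": 2048,  # favors throughput
    "server": 768,    # keeps latency within the limits
}

VLLM_DEFAULT_MAX_NUM_SEQS = 256  # default cited in the article


def engine_kwargs(scenario: str) -> dict:
    """Build engine keyword arguments for a scenario (hypothetical helper)."""
    try:
        return {"max_num_seqs": MAX_NUM_SEQS[scenario]}
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None


for name in ("offline", "server"):
    kwargs = engine_kwargs(name)
    ratio = kwargs["max_num_seqs"] / VLLM_DEFAULT_MAX_NUM_SEQS
    print(f"{name}: {kwargs} ({ratio:.0f}x the default)")
```

The offline setting is 8 times the default and the server setting is 3 times the default, which is only practical when there is enough memory to hold the KV cache for that many concurrent sequences.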

vLLM’s paged attention support enables efficient KV cache management and helps avoid memory fragmentation, making good use of the large memory on AMD Instinct MI300X accelerators.
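
To show why paging helps with fragmentation, here is a toy block allocator in the spirit of paged attention. It is a simplified sketch, not vLLM’s implementation: the class name, the block size of 16 tokens, and the bookkeeping are all assumptions chosen for illustration. Each sequence gets fixed-size blocks on demand, so memory never has to be reserved as one contiguous region per sequence.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy KV cache that hands out fixed-size blocks to sequences on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size                    # tokens per block
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}    # seq_id -> physical block ids
        self.seq_lens: Dict[int, int] = {}              # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        """Reserve space for one more token of the given sequence."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new block only when the last one is full (or none exists yet).
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because every block has the same size, a block released by one finished sequence can be reused by any other sequence, and at most one partially filled block per sequence is wasted.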

FP8 Precision

AMD extended support for the FP8 numerical format across the whole inference software stack, taking advantage of the AMD Instinct MI300X accelerator hardware. Using Quark, AMD quantized the LLaMA2-70B model weights to FP8 while maintaining the 99.9% accuracy required by MLPerf. To further improve performance, AMD added FP8 support to vLLM, improved the hipBLASLt library, and implemented an FP8 KV cache.
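
The general idea behind FP8 weight quantization can be sketched with per-tensor scaling. The code below is a simplified illustration and not the method Quark uses: it scales a tensor so its largest magnitude maps to the FP8 maximum, then rounds each value to 4 significant bits to mimic a 3-bit mantissa. It ignores subnormals and the exponent range limit. The default maximum of 448 is the largest finite value of the OCP E4M3 format; other FP8 variants use a different maximum, so it is left as a parameter.

```python
import math
from typing import List, Sequence, Tuple


def round_to_e4m3_grid(x: float) -> float:
    """Round to 4 significant bits (1 implicit + 3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)  # keep 4 bits of the significand


def quantize_per_tensor(weights: Sequence[float], fp8_max: float = 448.0) -> Tuple[List[float], float]:
    """Return (quantized values on the FP8 grid, scale) for one tensor."""
    amax = max(abs(w) for w in weights)
    scale = amax / fp8_max if amax > 0 else 1.0
    quantized = []
    for w in weights:
        scaled = max(-fp8_max, min(fp8_max, w / scale))  # clamp to the FP8 range
        quantized.append(round_to_e4m3_grid(scaled))
    return quantized, scale


def dequantize(quantized: Sequence[float], scale: float) -> List[float]:
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]
```

With 4 significant bits, the relative rounding error of each in-range value is at most about 6%, which is why accuracy has to be validated after quantization, as the 99.9% MLPerf requirement does.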

Software Enhancements

Kernel Optimization: AMD carried out extensive profiling and optimization, including AMD Composable Kernel (CK) based prefill attention, FP8 decode paged attention, and fused kernels such as residual add RMS Norm and SwiGLU with FP8 output scaling.
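
For readers unfamiliar with the fused operations named above, the following plain Python reference shows what they compute. It is a sketch of the math only, under the usual definitions of RMS Norm and SwiGLU; it is not the Composable Kernel code, it runs on lists rather than GPU tensors, and it leaves out the FP8 output scaling step. A fused kernel produces the same results in a single pass over memory instead of several.

```python
import math
from typing import List, Sequence, Tuple


def residual_add_rms_norm(
    x: Sequence[float],
    residual: Sequence[float],
    weight: Sequence[float],
    eps: float = 1e-6,
) -> Tuple[List[float], List[float]]:
    """Add the residual, then apply RMS normalization.

    Returns (normalized output, updated residual stream).
    """
    hidden = [a + b for a, b in zip(x, residual)]
    rms = math.sqrt(sum(v * v for v in hidden) / len(hidden) + eps)
    normed = [w * v / rms for w, v in zip(weight, hidden)]
    return normed, hidden


def swiglu(gate: Sequence[float], up: Sequence[float]) -> List[float]:
    """SwiGLU activation: SiLU(gate) * up, element by element."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

Done as separate steps, the residual add and the normalization each read and write the full hidden state; fusing them saves that extra memory traffic.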

vLLM Enhancements: The scheduler was improved to optimize both offline and server use cases, allowing for quicker decoding scheduling and better prefill batching.

CPU Enhancement

While GPUs handle most of the AI workload, CPU performance is still important. CPUs with fewer cores and higher peak frequencies, such as the 32-core EPYC 9374F, offer the best performance, particularly in the server scenario. Testing with the upcoming “Turin” generation of EPYC CPUs, which was submitted as a preview, showed performance gains over the 4th Gen EPYC CPUs.

LLaMa 3.1 405B

Establishing a Standard for the Biggest Model

The AMD Instinct MI300X GPU accelerators have proven their performance in MLPerf Inference with LLaMA2-70B, and these results set a strong precedent for their effectiveness with even larger models, such as LLaMa 3.1. AMD provides Day 0 support for Meta’s new LLaMa 3.1 405B parameter model on AMD Instinct MI300X accelerators.

Thanks to the industry-leading memory capacity of the AMD Instinct MI300X platform (MI300-25), only a server powered by eight AMD Instinct MI300X GPU accelerators can fit the entire 405 billion parameter LLaMa 3.1 model on a single server using the FP16 datatype (MI300-7A). This reduces the number of servers required and lowers costs. AMD Instinct MI300X accelerators are an ideal way to power the largest open models on the market right now.
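
The single-server claim follows from the size of the weights. The sketch below computes the minimum number of accelerators needed to hold the FP16 weights of a 405 billion parameter model, assuming 2 bytes per parameter and decimal gigabytes. It counts weights only, so real deployments need extra room for the KV cache and activations. The 80 GB capacity is included purely as a comparison point and is an assumption of this example.

```python
import math


def min_accelerators(num_params: float, bytes_per_param: float, capacity_gb: float) -> int:
    """Smallest number of devices whose combined memory holds the weights."""
    return math.ceil(num_params * bytes_per_param / (capacity_gb * 1e9))


PARAMS_405B = 405e9
FP16_BYTES = 2

weights_gb = PARAMS_405B * FP16_BYTES / 1e9           # 810 GB of weights
mi300x_needed = min_accelerators(PARAMS_405B, FP16_BYTES, 192)
gpu_80gb_needed = min_accelerators(PARAMS_405B, FP16_BYTES, 80)

print(f"FP16 weights: {weights_gb:.0f} GB")
print(f"192 GB devices needed: {mi300x_needed}")      # fits within an 8-GPU server
print(f"80 GB devices needed:  {gpu_80gb_needed}")    # exceeds an 8-GPU server
```

Eight 192 GB accelerators provide 1,536 GB in total, comfortably above the 810 GB of weights, whereas eight 80 GB devices provide 640 GB, which is not enough for the FP16 weights alone.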

Drakshi
Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She was a postgraduate in business administration. She was an enthusiast of Artificial Intelligence.