NVIDIA Speeds up MLPerf Standards Generative AI Training
NVIDIA H100 Tensor Core GPUs broke previous records in the latest round of industry-standard tests, thanks to unmatched scaling and software advances.
The latest MLPerf industry benchmarks demonstrate how NVIDIA's AI technology has raised the bar for AI training and high performance computing.
One record stands out among the many milestones in generative AI: NVIDIA Eos, an AI supercomputer powered by a whopping 10,752 NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking, completed a training benchmark based on a GPT-3 model with 175 billion parameters trained on one billion tokens in just 3.9 minutes.
That is nearly a threefold speedup over 10.9 minutes, the record NVIDIA set when the test was introduced less than six months ago.
The benchmark uses a portion of the full GPT-3 data set behind the popular ChatGPT service. Extrapolating from the result, Eos could now train the full model in just eight days, 73x faster than a prior state-of-the-art system that used 512 A100 GPUs.
Accelerating training time lowers costs, saves energy and speeds time to market. This heavy lifting puts large language models within reach of any organization, which can then deploy them with tools like NVIDIA NeMo, a framework for customizing LLMs.
In a new generative AI test this round, 1,024 NVIDIA Hopper architecture GPUs set a record by completing a training benchmark based on the Stable Diffusion text-to-image model in 2.5 minutes.
With generative AI the most transformative technology of our time, the addition of these two tests further cements MLPerf's position as the industry standard for measuring AI performance.
System Sizing Takes Off
The latest results came in part from using the largest number of accelerators ever applied to an MLPerf benchmark. The 10,752 H100 GPUs far surpassed the 3,584 Hopper GPUs NVIDIA used for AI training in June.
Thanks in part to software optimizations, the 3x increase in GPU count delivered a 2.8x increase in performance, a 93% scaling efficiency.
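The speedup and scaling figures above can be verified with quick arithmetic; all inputs below come straight from the article, and nothing here is measured data.

```python
# Quick arithmetic behind the scaling figures quoted above.
# All inputs come from the article; nothing here is measured data.

prev_time_min = 10.9   # GPT-3 benchmark record set ~6 months earlier
new_time_min = 3.9     # latest Eos result on the same benchmark
speedup = prev_time_min / new_time_min
print(f"speedup: {speedup:.2f}x")  # ~2.79x, i.e. nearly threefold

gpu_scale = 10_752 / 3_584         # 3x more H100 GPUs than in June
perf_scale = 2.8                   # reported performance scaling
efficiency = perf_scale / gpu_scale
print(f"scaling efficiency: {efficiency:.0%}")  # ~93%
```

Scaling efficiency here is simply measured speedup divided by the ideal (linear) speedup one would get if performance grew exactly with GPU count.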
Efficient scaling is a core requirement of generative AI because LLMs are growing by an order of magnitude every year. The latest results show NVIDIA's ability to meet this unprecedented challenge for even the world's largest data centers.
The achievement rests on the full-stack platform of innovations in accelerators, systems and software that both Eos and Microsoft Azure used in the latest round.
Eos and Azure each employed 10,752 H100 GPUs in separate submissions, and the two systems came within 2% of the same performance, demonstrating the efficiency of NVIDIA AI in both data center and public cloud deployments.
NVIDIA relies on Eos for a variety of critical jobs. It helps advance initiatives such as NVIDIA DLSS, AI-powered software for state-of-the-art computer graphics, and NVIDIA Research projects like ChipNeMo, generative AI tools for GPU design.
Progress in All Workloads
Beyond its advances in generative AI, NVIDIA set several new records in this round.
For example, H100 GPUs were 1.6x faster than in earlier rounds at training recommender models, which are widely used to help users find what they're looking for online. On RetinaNet, a computer vision model, performance rose 1.8x.
These gains came from a combination of software advances and scaled-up hardware.
Once again, NVIDIA was the only company to run every MLPerf test. H100 GPUs demonstrated the fastest performance and the greatest scaling across all nine benchmarks.
These speedups translate into lower costs, faster time to market and energy savings for customers training massive LLMs or customizing them with frameworks like NeMo for their own business needs.
This round, eleven system makers used the NVIDIA AI platform in their submissions, including ASUS, Dell Technologies, Fujitsu, GIGABYTE, Lenovo, QCT and Supermicro.
NVIDIA's partners participate in MLPerf because they know it is a valuable tool for customers evaluating AI systems and vendors.
Benchmarks for HPC Expand
In MLPerf HPC, a separate benchmark for AI-assisted simulations on supercomputers, H100 GPUs delivered up to twice the performance of NVIDIA A100 Tensor Core GPUs in the previous HPC round. The results showed gains of up to 16x since the first MLPerf HPC round in 2019.
The benchmark added a new test for training OpenFold, a model that infers a protein's three-dimensional structure from its amino acid sequence. In minutes, OpenFold can perform critical healthcare work that used to take researchers weeks or months.
Since most drugs act on proteins, the cellular machinery that helps control many biological processes, understanding a protein's structure is key to quickly finding effective medicines.
In the MLPerf HPC test, H100 GPUs trained OpenFold in 7.5 minutes. The OpenFold test is a representative part of the full AlphaFold training process, which two years ago took 11 days on 128 accelerators.
A version of the OpenFold model and the training software NVIDIA developed will soon be available in NVIDIA BioNeMo, a generative AI platform for drug discovery.
In this round, a number of partners made submissions on the NVIDIA AI platform. They included Lawrence Berkeley National Laboratory with support from Hewlett Packard Enterprise (HPE), the Texas Advanced Computing Center, Dell Technologies and Clemson University's supercomputing center.
Benchmarks With Wide Support
The MLPerf benchmarks have enjoyed broad backing from industry and academia since their introduction in May 2018. Organizations that support them include Amazon, Arm, Baidu, Google, Harvard, HPE, Intel, Lenovo, Meta, Microsoft, NVIDIA, Stanford University and the University of Toronto.
Because MLPerf tests are transparent and objective, users can rely on the results to make informed buying decisions.
All the software NVIDIA used is available in the MLPerf repository, so developers can reproduce these top-tier results. The software optimizations are continuously folded into containers available on NGC, NVIDIA's software hub for GPU applications.