BFloat16 Enables Application-Transparent TensorFloat32 Emulation on AMD GPUs
When creating machine learning models, most machine learning (ML) engineers use the single-precision (FP32) datatype. TensorFloat32 (TF32), a drop-in replacement for FP32-based models, has recently gained popularity and is seeing wider adoption. However, there is a pressing need to deliver further performance gains for these models by using faster datatypes, such as BFloat16 (BF16), without requiring any code modifications.
At AMD, we have developed a method that allows existing TF32 applications to leverage BFloat16 matrix operations. This is accomplished by automatically converting the weights and activations in the model to BF16 and accumulating the results in FP32.
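Numerically, the scheme behaves like the following sketch. This is illustrative only: tf32_emulated_linear is a hypothetical name, and the real implementation runs hardware BF16 GEMMs that accumulate in FP32, rather than an explicit upcast.

import torch

def tf32_emulated_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Round the FP32 activations and weights to BF16, as the emulation does.
    x_bf16 = x.to(torch.bfloat16)
    w_bf16 = weight.to(torch.bfloat16)
    # Upcasting back to FP32 before the matmul forces FP32 accumulation;
    # the actual kernels instead use BF16 tensor ops that accumulate in
    # FP32 in hardware.
    return torch.matmul(x_bf16.float(), w_bf16.t().float())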
With this approach, an application that already uses the TF32 infrastructure sees the acceleration without any additional code changes. Our source code is available at ROCm Software Platform/pytorch at release/1.13_tf32_medium on github.com. As a first drop, we have focused on Large Language Models (LLMs), accelerating PyTorch Linear layers; later releases will cover other primitives such as convolutions.
Implementation Details
PyTorch supports three precision levels for FP32 models: highest, where the model uses FP32-based GEMMs; high, where the model uses native TF32-based GEMMs if they are available; and medium, where the model uses BF16-based GEMMs, which are not yet implemented in the current version of PyTorch.
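These levels are selected through PyTorch's float32 matmul precision API, for example:

import torch

# Select the precision level used for FP32 matrix multiplications.
torch.set_float32_matmul_precision("highest")  # FP32 GEMMs (default)
torch.set_float32_matmul_precision("high")     # TF32 GEMMs where available
torch.set_float32_matmul_precision("medium")   # BF16-based GEMMs (the level this work builds on)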
To implement our method, which we call TF32-Emulation, we reuse the infrastructure designed for the medium precision level. The present implementation applies TF32-Emulation to PyTorch's Linear layers, chosen for their performance impact; AMD is currently evaluating support for additional operators.
To use this feature, build PyTorch from the source linked above and add the line “torch.backends.cudnn.allow_tf32 = True” at the very beginning of the program’s main file, immediately after the import statements.
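A minimal sketch of the setup follows; the model and tensor shapes are illustrative, and the emulated dispatch assumes the custom build described above.

import torch
import torch.nn as nn

# Enable the TF32 path right after the imports, as described above; on the
# custom build this routes TF32 requests to BF16 GEMMs with FP32 accumulation.
torch.backends.cudnn.allow_tf32 = True

# Illustrative model: the Linear layer's FP32 GEMM is dispatched to the
# emulated path.
model = nn.Linear(4096, 4096, device="cuda")
x = torch.randn(64, 4096, device="cuda")
y = model(x)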
Performance Comparison
We evaluate the TF32-Emulation technique against the default FP32 implementation on a set of machine learning models commonly used across the community.
We see a speedup of up to 1.79x over the default implementation on the Transformer derived from the MLPerf implementation; the relative speedup varies with the proportion of time spent in GEMM operations compared to elementwise/reduction/norm operations.
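Readers who want to reproduce a GEMM-level comparison can use a sketch along these lines; the bench helper and matrix sizes are hypothetical (this is not the MLPerf harness used above), and the medium-level speedup assumes the custom build.

import time
import torch

def bench(fn, iters=100, warmup=10):
    # Simple wall-clock GPU timing helper.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

torch.set_float32_matmul_precision("highest")   # default FP32 GEMMs
t_fp32 = bench(lambda: a @ b)

torch.set_float32_matmul_precision("medium")    # BF16-backed path on the custom build
t_emul = bench(lambda: a @ b)

print(f"GEMM speedup: {t_fp32 / t_emul:.2f}x")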
We observed the expected convergence for TF32-Emulation compared to FP32-based implementations for these models, but we omit the convergence results to keep this report concise.