BFloat16 Enables Application-Transparent TensorFloat32 Emulation on AMD GPUs
When creating machine learning models, most machine learning (ML) engineers use the single-precision (FP32) datatype. TensorFloat32 (TF32), a drop-in replacement for FP32-based models, has recently gained popularity and is seeing wider adoption. However, there is a pressing need to deliver further performance gains for these models by using faster datatypes, such as BFloat16 (BF16), without requiring any code modifications.
At AMD, we have developed a method that allows existing TF32 applications to leverage BFloat16 matrix operations. This is accomplished by automatically converting the weights and activations in the model to BF16 and accumulating the results in FP32.
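Numerically, the scheme behaves like the following sketch. This is illustrative only: tf32_emulated_linear is a hypothetical name, and the real implementation runs hardware BF16 GEMMs that accumulate in FP32, rather than an explicit upcast.

import torch

def tf32_emulated_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Round the FP32 activations and weights to BF16, as the emulation does.
    x_bf16 = x.to(torch.bfloat16)
    w_bf16 = weight.to(torch.bfloat16)
    # Upcasting back to FP32 before the matmul forces FP32 accumulation;
    # the actual kernels instead use BF16 tensor ops that accumulate in
    # FP32 in hardware.
    return torch.matmul(x_bf16.float(), w_bf16.t().float())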
With this approach, an application that already uses the TF32 infrastructure sees the acceleration without any additional code changes. Our source code is available at ROCm Software Platform/pytorch at release/1.13_tf32_medium on github.com. As a first drop, we have focused on Large Language Models (LLMs), accelerating PyTorch Linear layers; later releases will cover other primitives such as convolutions.
Implementation Details
PyTorch supports three precision levels for FP32 models: highest, where the model uses FP32-based GEMMs; high, where the model uses native TF32-based GEMMs if they are available; and medium, where the model uses BF16-based GEMMs, which are not yet implemented in the current version of PyTorch.
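These levels are selected through PyTorch's float32 matmul precision API, for example:

import torch

# Select the precision level used for FP32 matrix multiplications.
torch.set_float32_matmul_precision("highest")  # FP32 GEMMs (default)
torch.set_float32_matmul_precision("high")     # TF32 GEMMs where available
torch.set_float32_matmul_precision("medium")   # BF16-based GEMMs (the level this work builds on)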
To implement our method, which we call TF32-Emulation, we reuse the infrastructure designed for the medium precision level. The present implementation applies TF32-Emulation to PyTorch's Linear layers, chosen for their performance impact; AMD is currently evaluating support for additional operators.
To use this feature, build PyTorch from the source linked above and add the line “torch.backends.cudnn.allow_tf32 = True” at the very beginning of the program’s main file, immediately after the import statements.
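A minimal sketch of the setup follows; the model and tensor shapes are illustrative, and the emulated dispatch assumes the custom build described above.

import torch
import torch.nn as nn

# Enable the TF32 path right after the imports, as described above; on the
# custom build this routes TF32 requests to BF16 GEMMs with FP32 accumulation.
torch.backends.cudnn.allow_tf32 = True

# Illustrative model: the Linear layer's FP32 GEMM is dispatched to the
# emulated path.
model = nn.Linear(4096, 4096, device="cuda")
x = torch.randn(64, 4096, device="cuda")
y = model(x)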
Performance Comparison
We evaluate the TF32-Emulation technique against the default FP32 implementation on a set of machine learning models commonly used across the community.
We see a speedup of up to 1.79x over the default implementation on the Transformer derived from the MLPerf implementation; the relative speedup varies with the proportion of time spent in GEMM operations compared to elementwise/reduction/norm operations.
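Readers who want to reproduce a GEMM-level comparison can use a sketch along these lines; the bench helper and matrix sizes are hypothetical (this is not the MLPerf harness used above), and the medium-level speedup assumes the custom build.

import time
import torch

def bench(fn, iters=100, warmup=10):
    # Simple wall-clock GPU timing helper.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

torch.set_float32_matmul_precision("highest")   # default FP32 GEMMs
t_fp32 = bench(lambda: a @ b)

torch.set_float32_matmul_precision("medium")    # BF16-backed path on the custom build
t_emul = bench(lambda: a @ b)

print(f"GEMM speedup: {t_fp32 / t_emul:.2f}x")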
We observed the expected convergence for TF32-Emulation compared to FP32-based implementations for these models, but we omit the convergence results to keep this report concise.