Saturday, April 12, 2025

Low Precision Computing: Faster AI Without Accuracy Loss

How low-precision computing increases efficiency without sacrificing accuracy: IBM Research experts are shrinking the numbers AI works with and co-designing hardware to make AI computing more efficient.

Low Precision Computing

Now that artificial intelligence has spread from the lab into every aspect of daily life, it’s obvious that we need innovative ways to help it accomplish more with fewer computing resources. Large language models cannot require hundreds of GPUs to run if they are to be commercially viable. Low-precision computing is one option.

Low-precision arithmetic, often known as approximate computing, is ideal for AI applications where calculations don’t need to be extremely precise to yield outputs that are sufficiently accurate. Its benefits include lower latency and lower energy and computing expenses.

Low-precision computing, which has been developed over the last decade or so and has recently matured, is rapidly becoming the industry norm. As its importance and usefulness grow, hardware must be co-designed to support the strategy. It is no accident, then, that low-precision computing is a key component of the architecture of IBM’s family of AIUs, devices built from the ground up to train and deploy AI models as efficiently as possible.

Conventional CPUs work with extreme accuracy, typically using FP32 or FP64 floating-point arithmetic, meaning that 32 or 64 bits are used to represent each number. For computations requiring tight tolerances, such as those in engineering, medicine, or mathematics, this high degree of precision is ideal.

However, this degree of accuracy is frequently excessive for AI. Because AI involves enormous quantities of computation with a high degree of redundancy, the computational requirements for running large language models at FP32 or FP64 precision are enormous. These numbers can instead be quantized, or reduced to a smaller bit width, without compromising accuracy.

Just as a brief glance is all it takes to distinguish a rose from a daisy, or a back-of-the-envelope calculation suffices to work out the tip at a restaurant, 16-, 8-, or even 4-bit precision can offer the right amount of computational accuracy for today’s AI models. In fact, 16- and 8-bit computing are already widely used in production environments and are recognized as industry standards. At the same time, researchers must create algorithms to manage the impact of decreased precision, so that errors don’t compound and compromise model accuracy.
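To see why a single rounding step costs so little, here is a minimal NumPy sketch of symmetric 8-bit quantization (the tensor size and per-tensor scaling scheme are illustrative assumptions, not any particular production recipe): each value moves by at most half a quantization step.

```python
import numpy as np

w = np.random.randn(1024).astype(np.float32)            # "full-precision" weights (toy data)
scale = np.abs(w).max() / 127.0                          # one scale for the whole tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale            # what the model sees at low precision

max_err = np.abs(w - w_dequant).max()
print(f"worst-case error {max_err:.6f}, at most half a step ({scale / 2:.6f})")
```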

As the industry shifts towards low-precision computing, IBM Research scientists have been developing low-precision training methods. They are now working to integrate these methods into the hardware design process as well as into LLM training and inference.

What is Full-precision computing?

According to a notable engineer at IBM Research and an authority on AI optimization, full-precision (or double-precision) computing usually refers to operations carried out with 64-bit representations. Both precision (shrinking the gap between two representable numbers) and range (extending the smallest and largest representable values) grow with the number of bits, so computers can handle large, highly precise figures. A modern computer performing a single piece of arithmetic at 64 bits of accuracy returns its result almost instantly.

However, the computational demands and latency of full-precision computing quickly mount up in AI training and inference, which entail trillions of operations. For instance, an LLaMa3 8b model would need more than 20 trillion operations to generate 1,000 output tokens from an input of 1,000 tokens. That figure grows with context length (the size of inputs and outputs) and with the number of concurrent requests, and many new workloads, including agentic systems, require context lengths of 128,000 tokens or even 1 million tokens.
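For a sense of scale, here is a back-of-the-envelope version of that figure. It assumes the common rule of thumb of roughly two operations per model parameter per token processed, and it ignores the attention cost, which grows with context length:

```python
# Rough operation count, assuming ~2 operations per model parameter per token processed.
params = 8e9                                   # an 8-billion-parameter model
prompt_tokens, output_tokens = 1_000, 1_000
total_ops = 2 * params * (prompt_tokens + output_tokens)
print(f"~{total_ops:.1e} operations")          # ~3e13, consistent with "more than 20 trillion"
```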

Full-precision model weights are also larger than low-precision weights, so each one takes more memory to store. And for every token generated, these weights, which can range from tens to hundreds of gigabytes, must be moved from memory to the compute cores. Energy and silicon area grow roughly quadratically with the number of precision bits: moving from 32 to 64 bits requires computational building blocks four times larger, while each step down cuts energy and silicon requirements by a factor of four. AI researchers are exploring low-precision computing as a way to lighten all of these costs.
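A quick calculation shows what bit width means for memory traffic. The 8-billion-parameter figure is just an illustrative example:

```python
# Approximate weight storage for an 8-billion-parameter model at different bit widths.
params = 8e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB of weights")
# FP32: ~29.8 GiB down to INT4: ~3.7 GiB; every halving of bit width halves what must be
# streamed from memory to the compute cores for each generated token.
```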

8-, 4-, and 2-bit levels are generally considered low precision. IBM Research identifies two primary approaches to low-precision quantization. In the first, AI model weights are reduced to 4 bits while activations remain at 16 bits. In the second, known as FP8, both model weights and activations are reduced to 8 bits.
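Below is a minimal NumPy sketch of the first approach: weights stored at 4 bits with one scale per small group of values, and activations kept at 16 bits. The group size, function names, and symmetric scheme are illustrative assumptions rather than IBM’s recipe, and since plain NumPy has no FP8 type the second approach isn’t shown.

```python
import numpy as np

def quantize_weights_4bit(w, group_size=64):
    """Illustrative weight-only 4-bit quantization with one scale per group of values."""
    flat = w.reshape(-1, group_size)                              # assumes size % group_size == 0
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0        # symmetric int4 range: -7..7
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)   # packed as 4-bit in practice
    return q, scales.astype(np.float16)

def dequantize_4bit(q, scales, shape):
    return (q.astype(np.float16) * scales).reshape(shape)

w = np.random.randn(256, 256).astype(np.float32)                  # full-precision weights
x = np.random.randn(8, 256).astype(np.float16)                    # activations stay at 16 bits
q, s = quantize_weights_4bit(w)
y = x @ dequantize_4bit(q, s, w.shape)                            # the matmul itself runs in FP16
```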

High accuracy, low precision

Low precision suits LLMs because the redundancy of their computations makes them highly tolerant of noise. “Whether you compute in full or reduced precision, the final output of any given operation and the actions taken based on that output are often the same,” says an IBM Research scientist who specializes in the technology used in low-precision computing. “If my calculations use that precision, that bit width for representing a number, you can go from FP64 or FP32 down to FP16.”

To put it more precisely, if the operation a*b is calculated in 64, 32, or 16 bits, the values may differ somewhat, and those differences can be regarded as errors. Whatever the values of a and b, the higher-bit representations will be more accurate and have a wider representable range. In that sense, low-precision computation does not produce the same outcome.

Consider the following illustration of how changing precision can change the results:

  • In 16-bit precision, 1.0 + 2⁻¹⁶ − 1.0 = 0, while in 32- or 64-bit precision the result is 2⁻¹⁶.
  • By contrast, 1.0 − 1.0 + 2⁻¹⁶ = 2⁻¹⁶ in 16-, 32-, and 64-bit precision.
  • Mathematically, addition is associative: (a + b) + c = a + (b + c). That property is not guaranteed in any finite-precision representation of floating-point numbers; it is simply easier to notice at lower precisions.
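The same behavior can be reproduced with NumPy’s half- and single-precision types:

```python
import numpy as np

eps = 2.0 ** -16
# Left-to-right evaluation: (1.0 + 2^-16) - 1.0
print(np.float16(1.0) + np.float16(eps) - np.float16(1.0))  # 0.0: the 2^-16 is lost at FP16
print(np.float32(1.0) + np.float32(eps) - np.float32(1.0))  # 1.5258789e-05, i.e. 2^-16
# Reordered: (1.0 - 1.0) + 2^-16
print(np.float16(1.0) - np.float16(1.0) + np.float16(eps))  # ~1.526e-05 even at FP16
```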

Low precision training

And the results speak for themselves: training in FP8 can be up to 50% faster without compromising quality. “And by the end of the first half of 2025, full-blown adoption is most likely to occur.”

Reducing precision to 16 or 8 bits does introduce errors for floating-point numbers; a smaller bit width simply cannot represent the same range or the same fineness of precision. In AI applications, however, there are ways to compensate for that quantization loss during both model training and inference.

One of these options is mixed-precision operation, in which floating-point multiplication (which makes up more than 90% of operations during training or inference) is carried out at lower precision while accumulation (needed for matrix multiplication) is carried out at higher precision. To improve output accuracy, this is typically combined with pre- or post-scaling of the operands of the matrix multiplication.
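Here is a rough NumPy sketch of that pattern. Plain NumPy has no FP8 type, so FP16 stands in as the low-precision format for the operands, with accumulation in FP32; the per-tensor scaling is illustrative.

```python
import numpy as np

def mixed_precision_matmul(a, b):
    """Operands rounded to FP16, products accumulated in FP32, result rescaled (illustrative)."""
    scale_a = float(np.abs(a).max())                  # pre-scaling so FP16's range is well used
    scale_b = float(np.abs(b).max())
    a16 = (a / scale_a).astype(np.float16)
    b16 = (b / scale_b).astype(np.float16)
    acc = a16.astype(np.float32) @ b16.astype(np.float32)   # higher-precision accumulation
    return acc * (scale_a * scale_b)                  # post-scaling back to the original range

a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
err = np.abs(mixed_precision_matmul(a, b) - a @ b).max()
print(f"max deviation from full FP32: {err:.4f}")     # typically a tiny fraction of the outputs
```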

Quantization-aware training, which compensates for the precision lost during quantization, offers another option. Applied during pre-training or fine-tuning, this technique uses both the model weights and the non-linear functions called activation functions to teach the model the compensation factor it will need to apply when scaling for low precision. Because it happens during training, it prepares the model to correct the outputs it generates during inference.
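A common ingredient here is a “fake quantization” step in the forward pass: weights are quantized and immediately dequantized, so the training loss already reflects the rounding error the deployed low-precision model will incur. The sketch below shows only that forward-pass step (the sizes and bit width are illustrative, and the straight-through estimator usually used to pass gradients through the rounding is omitted); it is not IBM’s training recipe.

```python
import numpy as np

def fake_quantize(x, bits=8):
    """Quantize then immediately dequantize, as inserted into a QAT forward pass (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# During quantization-aware training, each layer sees the quantized version of its own weights,
# so the loss the optimizer minimizes already accounts for low-precision rounding.
w = np.random.randn(512, 512).astype(np.float32)
x = np.random.randn(4, 512).astype(np.float32)
y = x @ fake_quantize(w, bits=4)
```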

Low-precision inference

It would therefore be a major win if you could shrink the silicon footprint of a given operation from 4 square millimeters to 1 and have that circuit draw only 25% of the power. And because low-precision computing requires fewer processing resources, developers can deploy some LLMs directly on laptops and consumer-grade GPUs.

Although most discussions of low-precision AI computing focus on model weights and activation values, a model’s key-value (KV) cache can also benefit from reduced precision. In many production-grade inference servers, the KV cache, which stores intermediate results for the tokens an LLM has already processed, can quickly grow to demand even more memory than the model itself; it can be as much as eight times the model’s size. And because sequence lengths keep increasing, particularly with AI agents that generate and evaluate their own outputs, storing KV caches at low precision can significantly reduce the memory footprint of LLMs.
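A rough calculation shows how quickly the cache grows and what 8-bit storage buys. The model dimensions below (32 layers, 8 key-value heads, a head dimension of 128) are illustrative assumptions for an 8-billion-parameter-class model:

```python
# Rough KV-cache size: 2 (keys and values) * layers * kv_heads * head_dim * tokens * bytes/value.
layers, kv_heads, head_dim = 32, 8, 128     # illustrative model dimensions
context = 128_000                           # tokens cached for one long-context request
for name, bytes_per_value in [("FP16", 2), ("8-bit", 1)]:
    gib = 2 * layers * kv_heads * head_dim * context * bytes_per_value / 2**30
    print(f"{name} KV cache: ~{gib:.1f} GiB per request")
# ~15.6 GiB at FP16 versus ~7.8 GiB at 8 bits, multiplied again by every concurrent request.
```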

Although the discussion so far has focused on floating-point numbers, low-precision AI computing also uses fixed-point representations, or integers. In some situations, low-precision computation involves converting floating-point numbers such as FP16 or FP32 to fixed-point values such as INT8. And AI models trained at full precision can still perform low-precision inference using either fixed-point or floating-point representations.
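As an illustration of that floating-point-to-integer path, the sketch below converts FP32 tensors to INT8 with a simple symmetric per-tensor scale, runs the matrix multiplication in integer arithmetic with 32-bit accumulation, and rescales the result back to floating point; the scheme is purely illustrative.

```python
import numpy as np

def to_int8(x):
    """Symmetric per-tensor conversion of a float tensor to INT8 (illustrative)."""
    scale = float(np.abs(x).max()) / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

x = np.random.randn(16, 256).astype(np.float32)    # activations
w = np.random.randn(256, 64).astype(np.float32)    # weights from a full-precision model
xq, sx = to_int8(x)
wq, sw = to_int8(w)
# Integer matmul with 32-bit accumulation, then rescale back to floating point.
y = (xq.astype(np.int32) @ wq.astype(np.int32)).astype(np.float32) * (sx * sw)
print(np.abs(y - x @ w).max())                     # small relative to outputs of order tens
```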

Low-precision inference can have a drawback when a model’s calculations are so sensitive to accuracy that cutting the bit width changes the useful outcome. “However, there are remedies for it.” You can go back to training or fine-tuning and adjust the model weights to account for it.

Building for low precision

Low-precision computing is changing both how processors are developed and how we use them. With AI in mind, IBM Research is building low-precision hardware from the ground up. Low-precision computing can run on existing processors, but it offers fewer benefits there.

“You want to build hardware that can do lower precision arithmetic if you want to get all the benefits of area reduction and energy reduction.” If the hardware is designed for higher precision, running it at low precision still engages high-precision multipliers, so the benefits of packing more computation into the same area, or of using less energy, never materialize. Even if the calculations don’t need the extra precision, the area and power costs are still incurred.

NorthPole, Spyre, and the analogue chip prototypes are among the hardware in the AIU family that Srinivasan and other IBM Research scientists have been developing to natively support low-precision computing. The Spyre accelerator, for instance, achieves better compute density than other hardware commonly used for model training, more horsepower in the same footprint, by supporting multiplication and accumulation operations in FP16, FP8, INT8, and INT4.

Beyond low precision

Srinivasan notes that although low precision is a significant component of IBM’s AI computing strategy, it is only one aspect of the whole. The AIU family of hardware also benefits from innovations such as on-chip memory, which removes the traditional von Neumann bottleneck between memory and processing that tends to slow AI training and inference.

Another important feature of these devices is their flexibility: they can execute models at various precision levels, depending on how those models were trained. Full-precision and low-precision capabilities can also be integrated on the same chips, so models that aren’t robust to approximation can still run on the same hardware. Researchers are working out which blends of precision will make sense for different future applications.

Low-precision computing may involve even less precision in the future, according to Srivatsa. He and others are working on 2-bit precision, in which the only values are -1, 0, and 1. LLMs rely heavily on multiplication, but this reduction in bit width turns those operations into additions and subtractions, which are computationally inexpensive. Research into how far low precision can be pushed is ongoing, but it is still in its infancy.
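As a toy illustration (not IBM’s 2-bit scheme), a matrix-vector product with weights restricted to -1, 0, and 1 needs no multiplications at all: each output is just a signed sum of activations.

```python
import numpy as np

def ternary_matvec(w_ternary, x):
    """Matrix-vector product with weights in {-1, 0, +1}: only additions and subtractions."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # add where +1, subtract where -1
    return out

w = np.random.choice([-1, 0, 1], size=(64, 256)).astype(np.int8)
x = np.random.randn(256).astype(np.float32)
# Agrees with the ordinary matmul up to floating-point rounding.
print(np.allclose(ternary_matvec(w, x), w.astype(np.float32) @ x))
```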

Drakshi
Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She holds a postgraduate degree in business administration and is an enthusiast of Artificial Intelligence.