Friday, March 28, 2025

Analog In-Memory Computing: The Future Of AI Efficiency

How analog in-memory computing could fuel future AI models.

Nature Computational Science covers a new IBM Research paper on analog in-memory computing. The study is one of three recent papers exploring the future of unconventional compute paradigms in cloud and edge AI.

Analog in-memory computing has simple and focused data flows (right), whereas GPUs have complicated and dispersed data flows (left). (Image credit: IBM)

Artificial intelligence models have grown larger and more powerful in recent years, with some networks now holding 500 billion or even a trillion parameters. On typical computer architectures, shuttling these weights between memory and compute makes inference slow and energy-hungry. Analog in-memory computing removes this constraint, saving time and energy while still delivering strong performance.

In three new papers, IBM Research scientists present a scalable 3D analog in-memory architecture for very large models, phase-change memory hardware for compact edge-sized models, and algorithmic advances that speed up transformer attention.

According to a new IBM Research study, analog in-memory computing chips beat GPUs in key ways for running cutting-edge mixture-of-experts (MoE) models. In work featured on the cover of Nature Computational Science, the team showed that each expert in a layer of an MoE network can be mapped onto a physical tier of 3D non-volatile memory in the brain-inspired architecture of 3D analog in-memory computing chips. Through numerical simulations and benchmarks, they found that this mapping can run MoE models with high throughput and energy efficiency.

That study, together with two other IBM Research papers, shows that in-memory computing can power transformer-based AI models for both edge and enterprise cloud applications. These recent results suggest it is time to take this experimental technology out of the lab.

Layers of expertise

MoE models split neural network layers into smaller pieces. Each smaller layer is an 'expert' that handles a specific subset of the data, and routing layers decide which experts each input is sent to. When the researchers simulated two conventional MoE models in their performance simulation tool, the analog hardware outscored modern GPUs.

In the new study, the scientists used simulated hardware to map MoE network layers onto analog in-memory computing tiles with many vertically stacked tiers, where each tier holding model weights can be accessed independently. The paper depicts the layers as a high-rise office block whose floors host experts who can be called upon as needed.
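
To make the idea of routing tokens to experts concrete, here is a minimal, illustrative sketch in Python. It is not IBM's implementation: the dimensions, the `top_k` choice, and the names are assumptions for illustration. The point is that each expert's weight matrix could conceptually live on its own memory tier, so only the tiers selected by the router need to be activated.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts, top_k = 64, 256, 8, 2   # illustrative sizes

# One weight matrix per "expert" -- conceptually one per 3D memory tier.
expert_weights = [rng.standard_normal((d_model, d_ff)) * 0.02
                  for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_layer(x):
    """x: (tokens, d_model) -> (tokens, d_ff), mixing each token's top-k experts."""
    logits = x @ router                            # routing score for every expert
    top = np.argsort(-logits, axis=1)[:, :top_k]   # chosen experts per token
    gates = np.take_along_axis(logits, top, axis=1)
    gates = np.exp(gates) / np.exp(gates).sum(axis=1, keepdims=True)
    out = np.zeros((x.shape[0], d_ff))
    for t in range(x.shape[0]):
        for slot in range(top_k):
            e = top[t, slot]                       # index of the tier holding this expert
            out[t] += gates[t, slot] * (x[t] @ expert_weights[e])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)   # (4, 256)
```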

Putting expert layers on different tiers is a natural mapping, but the results are what matter: in MoE model simulations, the 3D analog in-memory computing architecture outperformed commercial GPUs in throughput, area efficiency, and energy efficiency. GPUs waste time and energy moving model weights between memory and compute, a problem that does not arise with analog in-memory computing architectures.

Analog AI cores composed of phase-change memory (PCM) cell arrays. (Image credit: IBM)

Transformers at the edge

The team’s second article, an accelerator architecture analysis, was presented as an invited talk at the IEEE International Electron Devices Meeting in December. It showed that phase-change memory (PCM) devices can store model weights in the conductance of a piece of chalcogenide glass, enabling AI inference for edge applications on ultra-low-power devices. When a programming pulse is applied, the glass switches from a crystalline to an amorphous solid, becoming less conductive and thereby changing the values used in matrix-vector multiplication operations.
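
The core operation can be pictured as a crossbar of conductances: weights are stored as device conductances, inputs are applied as voltages, and the column currents realize the dot products in place. Below is a minimal NumPy sketch of that idea under assumed device behavior; the differential-pair encoding of signed weights and the noise level are illustrative assumptions, not measured PCM characteristics.

```python
import numpy as np

rng = np.random.default_rng(0)

def program_crossbar(weights, noise_std=0.02):
    """Encode signed weights as a differential pair of (noisy) conductances."""
    g_pos = np.clip(weights, 0, None)              # positive part -> one PCM device
    g_neg = np.clip(-weights, 0, None)             # negative part -> paired device
    # Programming each PCM cell is imperfect, modeled here as additive noise.
    g_pos = g_pos + rng.normal(0, noise_std, g_pos.shape)
    g_neg = g_neg + rng.normal(0, noise_std, g_neg.shape)
    return g_pos, g_neg

def analog_matvec(g_pos, g_neg, v):
    """Apply input v as voltages; summed column currents give G^T v per polarity."""
    return g_pos.T @ v - g_neg.T @ v

W = rng.standard_normal((128, 64)) * 0.1           # model weights to be stored
x = rng.standard_normal(128)                       # one input activation vector
g_pos, g_neg = program_crossbar(W)
y_analog = analog_matvec(g_pos, g_neg, x)
print(np.allclose(y_analog, W.T @ x, atol=1.0))    # close to the exact result
```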

Phase-change memory is making analog in-memory computing an efficient way to run transformer models. The team's proposed neural processing units demonstrated competitive throughput for transformers, with significant anticipated energy benefits.
For this work, the researchers focused on MobileBERT, a transformer model designed specifically for mobile devices. In their own throughput benchmark, the proposed neural processing unit outperformed a low-cost accelerator currently on the market, and on a MobileBERT inference benchmark it was on par with certain high-end smartphones.

According to Sebastian, this work is a step towards mass-produced, inexpensive analog in-memory computing devices that store all of an AI model's weights on-chip. Such devices could serve as the foundation for microcontrollers that handle AI inference in edge applications, such as cameras and automotive sensors for self-driving cars.

Analog transformers

Finally, the researchers described the first implementation of a transformer architecture on an analog in-memory computing chip, with all matrix-vector multiplication operations involving static model weights executed on-chip. On a benchmark known as the Long Range Arena, which assesses accuracy on long sequences, it came within 2% of the accuracy achieved when all operations are performed in floating point. The findings were published in Nature Machine Intelligence.

Overall, these tests demonstrated that analog in-memory computing can accelerate the attention computation, a significant transformer bottleneck, according to an IBM Research scientist. Attention is essential in transformers but is not easily accelerated in analog: the values involved change with every input, so constantly re-programming the analog devices would be impractical, costing too much energy and wearing out device endurance.

They overcame that obstacle by using a mathematical method known as kernel approximation to perform nonlinear functions on their experimental analog chip. The development is significant because this circuit architecture was previously thought to handle only linear functions. Like the system simulated in the MoE study, the chip's brain-inspired design stores model weights in phase-change memory devices organized in crossbars.

For any AI accelerator, but especially for analog in-memory computing accelerators, attention is a nonlinear function and an extremely unpleasant mathematical procedure to compute. “But this shows it can be done using this trick, and it can also improve the overall system efficiency.”

The kernel approximation technique sidesteps the need for a nonlinear function by projecting inputs into a higher-dimensional space using randomly sampled vectors and then computing dot products in that higher-dimensional space. Kernel approximation is a general technique that can be used in many contexts, but it is particularly effective for systems built on analog in-memory computing.
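
As an illustration of the general idea, here is a small NumPy sketch of random-feature (kernel-approximated) attention in the Performer style. The specific feature map, dimensions, and names are assumptions for illustration, not necessarily the construction used in the paper; the point is that attention reduces to a fixed random projection plus linear maps, the kind of static matrix-vector products an analog crossbar handles well.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 64                          # head dimension, number of random features
omega = rng.standard_normal((d, m))    # fixed random projection (programmed once)

def phi(x):
    """Positive random features whose dot products approximate exp(q.k / sqrt(d))."""
    proj = x @ omega / d**0.25
    return np.exp(proj - (x**2).sum(-1, keepdims=True) / (2 * d**0.5)) / m**0.5

def kernel_attention(Q, K, V):
    qf, kf = phi(Q), phi(K)                        # (n, m) feature maps
    kv = kf.T @ V                                  # (m, d_v) summary of keys/values
    z = 1.0 / (qf @ kf.sum(axis=0) + 1e-6)         # per-query normalization
    return (qf @ kv) * z[:, None]                  # approximate attention output

n = 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(kernel_attention(Q, K, V).shape)             # (16, 32)
```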
