Friday, March 28, 2025

The Von Neumann Bottleneck: Solutions & Challenges

Why the potential of AI computing is being hindered by a design decision that was made decades ago.

Bottleneck of von Neumann Architecture

The von Neumann architecture, which separates computation from memory, is the foundation of most modern computers. For traditional computing, this configuration has worked well; in AI computing, however, it creates a data traffic jam.

AI computing is known for using enormous amounts of energy. The sheer amount of data being handled is partly to blame: training a model with billions of parameters frequently calls for billions or trillions of bits of data. But it also has to do with how most computer chips are built.

Modern processors complete the discrete calculations they are typically asked to perform with remarkable efficiency. They are designed to switch swiftly to unrelated work when they have to wait for data to move between memory and compute, but their efficiency still suffers. In AI computing, however, nearly every operation is interconnected, so when the processor stalls waiting for data, there is often little other work it can do, according to IBM Research scientist Geoffrey Burr.

That situation is the von Neumann bottleneck: data moves more slowly than computation. It is the result of the von Neumann architecture, which has been present in practically every processor for the past 60 years: separate memory and processing units connected by a bus. The benefits of this configuration include flexibility, adaptability to changing workloads, and the ease of scaling systems and upgrading components. That is why the design is excellent for traditional computing and isn't going away anytime soon.
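
As a rough illustration of that stall (a hypothetical toy model, not any specific IBM design), the sketch below mimics a processor that must pull every operand across a shared bus before it can compute, so total time ends up dominated by transfers rather than arithmetic.

```python
# Toy model of the von Neumann bottleneck: compute must wait on a shared bus.
# The timing constants below are illustrative assumptions, not measurements.

BUS_CYCLES_PER_WORD = 10   # cycles to move one operand over the bus (assumed)
ALU_CYCLES_PER_OP = 1      # cycles to actually compute on it (assumed)

def run_workload(num_ops: int, operands_per_op: int = 2) -> dict:
    """Estimate how many cycles are spent moving data vs. computing."""
    transfer = num_ops * operands_per_op * BUS_CYCLES_PER_WORD
    compute = num_ops * ALU_CYCLES_PER_OP
    return {
        "transfer_cycles": transfer,
        "compute_cycles": compute,
        "utilization": compute / (compute + transfer),
    }

print(run_workload(1_000_000))
# With these assumptions the arithmetic unit is busy less than 5% of the
# time; the rest is spent waiting on the memory bus.
```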

In AI computing, however, where the operations are simple, numerous, and extremely predictable, a traditional processor ends up running well below its capacity while it waits for model weights to be shuttled back and forth from memory. IBM Research scientists and engineers are developing new processors, such as the AIU family, that use a variety of techniques to overcome the von Neumann bottleneck and accelerate AI computation.

Von Neumann bottleneck explained

Why does the von Neumann bottleneck exist?

The von Neumann bottleneck is named after the mathematician and physicist John von Neumann, who circulated a draft of his stored-program computer concept in 1945. In that work, he presented a computer with input/output mechanisms, external storage, a processing unit, a control unit, and memory that held both instructions and data. He was advising the US Army at the time, and his description avoided naming any specific hardware, which helped sidestep security clearance concerns. Von Neumann architecture is no exception to the rule that no scientific breakthrough belongs to a single person, but it has been the standard ever since that work was published.

The primary advantage of the von Neumann architecture is its great degree of flexibility. “It was initially adopted for that reason, and it remains the most prominent architecture to this day.”

You can design memory and computing units independently and set them up pretty much any way you choose. Because the best components can be chosen and combined for each application, this has historically made designing computer systems easier.

Even cache memory, which sits on the same chip as the processor, can still be designed and upgraded separately. “They are still not together. It gives designers some leeway when creating the cache independently of the CPU.”

How the von Neumann bottleneck reduces efficiency

The von Neumann bottleneck poses a dual efficiency challenge for AI computing: how many model parameters (or weights) must be moved, and how far they must move. More model weights mean larger storage, which typically means more distant storage. And because there are so many weights, you can't afford to keep them on hand for long, so you have to keep throwing them away and reloading them.

Data transfers that move model weights from memory to compute account for the majority of AI runtime energy consumption; comparatively little energy goes to the calculations themselves. Practically all of the operations in deep learning models, for instance, are fairly straightforward matrix-vector multiplications. Compute energy is not insignificant, still accounting for roughly 10% of contemporary AI workloads, but unlike in traditional workloads, it no longer dominates latency and energy usage.
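
To make that concrete, the sketch below counts the multiply-accumulate operations versus the bytes that must be fetched for a single dense layer; the layer dimensions and fp16 storage are illustrative assumptions, not figures from the article.

```python
import numpy as np

# A dense layer in a neural network is essentially y = W @ x: one
# multiply-accumulate per weight, but every weight must first be
# fetched from memory. Sizes below are illustrative assumptions.
out_features, in_features = 4096, 4096
W = np.random.randn(out_features, in_features).astype(np.float16)
x = np.random.randn(in_features).astype(np.float16)

y = W @ x  # the actual computation: simple and highly predictable

macs = out_features * in_features             # multiply-accumulates performed
bytes_moved = W.nbytes + x.nbytes + y.nbytes  # weights + activations fetched/stored

print(f"{macs:,} MACs vs {bytes_moved:,} bytes moved "
      f"(~{bytes_moved / macs:.1f} bytes of traffic per MAC)")
# Roughly 2 bytes of weight traffic per MAC at fp16: the arithmetic is
# cheap, but every operand has to cross the memory bus once.
```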

Ten years ago the von Neumann bottleneck wasn't a major problem, because processors and memory were less efficient relative to the energy required to move data. Over time, though, processing and memory have improved faster than data transfer efficiency, so processors can now perform calculations considerably more quickly, and they are left idle while data squeezes through the von Neumann bottleneck.

The farther memory sits from the CPU, the more energy it takes to move data between them. By basic physics, a copper wire is charged to transmit a 1 and discharged to transmit a 0, and the longer the wire, the more energy it takes to charge and discharge it. Longer wires also add latency, because the charge takes longer to propagate or dissipate.
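
A rough back-of-the-envelope version of that physics: the switching energy of a wire scales with its capacitance (roughly proportional to its length) times the square of the supply voltage. The per-millimetre capacitance and voltage below are assumed textbook-style values, not numbers from the article.

```python
# Back-of-the-envelope wire switching energy: E ~= C * V^2 per full
# charge/discharge cycle, with capacitance growing with wire length.
# All constants are illustrative assumptions.

CAP_PER_MM = 0.2e-12   # ~0.2 pF of wire capacitance per millimetre (assumed)
VDD = 1.0              # supply voltage in volts (assumed)

def bit_energy_joules(wire_length_mm: float) -> float:
    """Energy to charge and discharge a wire of the given length once."""
    capacitance = CAP_PER_MM * wire_length_mm
    return capacitance * VDD ** 2

for length_mm in (1, 10, 100):   # on-chip, across-package, across-board scales
    print(f"{length_mm:>4} mm wire: ~{bit_energy_joules(length_mm) * 1e12:.1f} pJ per bit")
# Energy grows linearly with distance, which is why fetching weights from
# far-away memory costs so much more than computing on them locally.
```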

Although each individual transfer carries only a small time and energy cost, you may have to load billions of weights from memory every time you propagate data through a large language model. Because a single GPU doesn't have enough memory to hold them all, this can mean pulling weights from the DRAM of one or more additional GPUs. The processor completes its calculations only after fetching them, then transfers the result to another memory location for further processing.
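
To put hypothetical numbers on that, the short calculation below estimates the raw weight traffic for one pass through a large model, assuming (purely for illustration) 70 billion parameters stored at 2 bytes each and an accelerator with 80 GB of memory.

```python
# Illustrative arithmetic only -- the model size and GPU capacity are assumptions.
params = 70e9            # assumed parameter count
bytes_per_param = 2      # fp16/bf16 storage
gpu_memory_gb = 80       # assumed memory of a single accelerator

weight_bytes = params * bytes_per_param
print(f"Weights alone: {weight_bytes / 1e9:.0f} GB")            # ~140 GB
print(f"GPUs needed just to hold them: {weight_bytes / (gpu_memory_gb * 1e9):.1f}")
# Every pass through the model touches essentially all of those weights,
# so the bottleneck is reloading ~140 GB of data, not the arithmetic itself.
```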

Short of eliminating the von Neumann bottleneck entirely, one solution is to close that distance, and the industry as a whole is working to improve data locality. One such method, a polymer optical waveguide for co-packaged optics, was recently announced by IBM Research scientists. By bringing fibre-optic speed and bandwidth density to the edge of chips, this module greatly improves connectivity and drastically lowers the time and energy required for model training.

But because of all these data exchanges, training an LLM can take months on current hardware, using more energy than a typical US home consumes over the same period. And the energy demand doesn't end once a model is trained: inferencing has comparable processing needs, so it is slowed by the von Neumann bottleneck in the same way.

Avoiding the bottleneck

Generally speaking, AI computing is memory-centric rather than compute-heavy, and model weights are stationary. All you have to do is send information through a predetermined set of synaptic weights.

Because of this characteristic, Burr and his colleagues have been able to use the laws of physics to store weights, pursuing analogue in-memory computing, which merges memory and computation. Among these methods is phase-change memory (PCM), which uses an electrical current to alter the resistivity of a chalcogenide glass in order to store model weights.
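
A simplified sketch of the idea (conceptual only, not IBM's actual device physics): if each weight is stored as a conductance in a crossbar array, applying input voltages along the rows produces output currents that, by Ohm's and Kirchhoff's laws, are exactly the matrix-vector product, so the multiplication happens where the weights live.

```python
import numpy as np

# Conceptual model of analogue in-memory computing in a crossbar array.
# The weight-to-conductance mapping and sizes are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # trained weights (hypothetical layer)
g_scale = 1e-6                    # assumed weight-to-conductance scale (siemens)
G = W * g_scale                   # conductances stored in the PCM devices

v_in = rng.standard_normal(8)     # input activations applied as row voltages

# Ohm's law per device (I = G * V) plus Kirchhoff's current law summing each
# output line means the collected currents *are* the matrix-vector product:
i_out = G @ v_in

# Same result a digital processor would compute -- but no weights were moved.
assert np.allclose(i_out / g_scale, W @ v_in)
print(i_out)
```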

Doing so reduces the energy spent on data transfers and eases the von Neumann bottleneck. In-memory computing is not the only way around it, though.

The AIU NorthPole processor, an extreme example of near-memory computing, stores its memory in digital SRAM. Its memory isn't merged with computation the way an analogue processor's is, but each of its many cores has access to its own local memory.
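
To illustrate the near-memory idea in the abstract (a conceptual sketch, not NorthPole's actual programming model): if each core permanently holds its own slice of the weights in local memory, the chip only ever has to move the small activation vector between cores, never the weights themselves.

```python
import numpy as np

# Conceptual near-memory computing: weights are partitioned across cores,
# each core computes on the slice it stores locally, and only the small
# activation/result vectors travel between cores. Sizes are illustrative.
NUM_CORES = 16
rng = np.random.default_rng(1)
W = rng.standard_normal((4096, 1024))
x = rng.standard_normal(1024)

# Each "core" holds one horizontal stripe of W in its local SRAM.
local_weights = np.array_split(W, NUM_CORES, axis=0)

# Broadcasting x (a few KB) to every core replaces shipping W (many MB)
# back and forth across a shared bus on every inference.
partial_results = [Wi @ x for Wi in local_weights]
y = np.concatenate(partial_results)

assert np.allclose(y, W @ x)
print(y.shape)  # (4096,)
```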

The strength and potential of this architecture have already been shown in experiments. NorthPole outperformed the next most energy-efficient GPU by 47 times and the next lowest latency GPU by 73 times in recent inference tests conducted on a 3-billion-parameter LLM derived from IBM’s Granite-8B-Code-Base model.

It is also worth noting that models trained on von Neumann hardware can be deployed on non-von Neumann devices; for analogue in-memory computing, this is actually a necessity. PCM devices are not robust enough to have their weights rewritten over and over, so models trained on traditional GPUs are deployed onto them. Since SRAM memory can be rewritten indefinitely, durability is a comparative advantage of SRAM-based near- or in-memory computing.

The reasons behind the persistence of von Neumann computing

While the von Neumann architecture poses a bottleneck for AI model training and inference, it remains ideal for other applications, such as processing computer graphics and other compute-intensive tasks. Furthermore, in-memory computing's low precision is inadequate where 32- or 64-bit floating-point precision is required.

The von Neumann architecture is the most powerful architecture available for general-purpose computing. In these situations, bytes travelling from memory to a processor over a bus are either operands or operations. “It's similar to an all-purpose deli where customers may order pepperoni, salami, or this or that, but you can easily make six sandwiches at once because you have the right ingredients on hand.” Special-purpose computing, such as AI computing with its static model weights, is more like a single order for 5,000 tuna sandwiches.

Even when developing their in-memory AIU processors, IBM researchers incorporate some traditional hardware for the operations that require high precision.

Experts agree that both kinds of hardware architecture will likely be used in the future, even as researchers and engineers look for novel ways to eliminate the von Neumann bottleneck. “It makes sense to use a combination of von Neumann and non-von Neumann processors to handle the tasks that each performs best.”
