IBM Research created “Bamba” by crossing a transformer with a state-space model (SSM). Developed in partnership with CMU, Princeton, and the University of Illinois, the open-source LLM combines the runtime speed of an SSM with the expressive power of a transformer. Several of its key features will soon appear in IBM Granite 4.0.
IBM SSM
The transformer architecture that powers today’s massive language models has demonstrated an amazing capacity to produce human-like text. Much of that power comes from its self-attention mechanism, which lets the model consider every word in the input sequence when producing a response.
But the longer the conversation, the bigger the problem. Because the model holds the running sequence in memory as it responds, the cumulative cost of generation grows quadratically with sequence length: if the context window doubles in size, the cost of processing the context and producing a response doesn’t just double, it quadruples.
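As a rough, hypothetical illustration of that scaling (not code from IBM), the sketch below counts attention score computations as the context grows:

```python
# Back-of-the-envelope sketch (not IBM's code): in self-attention, each of the
# n tokens attends to all n tokens, so score computations grow with n * n.

def attention_score_count(n_tokens: int) -> int:
    """Number of query-key score computations for one full attention pass."""
    return n_tokens * n_tokens

for n in (4_000, 8_000, 16_000):
    print(f"{n:>6} tokens -> {attention_score_count(n):,} scores")
# Doubling the context length quadruples the count.
```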
This “quadratic bottleneck” is often what causes the annoying delay between posing a query to the model and receiving a response. It also generates a great deal of redundant computation. By 2022, when ChatGPT made the transformer widely known, researchers were already searching for alternative architectures.
Two potential solutions have been identified
State-space models (SSMs), and transformers interleaved with SSM layers. IBM Research’s first hybrid experiment, Bamba, was recently released publicly: a model that can parse long sequences as deftly as a transformer and run as fast as an SSM. Several of Bamba’s improvements are being incorporated into IBM’s next-generation Granite 4.0 models, which will be available in a few months.
By drastically lowering the memory requirements of the transformer’s KV (key-value) cache, Bamba-9B has shown it can run at least twice as fast as transformers of comparable size while maintaining accuracy. According to the IBM researcher spearheading the project, it all boils down to KV cache reduction: longer context length, lower latency, and higher throughput.
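For a sense of why the KV cache matters, here is a generic size estimate; the layer and head counts below are illustrative assumptions, not Bamba’s published configuration:

```python
# Generic KV-cache size estimate; n_layers, n_kv_heads, head_dim, and the
# 16-bit values are illustrative assumptions, not Bamba's actual configuration.

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Two tensors (keys and values) per layer, each holding
    # seq_len * n_kv_heads * head_dim values.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

for seq_len in (4_000, 32_000):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 1e9:.1f} GB")
# The cache grows linearly with context length and must stay in GPU memory
# for the whole conversation, which is what the hybrid design tries to shrink.
```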
State-space models may be the most significant models you’ve never heard of. They have been used to model dynamic systems for decades, yet they don’t come close to having the same name recognition as transformers.
They are essential to robotics, control theory, signal processing, and electrical engineering, and are used to analyze time-series data in nearly any field. An IBM researcher has been instrumental in adapting SSMs for deep learning.
The mathematical formulas upon which SSMs are based can describe the weather, the stock market, or even electrical activity in the brain. From a series of observations, an SSM infers a “hidden state” of fixed size that captures the system’s key characteristics. Think of the state as a summary of the past: as new information arrives, the hidden state is updated to make predictions about the future, without growing in size.
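In its simplest linear form, that update can be written as a tiny recurrence. The sketch below is a generic toy example with made-up matrices, not S4 or Mamba itself:

```python
# A minimal, generic linear state-space recurrence (illustrative only):
# the hidden state h summarizes the past and stays a fixed size as inputs arrive.
import numpy as np

def ssm_step(h, x, A, B, C):
    """One update: fold observation x into the state, then read out a prediction."""
    h = A @ h + B * x      # update the fixed-size hidden state
    y = C @ h              # predict the output from the state alone
    return h, y

state_dim = 4
A = np.eye(state_dim) * 0.9          # assumed toy dynamics
B = np.ones(state_dim)               # assumed input projection
C = np.ones(state_dim) / state_dim   # assumed readout
h = np.zeros(state_dim)

for x in [1.0, 0.5, -0.2]:           # a short stream of observations
    h, y = ssm_step(h, x, A, B, C)
    print(y)                         # the state never grows, no matter how long the stream
```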
SSMs crossed over into neural networks in 2021, when Stanford researcher Albert Gu and his colleagues published S4, an SSM that applied state variables to language. Like the transformer and the recurrent neural networks (RNNs) that came before it, the SSM was good at processing sequences of words. But it could handle long sequences far faster than transformers and more capably than RNNs.
An SSM keeps a compressed hidden state that summarizes what came before, whereas a transformer attends to every word in the context window when generating a response. That selective retention of information means faster inference and less memory overhead.
S4 was challenging to implement, but it caused a stir when it posted unexpectedly strong results on Long Range Arena, a benchmark that evaluates how well language models handle long sequences. Gupta, an AI resident at IBM, then helped Gu and his team simplify the model using diagonal state spaces; their “diagonal” SSM cut S4’s roughly 1,000 lines of code to about 10. Gupta later helped introduce a gating mechanism that filtered out extraneous information, allowing SSMs to match transformers’ “expressivity,” or sequence-modeling ability, for the first time.
That team also unveiled what may have been the first hybrid transformer. For Gupta, who now works on IBM’s Granite Vision models, exploring hybrids was a natural step: conventional attention blocks could handle text with local dependencies, while SSMs provided longer-range contextualization.
The 2023 release of Mamba2, a gated SSM variant unveiled by Tri Dao at Princeton and Gu, by then a professor at CMU, set off a wave of hybrids with names like Samba and MambaFormer. Nvidia confirmed last year that these new hybrids could significantly speed up inferencing while outperforming either architecture alone, and released its own hybrid, Nemotron-H.
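One way to picture such a hybrid is as a stack that is mostly SSM blocks with occasional attention blocks. The ratio and placement below are made-up illustrations, not the actual layout of Bamba, Nemotron-H, or Granite 4.0:

```python
# Hypothetical sketch of how a hybrid stack might interleave layer types.
# "Every 8th layer is attention" is an illustrative assumption, not IBM's recipe.

def hybrid_layer_plan(n_layers: int = 32, attention_every: int = 8) -> list[str]:
    """Return a layer-type list: mostly SSM (Mamba-style) blocks,
    with full self-attention blocks inserted at a fixed interval."""
    plan = []
    for i in range(n_layers):
        if (i + 1) % attention_every == 0:
            plan.append("attention")   # exact token-to-token mixing for local dependencies
        else:
            plan.append("ssm")         # fast, constant-memory mixing for long-range context
    return plan

print(hybrid_layer_plan())
```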
Overcoming the bottleneck in the KV cache
Efficiency has been the foundation of IBM Research’s Granite LLMs for enterprise since the beginning. As Granite grew in size and capability, researchers zeroed in on the quadratic bottleneck. After validating Nvidia’s results internally, IBM researchers set out to build their own hybrid, Bamba-9B.
Together, they settled on a hybrid built around the Mamba2 architecture and made almost all of Bamba’s components open source: the data, the training recipes, IBM’s data loader for large-scale distributed training, and a quantisation framework for reducing storage and inferencing costs.
Bamba was initially trained on 2 trillion tokens (words and word fragments). Encouraged by the results, the team trained it on an additional trillion tokens, then quantised the model from Mamba2’s 16-bit floating-point precision down to 8 bits, shrinking it from 18 GB to 9 GB. Ganti credits Bamba’s design and superior training data for benchmark performance that matches Meta’s Llama-3.1 8B, a model trained on seven times as much data.
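The reported size reduction lines up with simple arithmetic on roughly nine billion weights (an approximation, since the exact parameter count and data types aren’t detailed here):

```python
# Back-of-the-envelope check of the size reduction from quantisation.
# Assumes ~9 billion weights; 16-bit = 2 bytes/weight, 8-bit = 1 byte/weight.

n_params = 9e9
for bits in (16, 8):
    gigabytes = n_params * (bits / 8) / 1e9
    print(f"{bits}-bit weights: ~{gigabytes:.0f} GB")  # ~18 GB vs ~9 GB, as in the article
```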
Their next challenge was optimizing vLLM to run SSMs. The Bamba team worked closely with Red Hat to integrate the model into the “virtual” LLM, which has become the preferred open-source inference server for large language models (LLMs). Supporting SSMs is challenging because they require customised state management. When Bamba was published toward the end of last year, Ganti invited the public to help improve it, writing in Bamba’s Hugging Face introduction: “Let’s work together to overcome the KV-cache bottleneck.”
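Once a model is supported in vLLM, running it offline takes only a few lines. In this sketch the Hugging Face model ID is an assumption; check Bamba’s model card for the exact name:

```python
# Minimal vLLM offline-inference sketch; the repo ID below is an assumption,
# not confirmed by the article -- see the Bamba model card on Hugging Face.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-ai-platform/Bamba-9B")            # assumed Hugging Face repo ID
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is a state-space model?"], params)
print(outputs[0].outputs[0].text)
```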
Although it was trained on 4,000-token sequences, Bamba can handle 32,000-token conversations. Ganti said that as vLLM adds fuller support for SSMs, the model could reach one million tokens or more and run up to five times faster than a transformer.