Saturday, March 29, 2025

Feed Forward Network Fusion: A Development In LLM Inference

Executive Synopsis

FFN Fusion is an architectural optimization technique for Large Language Models (LLMs) that parallelizes sequences of Feed Forward Network (FFN) layers, dramatically lowering inference latency and per-token cost. The central finding is that consecutive FFN layers exhibit surprisingly weak inter-layer dependencies once specific attention layers are removed (typically with methods such as Puzzle). These layers can then be merged into a single, wider FFN layer whose computation can be executed in parallel across GPUs.

The authors demonstrate the efficacy of FFN Fusion by developing Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base) from Llama-3.1-405B-Instruct. They achieve a 1.71x speedup in inference latency and a 35x reduction in per-token cost while maintaining high performance on several benchmarks. According to the study, FFN Fusion can be used in conjunction with other optimization techniques and becomes more effective at larger model scales. Interestingly, preliminary experiments suggest that even entire transformer blocks may be parallelized in some circumstances.

Key Concepts and Themes

Rethinking Sequential Computation

The growing size of LLMs results in high computing requirements, especially during inference, which restricts their usability. There are drawbacks to conventional optimization methods like quantization and pruning, and Mixture-of-Experts (MoE) architectures might not be the best option for all batch sizes. This calls for investigating complementary efficiency enhancements.

FFN Fusion Concept

Although these models push the limits of artificial intelligence, their deployment costs and resource requirements sharply restrict their accessibility, creating a pressing need for techniques that make their capabilities more widely available.

The core idea behind Feed Forward Network (FFN) Fusion is to identify sequences of FFN layers that emerge once attention layers are removed, a step often enabled by the Puzzle neural architecture search framework. These consecutive FFNs can be mathematically combined into a single, wider FFN layer. The change turns what was a sequential computation into a parallel one, lowering synchronization overhead and improving hardware utilization.

The main finding is that Feed Forward Network (FFN) layer sequences, especially those that remain after certain attention layers are eliminated, can frequently be parallelized with little effect on accuracy. The authors develop a methodical approach to identify and fuse these sequences, turning them into parallel computations that greatly lower inference latency without compromising model behavior.

By recognizing and exploiting patterns of computational independence among FFN layers, the method permits parallel execution across several GPUs while preserving model behavior. This parallelization is especially effective on modern multi-GPU nodes, where tensor-parallel implementations incur synchronization delays between successive layers. Concentrating the computation into fewer, wider layers reduces cross-device communication and markedly improves hardware utilization.
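
To make the idea concrete, here is a minimal PyTorch sketch (illustrative only, not the paper's code) contrasting the standard sequential residual stream with the parallel summation that FFN Fusion relies on; `ffn_blocks` stands for any list of FFN modules that return their residual contribution.

```python
import torch

def sequential_ffn_blocks(x: torch.Tensor, ffn_blocks) -> torch.Tensor:
    # Standard residual stream: each FFN reads the output of the previous one.
    for ffn in ffn_blocks:
        x = x + ffn(x)
    return x

def parallel_ffn_blocks(x: torch.Tensor, ffn_blocks) -> torch.Tensor:
    # FFN Fusion's approximation: every FFN reads the same input and the
    # contributions are summed once, so the blocks can execute concurrently.
    return x + sum(ffn(x) for ffn in ffn_blocks)
```

When the inter-layer dependencies are weak, the two functions produce nearly identical outputs, which is exactly the condition the dependency analysis described below is designed to detect.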

Theorem of FFN Equivalence

The paper formally establishes that the sum of several FFN layers applied to the same input is mathematically equivalent to a single, wider FFN layer whose weight matrices are obtained by concatenation. This equivalence gives the FFN Fusion process its theoretical foundation.

The weights of $\mathrm{FFN}_i$ are $W_1^i$, $W_2^i$, and $W_3^i$.

For $n \in \mathbb{N}$, let $\mathrm{FFN}_1, \ldots, \mathrm{FFN}_n$ be a sequence of FFN functions. Their sum $\sum_{i=1}^{n} \mathrm{FFN}_i(x)$ is itself an FFN, with weight matrices given by:

$W_1^* = \big[\,(W_1^1)^\top \;\cdots\; (W_1^n)^\top\,\big]^\top$

$W_2^* = \big[\,W_2^1 \;\cdots\; W_2^n\,\big]$

$W_3^* = \big[\,(W_3^1)^\top \;\cdots\; (W_3^n)^\top\,\big]^\top$

The theorem is stated for the simple case in which the hidden dimension $d_h$ is the same for all of the FFNs.
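
As a sanity check, the equivalence can be verified numerically. The sketch below assumes the Llama-style gated FFN, $\mathrm{FFN}(x) = W_2\,(\mathrm{SiLU}(W_1 x) \odot W_3 x)$, and uses illustrative shapes; it is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ffn(x, w1, w2, w3):
    # Gated (SwiGLU-style) FFN: W2 (SiLU(W1 x) * (W3 x)).
    return (F.silu(x @ w1.T) * (x @ w3.T)) @ w2.T

def fuse_ffn_weights(ws1, ws2, ws3):
    # Concatenate exactly as in the theorem: stack W1/W3 along the hidden
    # dimension, concatenate W2 along its input (hidden) dimension.
    w1_star = torch.cat(ws1, dim=0)   # (n * d_h, d_model)
    w3_star = torch.cat(ws3, dim=0)   # (n * d_h, d_model)
    w2_star = torch.cat(ws2, dim=1)   # (d_model, n * d_h)
    return w1_star, w2_star, w3_star

# The fused FFN matches the sum of the individual FFNs on the same input.
d_model, d_h, n = 16, 32, 3
ws1 = [torch.randn(d_h, d_model) for _ in range(n)]
ws2 = [torch.randn(d_model, d_h) for _ in range(n)]
ws3 = [torch.randn(d_h, d_model) for _ in range(n)]
x = torch.randn(4, d_model)

w1s, w2s, w3s = fuse_ffn_weights(ws1, ws2, ws3)
fused = ffn(x, w1s, w2s, w3s)
summed = sum(ffn(x, a, b, c) for a, b, c in zip(ws1, ws2, ws3))
assert torch.allclose(fused, summed, rtol=1e-4, atol=1e-4)
```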

Efficiency Gains

By reducing the number of layers, and, when paired with pruning, the number of parameters, FFN Fusion can shrink the memory footprint, dramatically reduce inference latency, and cut per-token cost. The fused layers also scale more effectively with tensor parallelism. “Ultra-253B-Base achieves a reduced memory footprint with half the attention layers and 253B parameters (down from 405B), a 1.71× speedup in user latency, and a 35× lower per-token cost at batch size 32.”
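
A rough sketch of why fusion helps tensor parallelism (simulated on one device; a real deployment would use distributed primitives such as an all-reduce): each shard holds a slice of the fused hidden dimension, and only one combine step is needed per fused layer instead of one per original FFN.

```python
import torch
import torch.nn.functional as F

def tensor_parallel_fused_ffn(x, w1_shards, w2_shards, w3_shards):
    # Megatron-style split of a fused FFN: every "GPU" owns a slice of the
    # wider hidden dimension, computes a partial output locally, and the
    # partials are summed (standing in for a single all-reduce).
    partials = [
        (F.silu(x @ w1.T) * (x @ w3.T)) @ w2.T
        for w1, w2, w3 in zip(w1_shards, w2_shards, w3_shards)
    ]
    return torch.stack(partials).sum(dim=0)
```

Because several former FFN layers now share this single synchronization point, the cross-device communication that previously occurred between each pair of consecutive layers is amortized away.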

Pairwise Block Dependency Analysis

The authors measure the dependency between transformer blocks by calculating the cosine distance between a block’s contribution in the full model and its contribution after another block is removed. This analysis identifies block sequences with low interdependency that are suitable for parallelization, particularly sequences consisting solely of Feed Forward Network (FFN) layers.

A small cosine distance therefore suggests relative independence, since removing block $i$ has little influence on block $j$, a property that can be exploited for parallel computation. Conversely, a large cosine distance indicates a substantial dependency, suggesting that sequential processing matters more for sustaining performance.
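
A rough sketch of how such a pairwise dependency matrix could be computed (the paper aggregates over many tokens and uses its own exact definitions, so the normalization here is an assumption):

```python
import torch
import torch.nn.functional as F

def block_contributions(blocks, x, skip=None):
    # Run the residual stream, recording each block's additive contribution;
    # optionally skip one block to see how downstream contributions change.
    contributions, h = {}, x
    for j, block in enumerate(blocks):
        if j == skip:
            continue
        delta = block(h)
        contributions[j] = delta
        h = h + delta
    return contributions

def pairwise_dependency(blocks, x):
    # dep[i, j] (i < j): cosine distance between block j's contribution with
    # and without block i present. Small values mark parallelization candidates.
    n = len(blocks)
    base = block_contributions(blocks, x)
    dep = torch.zeros(n, n)
    for i in range(n):
        ablated = block_contributions(blocks, x, skip=i)
        for j in range(i + 1, n):
            dep[i, j] = 1 - F.cosine_similarity(
                base[j].flatten(), ablated[j].flatten(), dim=0)
    return dep
```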

Ultra-253B-Base Model

The study demonstrates the practicality of FFN Fusion by building Ultra-253B-Base, a highly efficient model derived from Llama-405B. After knowledge distillation and alignment, the model shows that large speedups can be achieved without a significant loss of accuracy, even outperforming the original model on some metrics.

Sensitivity of the Final FFN

Empirical tests on smaller models (such as the 49B Puzzle derivative of Llama-70B) show that the final FFN layer in a long sequence of attention-removed blocks is more sensitive to fusion, and including it frequently causes a larger accuracy drop. Excluding this final FFN from the fusion process is therefore often advantageous.

The last Feed Forward Network (FFN) in each attention-removed sequence appears to be particularly important to the model’s representations. Although most layers can be fused safely, including this final FFN frequently causes a noticeable drop in accuracy, so leaving it out of the fused groups is usually the more reliable choice.
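
A hypothetical helper illustrating this heuristic: group maximal runs of attention-removed (FFN-only) blocks as fusion candidates, but keep each run's final FFN sequential. The block-type labels and the minimum run length are illustrative, not taken from the paper.

```python
def fusion_groups(block_types, min_run=3):
    # block_types: e.g. "full" for a normal block, "ffn_only" for a block
    # whose attention was removed by Puzzle. Returns index groups to fuse,
    # always dropping the last FFN of each run (the sensitive one).
    groups, run = [], []
    for idx, kind in enumerate(block_types + ["full"]):   # sentinel flush
        if kind == "ffn_only":
            run.append(idx)
        else:
            if len(run) >= min_run:
                groups.append(run[:-1])
            run = []
    return groups

layout = ["full"] * 10 + ["ffn_only"] * 5 + ["full"] * 3
print(fusion_groups(layout))   # [[10, 11, 12, 13]] -> fuse these four FFNs
```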

Explainability of FFN Fusion

The study offers a hypothesis for why FFN Fusion works: in the fused regions, the FFN layers appear to make comparatively small directional adjustments to the token embeddings. This is supported by a lower ratio of the FFN’s contribution to the input’s magnitude and a smaller cosine distance between the input and output of FFN layers in the fusion zones. The fact that the order of these FFNs could be reversed with little effect on performance further supports the notion of low inter-layer dependence in these regions.
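
These two diagnostics are easy to sketch; the exact definitions used in the paper (for example, whether the cosine distance is taken against the residual output $x + \mathrm{FFN}(x)$ or the raw contribution) are assumptions here.

```python
import torch
import torch.nn.functional as F

def ffn_direction_stats(ffn, x):
    # How much the FFN rotates a token embedding, and how large its
    # contribution is relative to the input's magnitude. Small values for
    # both are what the analysis observes inside the fusable regions.
    delta = ffn(x)
    cos_dist = 1 - F.cosine_similarity(x, x + delta, dim=-1)   # per token
    ratio = delta.norm(dim=-1) / x.norm(dim=-1)                # per token
    return cos_dist.mean().item(), ratio.mean().item()
```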

Exploration of Block Parallelization

Preliminary experiments explore parallelizing entire transformer blocks, encompassing both attention and FFN. Although this is harder because attention and FFN interact, the block dependency analysis can help identify relatively independent block sequences for parallel execution, opening up new architectural design possibilities. The paper notes that exploiting this in large-scale deployments would require more flexible environments than today’s highly optimized inference frameworks.
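
Building on the dependency matrix sketched earlier, a simple greedy scan could flag contiguous runs of blocks whose pairwise dependencies all stay under a threshold; the threshold, minimum length, and greedy strategy are illustrative choices, not the paper's procedure.

```python
def low_dependency_runs(dep, threshold, min_len=2):
    # Greedily extend a window of consecutive blocks while every pairwise
    # dependency inside the window stays below the threshold; emit the window
    # as a parallelization candidate when it can no longer be extended.
    n = dep.shape[0]
    runs, start = [], 0
    for end in range(2, n + 1):
        window_ok = all(dep[i, j] < threshold
                        for i in range(start, end)
                        for j in range(i + 1, end))
        if not window_ok:
            if end - 1 - start >= min_len:
                runs.append(list(range(start, end - 1)))
            start = end - 1
    if n - start >= min_len:
        runs.append(list(range(start, n)))
    return runs
```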

Ultra-253B-Base Creation

  • Puzzle Search: Applied to Llama-405B to meet hardware constraints and reach a 1.5x latency speedup, producing a 253B-parameter baseline with 50 consecutive attention-removed blocks.
  • FFN Fusion: 49 of the 50 consecutive FFN layers were fused into four larger FFN layers.
  • Knowledge Distillation (KD): Performance was recovered after fusion using multi-stage knowledge distillation from the parent model (see the sketch after this list).
  • Alignment: Instruction tuning and RLHF were applied for further optimization.
  • Continuous Pretraining (CPT): A longer CPT run without alignment also produced stronger performance.
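
A minimal sketch of the kind of soft-target distillation objective used in such a recovery phase, assuming logit-level matching between the fused student and the parent teacher; the temperature, weighting, and staging are illustrative rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # KL divergence between the teacher's and student's softened token
    # distributions; training the fused model on this signal helps it
    # recover accuracy lost during fusion.
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)
```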

Key Results

Ultra-253B-Base Performance

Ultra-253B-Base matched or surpassed Llama-405B’s capabilities, achieving state-of-the-art performance on key benchmarks (Arena Hard: 84.92%, HumanEval: 86.58%, MMLU Instruct: 87.54%, MMLU-Pro: 72.25%, MT-Bench: 9.19).

Efficiency Gains

Compared with Llama-405B on a single NVIDIA H100 node, Ultra-253B-Base delivers a 1.71x speedup in inference latency and a 35x reduction in per-token cost at batch size 32. It reaches up to 202 tokens per second with speculative decoding and 90.05 tokens per second on H200.

Memory Reduction

Parameters are reduced from 405B to 253B, and removing half the attention layers cuts kv-cache memory roughly in half.

70B Scale Experiments

Experiments at the 70B scale showed that FFN Fusion also works well on a 49B Puzzle derivative of Llama-70B, with incremental fusion stages exhibiting a trade-off between accuracy and latency reduction. After aggressive fusion, knowledge distillation helped recover performance.

FFN Removal vs. Fusion

Removing FFN layers outright caused noticeably larger accuracy losses than fusing them, underscoring the importance of preserving their computation, just in a parallel form.

Implications and Prospects for the Future

  • New Perspective on Model Interpretability: The observed patterns of inter-layer dependency may offer insight into how LLMs process information.
  • Architectural Innovations: The possibility of parallelizing entire transformer blocks points toward new architectural designs optimized for parallel execution.
  • Extension to MoE Models: Investigating FFN Fusion for Mixture-of-Experts architectures may yield additional efficiency gains.
  • Relationship Between Model Size and Parallelization: More research is needed to understand why FFN Fusion becomes more effective at larger scales.
  • Orthogonal Optimization: FFN Fusion is independent of quantization and pruning, so combining the techniques could yield multiplicative efficiency gains.

Potential Challenges and Considerations

Accuracy Trade-offs

Although FFN Fusion aims for minimal accuracy impact, aggressive fusion can still degrade performance, requiring careful selection of fusion candidates and possibly retraining (e.g., knowledge distillation).

Hardware and Software Support

Realizing the benefits of parallel execution requires efficient implementations of the fused FFN layers, backed by appropriate hardware and inference-framework support.

Complexity of Block Parallelization

Compared to fusing FFN-only layers, parallelizing entire transformer blocks is more complicated and may necessitate major changes to current inference systems.

Finding the Best Fusion Strategies

Determining which Feed Forward Network (FFN) layer sequences to fuse, and how aggressively, may require substantial testing and analysis.

Conclusion

By taking advantage of the parallelism that exists naturally in FFN layer sequences, especially those that emerge after attention pruning, FFN Fusion offers a promising method to greatly increase the inference efficiency of large language models. The development of Ultra-253B-Base provides a powerful illustration of the practical application of this method. For upcoming generations of LLMs, the results pave the way for new research directions in model optimization, interpretability, and architectural design.
