Intel Labs Introduces RAG-FiT, an Open-Source Framework for Retrieval-Augmented Generation in LLMs
Highlights
- RAG-FiT is an open-source framework developed by Intel Labs to enhance large language models for use cases including retrieval-augmented generation.
- RAG-FiT, which is available under an Apache 2.0 license, creates data-augmented datasets for training and assessing LLMs by combining data generation, training, inference, and evaluation into a single workflow.
- Users may rapidly develop and test various RAG strategies with the Python-based framework, which acts as an end-to-end experimentation environment.
RAG-FiT is an open-source framework developed by Intel Labs to enhance large language models (LLMs) for use cases including retrieval-augmented generation (RAG). Available under an Apache 2.0 license, RAG-FiT helps create data-augmented datasets for training and assessing LLMs in RAG settings by combining data generation, training, inference, and evaluation into a single workflow. This integration lets users quickly create datasets and train RAG models on internal or specialised knowledge sources, facilitating rapid prototyping and experimentation with different RAG methodologies.
The library helps prepare data for training models using parameter-efficient fine-tuning (PEFT), which lets users fine-tune only a subset of a model’s parameters. The Python-based framework functions as an end-to-end experimentation environment, so users can quickly prototype and experiment with various RAG techniques, such as data selection, aggregation and filtering, retrieval, text processing, document ranking, few-shot generation, template-based prompt design, fine-tuning, inference, and evaluation.
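To illustrate the PEFT idea, the sketch below uses the Hugging Face peft library to attach LoRA adapters to a causal LM so that only a small fraction of parameters is trainable. It is a minimal example, not RAG-FiT’s own API; the model name, target modules, and LoRA hyperparameters are placeholders.

```python
# Illustrative PEFT/LoRA sketch using the Hugging Face `peft` library.
# Model name and hyperparameters are placeholders, not RAG-FiT defaults.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the adapter weights
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (placeholder choice)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# Only the LoRA adapter weights are trainable; the base model stays frozen.
model.print_trainable_parameters()
```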
To illustrate the efficacy of the RAG-FiT framework (previously known as RAG Foundry), researchers from Intel Labs augmented and fine-tuned Llama 3 and Phi-3 models with various RAG configurations, demonstrating consistent gains across three knowledge-intensive question-answering tasks.
Utilising RAG Systems to Overcome LLM Limitations
Notwithstanding their remarkable capabilities, LLMs are inherently limited. These models struggle with factual correctness, lack access to current information beyond their training cutoff, and have trouble attending to pertinent information in long contexts. They can also generate responses that seem plausible but are inaccurate or nonsensical.
Through the use of retrieval methods, RAG integrates external information to improve LLM performance. Retrieving specific facts from knowledge sources outside the model efficiently addresses these knowledge limits, which can reduce hallucinations, increase the relevancy of generated material, offer interpretability, and potentially cut costs substantially. Additionally, recent studies show that optimising LLMs for RAG can produce state-of-the-art results, outperforming larger proprietary models.
How RAG-FiT Works
The core of the RAG-FiT library is made up of four separate modules: data creation, training, inference, and evaluation. These modules serve as an experimental environment for researchers. Each module is encapsulated and controlled by a configuration file, which guarantees that the output of one module is compatible with the input of the next. This modular approach isolates each phase for independent experimentation, making it possible to produce different outputs and run several experiments simultaneously. Both the produced outputs and any attribute of the data, such as retrieval, ranking, and reasoning, may be evaluated.
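The following schematic sketch shows how such a config-driven, modular pipeline can be chained, with each stage reading its own configuration and producing an artifact consumed by the next. The function names, config keys, and file paths are hypothetical and do not represent RAG-FiT’s actual interface.

```python
# Schematic sketch of a four-stage, config-driven RAG pipeline.
# Function names, config keys, and file paths are hypothetical, not RAG-FiT's actual API.
import json

def run_stage(stage_fn, config_path):
    """Run one module from its own config file; its output path feeds the next module."""
    with open(config_path) as f:
        config = json.load(f)
    return stage_fn(config)  # each stage writes config["output_path"] and returns it

# processed = run_stage(process_dataset,  "configs/processing.json")   # augmented dataset
# adapter   = run_stage(train_model,      "configs/training.json")     # fine-tuned adapter
# results   = run_stage(generate_answers, "configs/inference.json")    # predictions file
# report    = run_stage(evaluate,         "configs/evaluation.json")   # metric report
```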

Dataset Creation
By preserving RAG interactions, which are crucial for RAG-oriented training and inference, the processing module makes it easier to create context-enhanced datasets. These interactions include dataset loading, column normalisation, data aggregation, information retrieval, template-based prompt construction, and other pre-processing operations. Saving the processed data in a consistent, model-independent format, along with all related metadata, ensures compatibility and reproducibility across models and experiments.
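A minimal sketch of what one such processing step might produce: a prompt built from a template and retrieved passages, stored in a model-independent record that keeps the retrieval results for later evaluation. The template wording and record fields are illustrative assumptions, not RAG-FiT’s format.

```python
# Minimal sketch of template-based prompt construction with retrieved context.
# Template wording and record fields are illustrative, not RAG-FiT's format.
PROMPT_TEMPLATE = (
    "Answer the question based on the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_example(question, retrieved_docs, answer):
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return {
        "question": question,
        "retrieved_docs": retrieved_docs,  # kept so retrieval itself can be evaluated later
        "prompt": PROMPT_TEMPLATE.format(context=context, question=question),
        "answer": answer,                  # gold answer used for training and evaluation
    }
```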
Global dataset sharing allows the processing module to handle several datasets simultaneously. This feature increases flexibility and permits sophisticated processing operations by enabling access to any of the imported datasets at every stage of the pipeline. Step caching is another feature of the module that allows each pipeline step to be locally cached. This increases computational efficiency and makes reproducing results easier.
Training
Users can train any model on the augmented datasets. The training module fine-tunes models on the datasets produced by the preceding processing module, using TRL, the well-known framework for transformer reinforcement learning. The module also supports efficient training approaches such as PEFT and low-rank adaptation (LoRA), which tailor the LLM to specific use scenarios without retraining the entire model.
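The sketch below shows supervised fine-tuning with TRL’s SFTTrainer and a LoRA configuration, assuming a JSONL file of augmented examples with a "text" field. It is a minimal illustration rather than RAG-FiT’s training module, and exact argument names may vary across TRL versions; the model name, file path, and hyperparameters are placeholders.

```python
# Minimal supervised fine-tuning sketch with TRL and LoRA (illustrative, not RAG-FiT's API).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Assumes a JSONL file of augmented examples with a "text" field
# (prompt plus gold answer), as produced by a processing step.
train_dataset = load_dataset("json", data_files="augmented_train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",   # placeholder; any causal LM works
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="rag-sft-lora"),
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```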
Inference
The inference module generates predictions on the augmented datasets using trained or untrained LLMs. Because inference requires more computing power than evaluation, it is conceptually distinct from the evaluation stage, and users can run multiple evaluations against a single prepared inference results file.
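Conceptually, the inference stage generates an answer for each augmented prompt once and writes the results to a file that later evaluations can reuse. The sketch below illustrates this with a Hugging Face text-generation pipeline; the checkpoint path, output file, and field names are assumptions, not RAG-FiT’s interface.

```python
# Illustrative inference sketch (not RAG-FiT's API): generate predictions for each
# augmented prompt once, then reuse the saved results file for multiple evaluations.
import json
from transformers import pipeline

generator = pipeline("text-generation", model="rag-sft-lora")  # placeholder checkpoint path

def run_inference(examples, output_path="inference_results.jsonl"):
    with open(output_path, "w") as f:
        for ex in examples:
            out = generator(ex["prompt"], max_new_tokens=128, return_full_text=False)
            ex["prediction"] = out[0]["generated_text"]
            f.write(json.dumps(ex) + "\n")
    return output_path
```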
Evaluation
Users may run or quickly customise metrics including Exact Match (EM), F1 Score, ROUGE, BERTScore, DeepEval, Ragas, Hugging Face Evaluate, and classification-based metrics. Metrics can be computed locally on each sample or globally on the complete dataset, as with recall for classification-based metrics, and they can use any field in the dataset, including reasoning, citations, attributions, retrieval results, and input and output texts. Furthermore, the evaluation module uses a processing step known as an Answer Processor, which can implement custom logic and carry out a variety of tasks, such as output alignment and cleaning.
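As a rough illustration of local, per-sample scoring, the sketch below pairs a simple answer-processing step (lowercasing, punctuation stripping, whitespace collapsing) with Exact Match and token-level F1. The normalisation rules and function names are assumptions for illustration only, not RAG-FiT’s Answer Processor.

```python
# Illustrative evaluation sketch: an answer-processing step that normalises outputs,
# followed by per-sample (local) Exact Match and token-level F1.
import re
import string

def process_answer(text):
    """Clean and align a generated answer before scoring."""
    text = text.lower().strip()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", "", text)
    return re.sub(r"\s+", " ", text)

def exact_match(prediction, gold):
    return float(process_answer(prediction) == process_answer(gold))

def token_f1(prediction, gold):
    pred_tokens, gold_tokens = process_answer(prediction).split(), process_answer(gold).split()
    common = sum(min(pred_tokens.count(t), gold_tokens.count(t)) for t in set(pred_tokens))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_tokens), common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```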
Utilising RAG-FiT Augmentation Methods
To demonstrate the usefulness of the framework, researchers from Intel Labs ran experiments using retrieval, fine-tuning, chain-of-thought (CoT) reasoning, and a negative distractor documents approach. Applying these augmentation techniques to three knowledge-intensive question-answering datasets (TriviaQA, PubMedQA, and ASQA), the team examined two commonly used baseline models, Llama 3 and Phi-3. The TriviaQA and PubMedQA datasets already provide pertinent context, while context for ASQA was retrieved using a dense retriever over a Wikipedia corpus.

The group measured and reported EM for TriviaQA, STR-EM for ASQA, and accuracy and F1 Score for PubMedQA. Researchers also assessed two Ragas metrics: relevancy (the relation between the generated text and the question) and faithfulness (the relation between the generated text and the context). Across the three knowledge-intensive question-answering tasks, the two models demonstrated steady gains overall.
For TriviaQA, retrieved context and fine-tuning in the RAG setting improved performance, whereas fine-tuning with CoT reasoning, which involves training on both gold and distractor passages, decreased it; the optimal approach for this dataset depends on the model. For ASQA, every approach outperformed the baseline, CoT reasoning consistently improved both models, and the fine-tuned CoT setup performed best. Lastly, for PubMedQA, practically all techniques outperformed the baseline (with one exception); CoT reasoning outperformed the untrained RAG setting, but after fine-tuning, the RAG configuration outperformed the others.
Finally, the faithfulness and relevancy scores frequently did not correlate with the primary metrics or with one another, which may suggest a performance trade-off and that they capture distinct facets of the retrieval and generation results.
The outcomes show how effective RAG approaches are in enhancing performance and how important it is to thoroughly assess various RAG system components across a range of datasets.