Optimizing LLM Chat Performance on Intel Flex GPUs with RAG

By Drakshi

June 2, 2024

This article examines how RAG was deployed on an LLM chat platform and how Intel Flex GPUs were used for the Twixor chat solution.

Components of the RAG Solution

Now, let’s examine the RAG platform and how the solution functions.

Haystack Platform for LLMs

deepset created the open-source Python framework Haystack to facilitate the development of custom applications built on large language models (LLMs). Its goal is to make it easier to build state-of-the-art natural language processing (NLP) systems by offering an extensive collection of tools and components. This is a summary of the Haystack framework, emphasizing Retrieval Augmented Generation (RAG) and the InMemoryDocumentStore, two of its main components.

In-Memory Document Store

The InMemoryDocumentStore is a lightweight document store that keeps documents in memory. It is not recommended for production workloads because of its limitations: it cannot persist data, and it must scan all documents for each query. It is, however, well suited to experimentation and small-scale applications.
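To make the two limitations concrete, here is a minimal sketch of a toy in-memory store. The class `ToyInMemoryStore` and its methods are hypothetical and written only for illustration; this is not Haystack's implementation or API. It shows that documents live only in a Python dictionary and that every query walks the full collection.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInMemoryStore:
    """Hypothetical stand-in for an in-memory document store (not Haystack's class)."""

    docs: dict = field(default_factory=dict)  # doc_id -> text, held only in RAM

    def write(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text

    def search(self, keyword: str) -> list:
        # No index is kept, so every query scans every stored document.
        needle = keyword.lower()
        return [doc_id for doc_id, text in self.docs.items() if needle in text.lower()]


store = ToyInMemoryStore()
store.write("faq-1", "Reset your password from the account settings page.")
store.write("faq-2", "Refunds are processed within five business days.")
hits = store.search("password")
print(hits)
```

Because nothing is written to disk, a new `ToyInMemoryStore()` always starts empty, which mirrors the lack of persistence described above.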

Sentence Transformers

The Sentence Transformers library is an NLP tool for creating semantically rich embeddings for sentences, paragraphs, and images. These embeddings work with a variety of NLP applications. Its broad selection of pretrained models, ease of use, and fine-tuning support make it valuable for both developers and researchers.

RAG Implementation

Twixor used Retrieval Augmented Generation (RAG) to improve the accuracy of its chat responses. The use case was to index and vectorize a customer knowledge base so that it could be queried using RAG.

The Haystack framework was used to deploy RAG. The customer knowledge base documents, which were in PDF format, were vectorized with the Sentence Transformers library.

Sentence Transformers is a popular Python library for creating dense vector representations (embeddings) of text data such as sentences, paragraphs, and documents.

Text data can be vectorized with it as follows:

Loading a Pre-trained Model

A pre-trained Sentence Transformer model must first be loaded. Numerous models that have been pre-trained on diverse datasets for a range of tasks are readily available.
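As a rough sketch, loading a model and passing text to it might look like the following. The model name `all-MiniLM-L6-v2` is an assumption chosen for illustration (the article does not say which model Twixor used), and `embed_texts` is a hypothetical helper name.

```python
def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Load a pre-trained model and return one embedding vector per input string."""
    # Imported inside the function so this file still loads where the
    # sentence-transformers package is not installed.
    from sentence_transformers import SentenceTransformer

    # model_name is an assumed example; substitute the model you actually use.
    model = SentenceTransformer(model_name)  # downloads the weights on first use
    return model.encode(texts)               # one row per input string


# Example call (needs the package installed and network access for the first download):
# vectors = embed_texts(["How do I reset my password?", "Where is my refund?"])
```

Loading the model is the expensive step, so a long-running service would typically load it once and reuse it rather than reloading it on every call as this sketch does.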

Generating Embeddings

After loading the model, you can call the model.encode() method with a list of strings to create vector embeddings for your text data.

Using Embeddings

These embeddings can then be applied to a number of natural language processing tasks, such as:

Semantic search: Compute the cosine similarity between the query and document embeddings to identify the most relevant documents.
Clustering: Group related sentences or texts together using their embeddings.
Deduplication: Compare embeddings to find and eliminate near-duplicate texts.
Classification: Train a classifier on the embeddings for text classification problems.
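The semantic search item can be illustrated with a small, dependency-free sketch. The three-dimensional vectors and document names below are made up for readability; real embeddings have hundreds of dimensions and would come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors for illustration only.
doc_vecs = {
    "password-reset": [0.9, 0.1, 0.0],
    "refund-policy": [0.1, 0.9, 0.2],
    "shipping-times": [0.0, 0.3, 0.9],
}
query_vec = [0.8, 0.2, 0.1]
ranking = rank_documents(query_vec, doc_vecs)
print(ranking[0][0])
```

The same similarity function could serve deduplication by flagging pairs of documents whose score exceeds a chosen threshold.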

Using the InMemoryDocumentStore:

The team used Haystack’s InMemoryDocumentStore, a simple and lightweight document store intended for rapid development and experimentation. It is well suited to testing and small-scale applications because it doesn’t require any other services or dependencies.

RAG Findings:

The team conducted a qualitative analysis of chat responses before and after implementing RAG. Intel found that the Retrieval-Augmented Generation (RAG) implementation on the NeuralChat model considerably improved the response quality of the large language model (LLM) by incorporating external, current, and contextually relevant data into the model’s generative process.

The RAG deployment for Twixor resulted in the following significant improvements:

Up-to-Date and Accurate Information: Because LLMs are usually trained on static datasets, their knowledge is restricted to the data that was available as of the training cut-off date. The RAG implementation for Twixor overcame this restriction by retrieving the most recent and relevant information from external sources, so the responses were current and precise.

Decrease in Hallucinations: LLMs hallucinate when they generate information that seems plausible but is inaccurate or incoherent. The implementation of RAG reduced hallucinations and improved the factual quality of the generated information, because responses were now grounded in dependable sources.

Improved Contextual Relevance: With RAG, the application retrieved information and documents that were highly relevant to the user’s query. This ensured that the generated answers were accurate and relevant to the query’s particular context.

Enhanced Reliability and Trust: RAG increased user trust by providing sources and citations for the data used to generate the responses. The accuracy of the responses could be verified, which is crucial for applications that need a high level of dependability.

Cost-Effectiveness and Efficiency: RAG was a cost-effective solution because it leveraged data that was already available and did not require lengthy retraining of the LLM.

Flexibility and Adaptability: By integrating domain-specific knowledge bases, Intel was able to adapt RAG to different domains. This adaptability lets the NeuralChat-derived LLM perform well on specific tasks without resource-intensive domain-specific fine-tuning.

Scalability: Research indicates that LLM chat performance improves as more data is made available for retrieval. Because RAG scales well with large datasets, it improves response quality even with very large volumes of data.
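Several of these improvements come from how retrieved passages are placed in front of the model. The following is a minimal sketch of that step, assuming retrieval has already returned a list of passages; the function name `build_rag_prompt`, the prompt wording, and the sample passages are all hypothetical and not taken from the Twixor deployment.

```python
def build_rag_prompt(question, passages):
    """Assemble a prompt that grounds the answer in retrieved, numbered sources."""
    context_lines = [
        f"[{i}] ({p['source']}) {p['text']}" for i, p in enumerate(passages, start=1)
    ]
    return (
        "Answer using only the numbered sources below and cite them like [1].\n"
        "If the sources do not contain the answer, say so.\n\n"
        "Sources:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieval results; a real system would fetch these from the document store.
passages = [
    {"source": "returns.pdf", "text": "Refunds are processed within five business days."},
    {"source": "account.pdf", "text": "Passwords can be reset from the settings page."},
]
prompt = build_rag_prompt("How long do refunds take?", passages)
print(prompt)
```

Numbering the sources is one simple way to let the model cite them, which supports the verification and trust points above. Updating the knowledge base changes the passages without any retraining of the model.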

NeuralChat on the Intel Data Center GPU Flex Series 140

During the second stage of this project, the team ran inference with the NeuralChat LLM for Twixor on the Intel Data Center GPU Flex Series 140. The objective was to evaluate the latency of a customer service chat application using a GPU that can be deployed at the edge in conjunction with a Xeon CPU and AMX.

Intel Data Center GPU Flex Series 140

The Intel Data Center GPU Flex Series 140 can accelerate AI visual inference workloads in the data center. The following are some salient features of its AI inference capabilities:

AI Inference Capabilities

With two DG2-128 GPUs, each with 1024 cores, the Flex 140 has a total of 2048 cores and can handle AI inference workloads in parallel.
Running inference on the GPU requires few code changes with popular AI frameworks and libraries such as TensorFlow, PyTorch, and Intel’s OpenVINO toolkit.
When compared to NVIDIA A10 GPUs, Intel promises up to two times the AI inference throughput at half the power consumption.

Hardware Acceleration

Matrix multiplication and convolution operations are two AI tasks that the Flex 140 GPUs are specifically designed to accelerate using dedicated hardware.
To speed up ray tracing for visual inference tasks, each GPU has eight ray tracing cores.
For efficient AI computation, the GPUs support reduced-precision data types such as INT8 and BF16.
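To show what reduced precision means in practice, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. This is one common scheme chosen for illustration; it is an assumption and does not describe how the Flex GPUs or Intel's libraries quantize models internally. The weight values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0  # one float step per integer step
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 1.00]  # made-up values for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
```

Each restored value differs from the original by at most half a quantization step, which is the accuracy traded for smaller storage and faster integer arithmetic.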

Open Software Stack

The Flex GPUs have an open, standards-based software stack thanks to Intel’s oneAPI programming model.
As a result, programmers may create cross-architecture AI solutions that work with different CPUs, GPUs, and accelerators.
For effective AI implementation, Intel offers optimised libraries such as Intel AI Analytics Toolkit, OpenVINO, and oneDNN.

Chat Solution: NeuralChat LLM with Intel Data Center GPU Flex Series 140

An earlier Intel case study and blog series showed how Intel collaborated with Twixor, a customer, to choose and optimize an LLM with Intel AMX for their chat solution. Twixor required this chat application to be available at the edge, where the CPUs might not have AMX capabilities, so this solution extends the previous case study.

The foundation of this solution is edge deployment using the Intel Data Center GPU Flex Series and Intel software optimizations. As a starting point for chat applications, Twixor looked into existing open-source LLMs from the Hugging Face community. The LLM for the chat application was once again the Intel-optimized NeuralChat, this time together with RAG and the Intel Flex GPU 140.

Testing NeuralChat-7B on Intel GPUs

To replicate the edge use case, an Intel Flex Series GPU 140 was installed in a PCIe slot of a generic x86 server. The slot was a standard one, and the GPU card’s additional power consumption was within what an edge server could handle. A virtual machine with four virtual CPUs and 16 GB of RAM was created using VMware virtualization, and the two GPUs on the Intel Flex Series GPU 140 card were passed through to the virtual machine.

LLM Chat Latency

The same scripts were then used to run NeuralChat-7B with INT4 quantization on the GPUs, and the LLM chat latency for 90 tokens was measured. Using the Intel optimizations for PyTorch and Transformers on these GPUs, the team was able to attain performance equivalent to Xeon with AMX, as the table below demonstrates.
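A latency measurement of this kind can be sketched as a small timing harness. The harness below is an assumption about how such a test might be structured, not the scripts used in the study; `fake_generate` is a hypothetical stand-in for the real model call and simply sleeps, so the numbers it produces say nothing about actual GPU performance.

```python
import statistics
import time


def measure_latency(generate, prompt, max_new_tokens=90, runs=5, warmup=1):
    """Time repeated calls to `generate` and report per-run latency in seconds."""
    for _ in range(warmup):
        generate(prompt, max_new_tokens)  # warm-up runs are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt, max_new_tokens)
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "min_s": min(timings),
        "max_s": max(timings),
        "runs": runs,
    }


def fake_generate(prompt, max_new_tokens):
    # Hypothetical stand-in for the real model call; sleeps 1 ms per token.
    time.sleep(0.001 * max_new_tokens)
    return "x" * max_new_tokens


stats = measure_latency(fake_generate, "Where is my order?", runs=3)
print(stats)
```

Excluding warm-up runs matters because the first call often includes one-time costs such as model loading or kernel compilation that would distort the average.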

Conclusion

In conclusion, Intel found that RAG improves contextual relevance and reliability, decreases hallucinations, and gives access to up-to-date, accurate, and relevant information, all of which help NeuralChat LLMs provide better responses. RAG is a powerful tool for deploying LLMs in dynamic, knowledge-intensive situations.