Use the OpenVINO Toolkit to Create a Browser Extension for RAG Applications
Retrieval Augmented Generation (RAG) is a technique designed to improve the quality of responses produced by Large Language Models (LLMs) with the aid of an external knowledge source. In RAG-based applications, LLMs draw on external knowledge sources, such as documents and webpages, to generate content grounded in source data.
This article describes how to create a RAG-based browser extension that summarizes the text from a web URL or an uploaded PDF using the Intel Distribution of OpenVINO Toolkit (OpenVINO toolkit).
OpenVINO Toolkit
The OpenVINO toolkit is a robust toolkit for optimizing, accelerating, and deploying deep learning inference with best-in-class performance on a variety of Intel processors and other hardware platforms. Building RAG-based browser extensions with the OpenVINO toolkit offers a number of benefits, including model compression, hardware acceleration, and deployment of RAG models across operating systems such as Windows and Linux.
To enable offloading LLM inference onto Intel GPUs, this article demonstrates how to convert Hugging Face (HF) models to the OpenVINO Intermediate Representation (IR) format using the OpenVINO toolkit.
RAG-Based Browser Extension
This RAG-based browser extension enhances the browsing experience by combining an LLM with RAG to summarize text from webpages or PDFs. After retrieving relevant data from the source, it presents the user with a concise summary. The extension also supports interactive question answering, letting users ask follow-up questions about the summarized content. This makes it a useful tool for quickly understanding complex information, with the flexibility to dig deeper through a conversational interface.
The flowchart below shows how the RAG-based browser extension works:

- First, the user selects the LLM, which is loaded onto the GPU with the OpenVINO toolkit.
- Next, the user uploads a PDF or supplies a URL, from which the summary is produced.
- After the summary is generated, the extension displays a dedicated chatbot interface where the user can ask follow-up questions.
Components
Built from LangChain components, the RAG-based browser extension efficiently pre-processes user-provided data, including text chunking, embedding, and storage, to produce accurate and useful outputs. The main components used in this extension are described below, followed by a minimal sketch of how they fit together:
- Optimum Intel: The Optimum Intel API simplifies converting, optimizing, and running LLMs for inference. It reduces model size and achieves faster inference while supporting conversion to the OpenVINO IR format, optimization with the Neural Network Compression Framework (NNCF), and integration with other OpenVINO tools. OpenVINO-optimized models are loaded, and model compilation is offloaded to Intel GPUs, using the OVModelForCausalLM class.
- Document Loaders: LangChain supports more than 100 document loaders for different document types. This extension extracts text from webpages and PDFs using WebBaseLoader and PyPDFLoader, respectively.
- Text Splitters: To get the best performance out of the language model, large documents are divided into smaller pieces using LangChain's built-in text splitters. This extension uses RecursiveCharacterTextSplitter to divide text into manageable chunks.
- Text Embedding Models: Text embeddings provide vector representations of text, enabling semantic similarity measures between pieces of content. This extension converts text chunks into embeddings using HuggingFaceEmbeddings from LangChain with Sentence Transformers models, enabling semantic search.
- Vector Store: Text embeddings are stored and managed in Chroma, a vector store integrated with LangChain. This specialized database enables efficient retrieval of high-dimensional vectors for relevant information.
- Retrievers: During the querying phase, the user's input is embedded into a vector, and the system retrieves the most relevant vectors based on similarity. LangChain's semantic search retriever helps quickly locate the most relevant information.
- Chain: The RetrievalQA chain fetches relevant documents from the retriever and generates answers based on the retrieved content, enabling interactive question answering.
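The sketch below shows how these components might fit together end to end. It is a minimal, illustrative example, not the repository's exact code: it assumes a model already converted to OpenVINO IR format in a local ov_qwen7b directory (produced in the next section), and the embedding model name, example URL, chunk sizes, and LangChain module paths are assumptions that may vary by library version.

# Minimal sketch of the RAG pipeline described above (illustrative).
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

# 1. Load source text from a URL (PyPDFLoader would handle an uploaded PDF).
docs = WebBaseLoader("https://example.com/article").load()

# 2. Split the document into overlapping chunks for embedding.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3. Embed the chunks with a Sentence Transformers model and store them in Chroma.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Load the OpenVINO IR model and compile it for the Intel GPU.
tokenizer = AutoTokenizer.from_pretrained("ov_qwen7b")
ov_model = OVModelForCausalLM.from_pretrained("ov_qwen7b", device="GPU")
generator = pipeline("text-generation", model=ov_model, tokenizer=tokenizer, max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=generator)

# 5. Combine the retriever and the LLM into a RetrievalQA chain.
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())
print(qa.invoke({"query": "Summarize this page in three sentences."}))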

Code Sample Implementation
This code sample demonstrates how to build a browser extension that quickly and effectively summarizes webpages (by URL) and PDFs (via upload), using a Flask backend with OpenVINO for inference.
Before running the code sample, set up the environment:
- Install the required software, such as Miniforge and Git.
- Clone the code sample repository:
git clone https://github.com/intel/AI-PC_Notebooks.git
cd AI-PC_Notebooks/Text-Summarizer-Browser-Plugin
- Create a conda environment and install the required packages:
conda create -n summarizer_plugin python=3.11 libuv
conda activate summarizer_plugin
pip install -r requirements.txt
I. Get the HF Models and Convert Them to OpenVINO IR Format
Create a token by logging into the HF Hub. Follow the steps outlined on this page to gain access to private or gated models. Use the optimum-cli command to download the models (Llama-2-7b and Qwen2-7B) and convert them into OpenVINO IR format:
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format int4 ov_llama_2
optimum-cli export openvino --model Qwen/Qwen2-7B-Instruct --weight-format int4 ov_qwen7b
Note: Since the Llama models are gated, submit an access request before downloading them.
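To confirm a conversion succeeded, the IR model can be loaded with Optimum Intel and given a quick generation test. The snippet below is a minimal sketch; the prompt and generation parameters are illustrative, and device="GPU" assumes an Intel GPU is available (use "CPU" otherwise).

# Minimal smoke test for a converted model (illustrative).
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "ov_qwen7b"  # directory produced by the optimum-cli command above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# device="GPU" compiles the model for an Intel GPU; fall back to "CPU" if none is present.
model = OVModelForCausalLM.from_pretrained(model_dir, device="GPU")

inputs = tokenizer("OpenVINO is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))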
II. Load the Chrome Extension
Consult Chrome’s development docs to load a custom extension in developer mode.
III. Run the Sample
- Navigate to the backend folder and start the Flask server by running python server.py in the terminal (or Command Prompt on Windows). A minimal sketch of such a backend endpoint appears after this list.
- Pin and open the loaded extension.
- Select the model using the drop-down menu.
- Next, select PDF or Web Page.
- Web Page Summarizer:
- Enter the URL of the webpage to summarize.
- Click "Summarize."
- Once the text has been summarized, users can ask follow-up questions.
- PDF Summarizer:
- Upload a PDF file.
- Click "Upload & Summarize."
- Once the text has been summarized, users can ask follow-up questions.
- To start over, either reopen the extension or refresh the page.
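For orientation, the following is a hypothetical, minimal version of what a Flask summarization endpoint could look like. The route name and JSON fields are assumptions, not the repository's actual API, and the placeholder summary stands in for the full RAG pipeline described earlier; see server.py in the repository for the real implementation.

# Hypothetical minimal Flask backend for the URL summarizer (illustrative only).
from flask import Flask, jsonify, request
from langchain_community.document_loaders import WebBaseLoader

app = Flask(__name__)

@app.route("/summarize", methods=["POST"])  # route name is an assumption
def summarize():
    url = request.json.get("url")
    docs = WebBaseLoader(url).load()
    # In the real extension, the text would be chunked, embedded, stored in
    # Chroma, and summarized by the OpenVINO-backed LLM as described above.
    text = " ".join(doc.page_content for doc in docs)
    return jsonify({"summary": text[:500]})  # placeholder: first 500 characters

if __name__ == "__main__":
    app.run(port=5000)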
What Comes Next
Take advantage of Intel's AI PCs to foster creativity and accelerate the development of generative AI applications through iterative experimentation.
Browser extensions can leverage LLMs and vector databases to deliver better user experiences. With seamless integration into web content and real-time responses, users can access information, create original material, and speed up their workflows. Also explore the GenAI playground GitHub repository, which contains several sample notebooks demonstrating how to build complete GenAI applications on AI PCs.