Tuesday, April 1, 2025

Granite Vision: IBM's Open-Source VLM for Charts and Data Visualizations

IBM Granite Vision

IBM's new enterprise vision-language model brings businesses closer to automating a wide range of document-understanding tasks, thanks to AI's ability to retrieve information hidden in tables, charts, and other images.

A picture is worth a thousand words. The vibrant charts, tables, and visuals in any annual report help readers focus on the most important information.

By condensing a sea of words and figures into a concise, compelling narrative, data visualizations make difficult material easier to understand and even memorable. Yet while AI models are excellent at summarizing text, they frequently fall short when it comes to these tidy visualizations.

Grasping the key takeaways in a chart or table requires interpreting intricately interwoven linguistic and graphical information. The graphical data that humans find so appealing can be difficult even for multimodal language models trained on both text and images to understand.

IBM Research aimed to bridge this gap by developing an open-source vision-language model (VLM) that can analyze the staple data visualizations of enterprise reports, such as tables, charts, and other visual representations, in addition to natural images. The initial version of Granite Vision is now available on Hugging Face under an Apache 2.0 license.
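As a hedged illustration, a chart question can be posed to the released checkpoint through Hugging Face transformers along these lines; the checkpoint id, file name, and exact processor calls below are assumptions to verify against the model card:

```python
# Minimal sketch: asking Granite Vision a question about a chart image.
# Checkpoint id and processor usage are assumptions; check the model card.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-vision-3.1-2b-preview"  # assumed checkpoint id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

image = Image.open("revenue_chart.png")  # hypothetical chart from a report
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Which quarter shows the highest revenue?"},
    ],
}]
prompt = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```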

Granite Vision runs quickly and affordably. It is also competitive with other small, open-source VLMs at extracting information from the tables, charts, and diagrams found in well-known document-understanding benchmarks.

The foundation of Granite Vision is IBM's latest 2-billion-parameter Granite language model, which offers improved function calling, a larger context window of 128,000 tokens, and increased accuracy on retrieval-augmented generation (RAG) tasks. Granite Vision was fine-tuned on about 4.2 million natural images and 13.7 million pages of business documents. As with previous Granite releases, IBM thoroughly screened the training data to remove harmful, proprietary, or private material.

Figure: IBM Granite Vision datasets (image credit: IBM)

The model's visual capabilities come from an encoder, which converts input images into numerical visual embeddings, and a projector, which maps those embeddings into text embeddings the language model can read. During training, these representations are aligned with text embeddings corresponding to questions about the image, so that when the model is asked about an image it has never seen, it can extract the relevant information and produce a coherent response.
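A minimal PyTorch sketch of this encoder-plus-projector pattern is below; the module shapes, the linear stand-in for a pretrained vision encoder, and the two-layer MLP projector are illustrative assumptions, not Granite Vision's actual architecture:

```python
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Illustrative encoder + projector pipeline (dimensions are made up)."""

    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g., a ViT) that maps
        # image patches to visual embeddings.
        self.encoder = nn.Linear(vision_dim, vision_dim)
        # MLP projector that maps visual embeddings into the language
        # model's text-embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        visual = self.encoder(patch_features)  # (batch, patches, vision_dim)
        return self.projector(visual)          # (batch, patches, llm_dim)

# The projected tokens sit alongside text embeddings in the LLM's input,
# so a question about an unseen image can attend to its visual content.
proj = VisionToTextProjector()
image_patches = torch.randn(1, 576, 1024)  # e.g., a 24x24 patch grid
print(proj(image_patches).shape)           # torch.Size([1, 576, 2048])
```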

Beyond raw images, Granite Vision was trained on 80.3 million question-answer pairs grounded in document images and 16.3 million pairs grounded in natural photographs, nearly 100 million pairs in all, each matched to the content of its image.

Synthetic data for comprehension of tables, charts, and diagrams

Researchers focused this initial Granite Vision release on document understanding, which entails dissecting a page's layout and visual components to draw high-level conclusions about its content.

Using Docling, IBM's open-source document-conversion engine, they created a structured dataset from 85 million raw PDF pages collected from software applications and the internet, including receipts, business forms, and images from auto accidents.
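Docling itself is open source, and a basic conversion looks roughly like the following; the input file name is hypothetical, and the API surface should be checked against the Docling documentation:

```python
# Sketch: converting a PDF into structured text with Docling.
# The input file is hypothetical; verify the API against the Docling docs.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("quarterly_report.pdf")

# Export the recovered layout (headings, tables, figures) as Markdown,
# the kind of structured record a synthetic-QA pipeline can build on.
print(result.document.export_to_markdown())
```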

From a randomly chosen sample of this data, they generated 26 million synthetic question-and-answer pairs using a Mistral LLM. To produce more difficult questions, they added verbalized descriptions of graphical features to the underlying documents and supplemented tables with extra computations.
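As a hedged sketch of what such LLM-driven synthesis can look like (the prompt, model id, decoding settings, and helper function here are illustrative, not IBM's published pipeline):

```python
# Illustrative sketch of synthesizing QA pairs from a verbalized table.
# Model choice and prompt wording are assumptions, not IBM's pipeline.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed Mistral variant
    device_map="auto",
)

def synthesize_qa(table_text: str, n_pairs: int = 3) -> str:
    # Ask for computation-based questions, mirroring the table
    # augmentation strategy described above.
    prompt = (
        "Below is a table from a business document:\n"
        f"{table_text}\n\n"
        f"Write {n_pairs} question-answer pairs about this table. "
        "Include at least one question that requires a computation "
        "(e.g., a sum or a difference between rows)."
    )
    out = generator(prompt, max_new_tokens=256, do_sample=True)
    return out[0]["generated_text"]

table = "Region, Q1, Q2\nEMEA, 120, 140\nAPAC, 90, 110"
print(synthesize_qa(table))
```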

They reasoned that harder questions would give IBM Granite Vision a deeper comprehension of the content, which primarily consisted of tables, charts, and diagrams illustrating business processes. They also added invoices, resumes, and other forms with predefined fields that are difficult for machines to parse.

This targeted instruction data may explain why Granite Vision outperformed other VLMs, including some twice its size or larger, on the well-known ChartQA benchmark and on IBM's new LiveXiv benchmark, which is updated every month to reduce the likelihood that a model was trained on the test data.

Figure: ChartQA benchmark and IBM's new LiveXiv benchmark (image credit: IBM)

Visual documents and beyond

Using an AI to deconstruct visual materials can save time in the workplace. It can also help businesses automate visual-reasoning jobs that are either extremely repetitive or call for a precision that only machines can attain.

In subsequent Granite Vision versions, researchers intend to improve the model's ability to analyze natural images so that it can take on additional enterprise tasks. This could include analyzing hundreds of bills at once, detecting product flaws, or extracting information about auto accidents from photos.

Additionally, they plan to extend the model to multi-page documents. Training large language models (LLMs) and VLMs on multi-page data is difficult because model context windows are often too small. To handle the spillover, models must process the data at a reduced resolution, which can degrade performance. Generating queries that reference several pages of information at once is also more technically challenging.

In upcoming Granite Vision releases, the team intends to include a customizable safety module that filters incoming user prompts for harmful or inappropriate content. The module offers a way to teach the model to identify dangerous text and images via sparse attention vectors, without altering the model's weights and possibly impairing its overall performance.
