The first openly available models of their kind, DataGemma ground LLMs in the massive, real-world statistical datasets of Google's Data Commons, helping to reduce hallucinations.
The large language models (LLMs) driving today's AI advances are increasingly sophisticated. They can sift through enormous volumes of text to produce summaries, suggest new creative directions, and even draft code. Yet despite these impressive capabilities, LLMs sometimes present false information with confidence. This behavior, known as "hallucination," is one of the central challenges of generative AI.
Today, Google is presenting new research that tackles this problem directly by grounding LLMs in real-world statistical data, which helps reduce hallucinations. Alongside these research advances, it is releasing DataGemma, the first openly available models designed to connect LLMs with the vast store of real-world data in Google's Data Commons.
Google Data Commons
Data Commons: An extensive collection of reliable data that is accessible to the public
Data Commons is a publicly accessible knowledge graph with more than 240 billion rich data points spanning hundreds of thousands of statistical variables. It draws this public data from trusted sources such as the World Health Organization (WHO), the United Nations (UN), the Centers for Disease Control and Prevention (CDC), and national census bureaus. Bringing these datasets together into a single, cohesive collection of tools and AI models serves researchers, policymakers, and organizations looking for precise insights.
Think of Data Commons as an enormous, ever-growing library of trustworthy, publicly available data on subjects ranging from economics and health to demographics and the environment, which you can explore using Google's AI-powered natural language interface. For instance, you can investigate which African nations have seen the biggest increases in access to electricity, how income correlates with diabetes across US states, or any data-driven question of your own.
How to combat hallucinations with Data Commons
Google’s goal is to ground the growing use of generative AI by integrating Data Commons into Gemma, its family of lightweight, state-of-the-art open models built from the same research and technology as the Gemini models. Researchers and developers can start using these DataGemma models today.
DataGemma extends the capabilities of the Gemma models, drawing on the knowledge in Data Commons to improve LLM factuality and reasoning through two distinct methods:
- RIG (Retrieval-Interleaved Generation) enhances Gemma 2 by proactively querying trusted sources and fact-checking against data in Data Commons. When DataGemma is prompted to generate an answer, the model is trained to identify instances of statistical data and retrieve the answer from Data Commons. While the RIG methodology itself is not new, its application within the DataGemma framework is distinct.
- RAG (Retrieval-Augmented Generation) enables language models to incorporate relevant information beyond their training data, take in additional context, and produce more thorough and informative outputs. DataGemma makes this possible by leveraging the long context window of Gemini 1.5 Pro: before the model begins generating a response, DataGemma retrieves relevant contextual information from Data Commons, reducing the risk of hallucinations and improving answer accuracy.
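The difference between the two techniques can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not the real DataGemma implementation or the Data Commons API: the in-memory statistics table, the `[DC(place, variable)]` marker syntax, and all helper names are hypothetical assumptions made for the sketch. RIG interleaves lookups into an already-drafted answer, while RAG fetches the statistics first and feeds them to the model as context.

```python
import re

# Hypothetical in-memory stand-in for Data Commons (illustrative values only).
MOCK_DATA_COMMONS = {
    ("California", "population"): 39_000_000,
    ("California", "median_income_usd"): 84_097,
}

def mock_lookup(place: str, variable: str):
    """Stand-in for a Data Commons statistical-variable query."""
    return MOCK_DATA_COMMONS.get((place, variable))

def rig_generate(draft: str) -> str:
    """RIG sketch: the model emits inline [DC(place, variable)] markers where
    it would state a statistic; each marker is replaced by the value fetched
    from (mock) Data Commons, so numbers come from data, not model memory."""
    def substitute(match: re.Match) -> str:
        value = mock_lookup(match.group(1), match.group(2))
        return f"{value:,}" if value is not None else "[no data]"
    return re.sub(r"\[DC\((\w+), (\w+)\)\]", substitute, draft)

def rag_generate(question: str, places, variables) -> str:
    """RAG sketch: retrieve relevant statistics *before* generation and
    prepend them as context. A real system would pass this prompt to a
    long-context model such as Gemini 1.5 Pro."""
    context = "\n".join(
        f"{p} {v}: {mock_lookup(p, v)}"
        for p in places for v in variables
        if mock_lookup(p, v) is not None
    )
    return f"Context:\n{context}\n\nQuestion: {question}"
```

For example, `rig_generate("California has [DC(California, population)] residents.")` substitutes the retrieved figure into the draft, while `rag_generate` builds a context-augmented prompt before any text is generated.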
Future directions and promising outcomes
Although early, Google's initial results with RIG and RAG are promising. Its language models show notable accuracy improvements when handling numerical information, suggesting that users will encounter fewer hallucinations whether they are doing research, making decisions, or simply satisfying their curiosity.
This research is ongoing, and Google is committed to refining these approaches further as it scales up the work, subjects it to rigorous testing, and ultimately integrates this enhanced capability into both the Gemma and Gemini models, initially through a phased, limited-access rollout.
By sharing its findings and releasing this latest Gemma variant as an open model, Google hopes to encourage broader adoption of these Data Commons-based techniques for grounding LLMs in factual data. Making LLMs more reliable and trustworthy is essential to building a future where AI gives people accurate information, supports informed decision-making, and fosters a deeper understanding of the world around us.
Researchers and developers can get started with DataGemma using the quickstart notebooks for both the RIG and RAG techniques.