Friday, November 8, 2024

NL2SQL With Gemini And BigQuery: A Step-by-Step Guide

- Advertisement -

Conversion of natural language to SQL

Beginning to use Gemini and BigQuery for NL2SQL (natural language to SQL)

The intriguing new technology known as Natural Language to SQL, or NL2SQL, was created by combining the classic Structured Query Language (SQL) with Natural Language Processing (NLP). It converts inquiries written in common human language into structured SQL queries.

- Advertisement -

The technology has enormous potential to change how we engage with data, which is not surprising.

With the help of NL2SQL, non-technical users like marketers, business analysts, and other subject matter experts can engage with databases, examine data, and obtain insights independently without requiring specific SQL expertise. Even SQL experts can save time by using NL2SQL to create sophisticated queries, which allows them to devote more time to strategic analysis and decision-making.

On the ground, how does that appear? Imagine having instant access to a chat interface where you can ask inquiries and receive real-time replies, or

  • “How many units were sold overall this month?”
  • “What are the main factors influencing the shift in APAC sales when comparing Q1 and Q2 sales?”

In the past, this would have required an expert to extract information from databases and turn it into business insights. By lowering obstacles to data access, it can democratize analytics by utilizing NL2SQL.

- Advertisement -

However, a number of obstacles prevent NL2SQL from being extensively used. We’ll look at NL2SQL solutions on Google Cloud and implementation best practices in this blog.

Data quality issues in practical applications

Let us first examine some of the factors that contribute to the difficulty of implementing NL2SQL.

Real-world production data poses a number of difficulties, even if NL2SQL performs best in controlled settings and straightforward queries. These difficulties include:

Data formatting variations: The same information can be expressed in a variety of ways, such as “Male,” “male,” or “M” for gender, or “1000,” “1k,” or “1000.0” for monetary amounts. Additionally, many organizations use poorly defined acronyms of their own.

Semantic ambiguity: Large Language Models (LLMs) frequently lack domain-specific schema comprehension, which results in semantic ambiguity. This can cause user queries to be misinterpreted, for example, when the same column name has many meanings.

Syntactic rigidity: If semantically correct queries don’t follow SQL’s stringent syntax, they may fail.

Unique business metrics: NL2SQL must manage intricate business computations and comprehend table relationships via foreign keys. To translate the question effectively, one must have a sophisticated understanding of the tables that need to be connected and modeled together. Additionally, there is no one standard approach to determine the business KPIs that each corporation should use in the final narrative report.

Client difficulties

Users’ questions are frequently unclear or complicated, so it’s not only the data that can be unclear or poorly formatted. These three frequent issues with user inquiries may make NL2SQL implementation challenging.

Ambiguous questions: Even questions that appear to be clear-cut can be unclear. For example, a query looking for the “total number of sold units month to date” may need to specify which date field to use and whether to use average_total_unit or running_total_unit, etc. The perfect NL2SQL solution will actively ask the user to select the correct column and use their input when creating the SQL query.

Underspecified questions: Another issue is queries that are not detailed enough. For example, a user’s question concerning “the return rate of all products under my team in Q4” does not provide enough details, such as which team should fully grasp the question. An optimal NL2SQL solution should identify areas of ambiguity in the initial input and ask follow-up questions to obtain a comprehensive representation of the query.

Complex queries that require a multi-step analysis: Numerous questions require several stages of analysis. Consider figuring out the main causes of variations in sales from quarter to quarter, for instance: A good NL2SQL solution should be able to deconstruct the study into digestible parts, produce interim summaries, and then create a thorough final report that answers the user’s question.

Dealing with the difficulties

In order to address these issues, Google designed Gemini Flash 1.5 as a routing agent that can categorize queries according to their level of complexity. It can enhance its results by applying methods like contribution analysis models, ambiguity checks, vector embeddings, and semantic searches after the question has been classified.

It reacts to instructions in a JSON format using Gemini. Gemini can act as a routing agent, for instance, by responding to the few-shot prompt that follows:

 JSON format using Gemini
Image credit to Gogole Cloud

Direct inquiries

The right column names in scope can be clarified for direct inquiries by utilizing in-context learning, draft SQL ambiguity checks, and user feedback loops. Additionally, simple questions can be guaranteed to generate SQL that is clear.

For straightforward inquiries, its method does the following:

  • Gathers quality question/SQL pairings.
  • Keeps samples in BigQuery rows.
  • Enables the question to have vector embeddings
  • Leverages BigQuery vector search to extract related examples based on the user’s query.
  • Adds the table structure, question, and example as the LLM context.
  • Produces a draft SQL
  • Executes a loop that includes a SQL ambiguity check, user feedback, refinement, and syntax validation.
  • Performs the SQL
  • Uses natural language to summarize the data.

Gemini appears to perform well on tasks that check for SQL ambiguity, according to its heuristic testing. Google started by creating a draft SQL model that had all of the table structure and context-related questions. This allowed Gemini to ask the user follow-up questions to get clarification.

Key driver analysis

Key driver analysis is another name for multi-step reasoning-based data analysis in which analysts must separate and organize data according to every possible combination of attributes (e.g., product categories, distribution channels, and geographies). Google suggests combining Gemini and BigQuery contribution analysis for this use case.

Key driver analysis adds the following steps to the ones done with direct questions:

  • The routing agent refers users to a key driver analysis special handling page when they ask a query about it.
  • From ground truth stored in a BigQuery vector database, the agent retrieves similar question/SQL embedding pairings using BigQuery ML vector search.
  • After that, it creates and verifies the CREATE MODEL statement in order to construct a report on contribution analysis.
  • Lastly, the SQL that follows is executed in order to obtain the contribution analysis report:
  • The final report appears as follows:
key driver analysis
Image credit to Google Cloud

With Gemini, you can further condense the report in natural language:

report in natural language using Gemini
Image credit to Google Cloud

Implementing NL2SQL on Google Cloud

Even though this can sound difficult, Google Cloud provides a comprehensive set of tools to assist you in putting an effective NL2SQL solution into place. Let’s examine it.

BigQuery vector search is used for embedding and retrieval

By using BigQuery for embedding storage and retrieval, it is possible to quickly find instances and context that are semantically meaningful for better SQL production. Vertex AI’s text embedding API or BigQuery’s ML.GENERATE_EMBEDDING function can be used to create embeddings. It is simple to match user queries and SQL pairs when BigQuery is used as a vector database because of its inherent vector search.

Contribution analysis using BigQuery

Contribution analysis modeling can find statistically significant differences throughout a dataset, including test and control data, to identify areas of the data that are producing unanticipated changes. A section of the data based on a combination of dimension values is called a region.

To help answer “why?” questions, the recently unveiled contribution analysis preview from BigQuery ML enables automated explanations and insight development of multi-dimensional data at scale.Regarding your data, “What happened?” and “What’s changed?”

The contribution analysis models in BigQuery, in summary, facilitate the generation of many queries using NL2SQL, hence increasing overall efficiency.

Ambiguity checks with Gemini 

The process of translating natural language inquiries into structured SQL queries is known as NL2SQL, and it is often unidirectional. Gemini can assist in lowering ambiguity and enhancing the output statements in order to boost performance.

When a question, table, or column schema is unclear, you may utilize Gemini 1.5 Flash to get user input by asking clarifying questions. This will help you improve and refine the SQL query that is produced. Additionally, Gemini and in-context learning can be used to expedite the creation of SQL queries and results summaries in natural language.

The suggested architectural design

Implementing NL2SQL on Google Cloud architecture
Image credit to Google Cloud

Top NL2SQL techniques

For an advantage in your own NL2SQL endeavor, take a look at the following advice.

Start by determining which questions require attention: Depending on the final report’s goal, answering a question may seem straightforward, but getting the intended response and storyline frequently requires several steps of reasoning. Before your experiment, gather the expected natural language ground truth, SQL, and your query.

Data purification and preparation are essential, and using LLMs does not replace them. As needed, establish new table views and make sure that useful descriptions or metadata are used in place of business domain acronyms. Before going on to more complicated join-required questions, start with straightforward ones that just need one table.

Practice iteration and SQL refinement with user feedback: Google’s heuristic experiment demonstrates that iteration with feedback is more effective following the creation of an initial draft of your SQL.

For queries with multiple steps, use a custom flow: Multi-dimensional data explanations and automated insight development can be made possible by BigQuery contribution analysis models.

Next up?

A big step toward making data more accessible and useful for everyone is the combination of NL2SQL, LLMs, and data analytic methods. Enabling users to communicate with databases through natural language can democratize data access and analysis, opening up improved decision-making to a larger group of people in every company.

Data, size, and value can now be rationalized more easily than ever thanks to exciting new innovations like BigQuery contribution analysis and Gemini.

- Advertisement -
Drakshi
Drakshi
Since June 2023, Drakshi has been writing articles of Artificial Intelligence for govindhtech. She was a postgraduate in business administration. She was an enthusiast of Artificial Intelligence.
RELATED ARTICLES

Recent Posts

Popular Post

Govindhtech.com Would you like to receive notifications on latest updates? No Yes