Friday, March 28, 2025

Data Quality & Data Handling In Data Engineering With Gen AI

Generative AI models are subtly changing the way data engineering teams manage, analyse, and use data. Large language models (LLMs), for instance, can assist with data generation, data quality, and data schema handling.

This post covers automated approaches to schema management, data quality, and the creation of synthetic and structured data from various sources. It builds on the recently released Gemini in BigQuery data preparation capabilities and offers practical examples and code snippets.

Data schema handling: Integrating new datasets

For any data engineering team, moving and maintaining data is a constant challenge. Whether it involves transferring data across systems with different schemas or incorporating new information into existing data products, the process can be difficult and error-prone. This is frequently made worse when working with legacy systems; according to Flexera’s 2024 State of the Cloud Report, 32% of organisations say that migrating data and applications is their biggest difficulty.

Gen AI models offer a powerful remedy by helping to automate schema mapping and transformation. Consider migrating customer data from a legacy CRM system to a new platform and merging it with other external datasets in BigQuery. The schemas probably differ significantly, requiring complex field and data-type mapping. By analysing both schemas and producing the required transformation logic, Gemini, Google Cloud’s most capable AI model family to date, can drastically cut down on manual labour and the risk of mistakes.

A typical approach Google Cloud has observed from data engineering teams is a lightweight application that receives messages from Pub/Sub, retrieves the relevant dataset metadata from BigQuery and Cloud Storage, and uses the Vertex AI Gemini API to map source fields to target fields and assign each mapping a confidence score.

Gemini gives each mapping a confidence score, which is then saved in BigQuery. Once the mappings are in BigQuery, the data engineering team can examine the low-confidence ones and confirm the high-confidence ones; if they feel comfortable doing so, they can eventually decide to automate the process fully. This pipeline of gen AI activities can run as either a batch or an event-driven architecture. However, given the rapid release cadence of gen AI models, a final stage where a human approves the output is typically necessary; over time, this too might become fully automated. An example architecture:

[Figure: example schema-mapping architecture. Image credit: Google Cloud]
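The review step above can be sketched in a few lines. This is a minimal, hypothetical routing function (field names, threshold, and the mapping structure are all illustrative, not part of any Google Cloud API): given Gemini's proposed mappings with confidence scores, it decides which to apply automatically and which to send for human review.

```python
# Hypothetical routing step: split Gemini-proposed schema mappings into
# auto-approved and needs-review buckets based on a confidence threshold.

CONFIDENCE_THRESHOLD = 0.9  # assumption: tune per team and use case


def route_mappings(mappings, threshold=CONFIDENCE_THRESHOLD):
    """Each mapping is a dict like:
    {"source_field": "cust_nm", "target_field": "customer_name",
     "confidence": 0.97}
    """
    auto_approved, needs_review = [], []
    for m in mappings:
        (auto_approved if m["confidence"] >= threshold else needs_review).append(m)
    return auto_approved, needs_review


mappings = [
    {"source_field": "cust_nm", "target_field": "customer_name", "confidence": 0.97},
    {"source_field": "ph_no", "target_field": "phone_number", "confidence": 0.95},
    {"source_field": "misc_1", "target_field": "notes", "confidence": 0.42},
]

auto, review = route_mappings(mappings)
print(len(auto), len(review))  # → 2 1
```

In a real pipeline both buckets would be written back to BigQuery, with the review bucket surfaced to the data engineering team.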

Data quality: Enhancing accuracy and consistency

In a data-driven world, low-quality data can cost companies millions of dollars. Bad data has serious repercussions, from erroneous customer insights that lead to misdirected marketing efforts to flawed financial reporting that influences investment decisions. Gen AI models offer a fresh perspective on data quality that goes beyond conventional rule-based systems, spotting subtle discrepancies that can seriously disrupt your data pipelines. Imagine, for instance, a system that can automatically identify and fix mistakes that would normally require hours of manual inspection or the development of complex regular expressions.

There are several ways in which Gemini can enhance your existing data quality checks:

Deduplication

Imagine that you have to eliminate duplicate customer profiles. Gemini can find duplicate names, addresses, and phone numbers despite minor phrasing or formatting variations: it knows that “123 Main St.” and “123 Main Street” are the same, or that “Robert Smith” and “Bob Smith” may be linked. An LLM can offer a simpler and more effective solution than conventional techniques like fuzzy matching, which are difficult to write and don’t always yield the best results.
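To see why the conventional route is brittle, here is a hand-rolled normalisation baseline (the abbreviation and nickname maps are illustrative and far from complete). Every rule must be maintained by hand, which is exactly the kind of logic an LLM prompt can replace.

```python
# Conventional deduplication baseline: normalise records before comparing.
# Every map entry below is hand-maintained; an LLM handles such variants
# without enumerating them.

ABBREVIATIONS = {"st.": "street", "st": "street", "ave.": "avenue", "ave": "avenue"}
NICKNAMES = {"bob": "robert", "kate": "katherine", "catherine": "katherine"}


def normalise(record):
    name = " ".join(NICKNAMES.get(w, w) for w in record["name"].lower().split())
    address = " ".join(ABBREVIATIONS.get(w, w) for w in record["address"].lower().split())
    phone = record["phone"].replace("-", "").replace(" ", "")
    return (name, address, phone)


a = {"name": "Robert Smith", "address": "123 Main St.", "phone": "555-0100"}
b = {"name": "Bob Smith", "address": "123 Main Street", "phone": "5550100"}
print(normalise(a) == normalise(b))  # → True
```

The baseline matches these two records, but any variant not in the maps slips through; with Gemini, the same matching can be expressed as a single prompt.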

Standardization

Gemini is very good at standardising data types. Instead of relying on complex regular expressions to validate data, Gemini can be used with prompt engineering, RAG, or fine-tuning to understand and enforce data quality criteria in a more maintainable and human-readable way. This is especially helpful for fields where format differences can make analysis difficult, such as dates, times, and addresses.
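Date fields illustrate the maintenance burden well. A conventional standardiser must enumerate every format it expects (the list below is illustrative); each new variant in the data means another entry, whereas a prompt like "convert this date to YYYY-MM-DD" lets Gemini absorb unseen variants.

```python
# Conventional date standardisation: try a fixed list of known formats.
# Rows that match none of them would be the candidates to send to Gemini.
from datetime import datetime

KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y"]


def standardise_date(value):
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unknown format: hand off to the LLM step


print(standardise_date("March 28, 2025"))  # → 2025-03-28
print(standardise_date("28 Mar 2025"))     # → 2025-03-28
```

A hybrid design like this keeps cheap deterministic parsing for the common cases and reserves LLM calls for the long tail.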

Subtle error detection

Gemini is able to spot minor discrepancies that conventional techniques might overlook. These include:

  • Abbreviation differences (e.g., “St.” vs. “Street”)
  • The same name spelt differently (for example, “Catherine” versus “Katherine”)
  • Using nicknames (such as “Bob” as opposed to “Robert”)
  • Phone numbers with incorrect formatting (such as missing area codes)
  • Using punctuation and capitalisation inconsistently

Let’s use address validation as a typical example. Suppose you want to determine whether the address_state field in a customer_addresses table contains a legitimate US state, and convert it into the standard two-letter abbreviation:

[Table: sample customer_addresses input data. Image credit: Google Cloud]

Looking at the input data, you can quickly spot problems with the address_state column. For instance, ‘Texas’ is written out in full rather than using the standard two-letter abbreviation, while ‘Pennsylvaniaa’ is misspelt. Because traditional data quality methods rely on exact matches or strict rules, they may overlook these small deviations, even though the mistakes are obvious to a human.

Gemini, however, is well suited to this task because of its strong comprehension of human language. Going beyond rigid rules and adapting to the subtleties of human language, Gemini can reliably detect these discrepancies and standardise the state names into the correct format with a straightforward prompt.

You can use Gemini in BigQuery for this task through the BigQuery ML function ML.GENERATE_TEXT, which enables you to run gen AI operations on data stored in BigQuery via a remote connection to Gemini hosted on Vertex AI:
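The original query is not reproduced here, so the following is only a sketch of what such a call can look like. It assumes a BigQuery remote model named `my_dataset.gemini_model` has already been created over a Vertex AI connection; the model, dataset, and prompt wording are all illustrative.

```sql
-- Sketch only: assumes remote model `my_dataset.gemini_model` exists and
-- points at a Gemini endpoint on Vertex AI.
SELECT
  address_state,
  ml_generate_text_llm_result AS gemini_response
FROM
  ML.GENERATE_TEXT(
    MODEL `my_dataset.gemini_model`,
    (
      SELECT
        address_state,
        CONCAT(
          'Return JSON with keys input, standardized, is_valid. ',
          'Standardize this US state to its two-letter abbreviation: ',
          address_state
        ) AS prompt
      FROM `my_dataset.customer_addresses`
    ),
    STRUCT(0.0 AS temperature, TRUE AS flatten_json_output)
  );
```

Setting `temperature` to 0 keeps the output deterministic, and `flatten_json_output` returns the generated text as a plain string column rather than the full nested response.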

This code sends each address_state value to Gemini along with a prompt asking it to standardise and validate the input. Gemini then returns a JSON response containing the original input, the standardised output, and a boolean indicating whether the state is valid:

[Figure: example JSON responses returned by Gemini. Image credit: Google Cloud]

In this case, Gemini has simplified and automated the data quality process while also lowering the code’s complexity. The validation output is shown in the first column. With a straightforward instruction, it can accurately determine which rows have an incorrect state value and format the state column. The more conventional approach would have required several SQL statements, external APIs, or joins with lookup tables.

The example above only scratches the surface of how Gemini can enhance data quality. Modern AI models are also good at tasks more complex than standardisation and validation. For example, they can handle mixed-language text fields by identifying linguistic inconsistencies, and categorise data problems by severity (low, medium, high) for prioritised action.

Important considerations for large datasets

Sending individual requests to an LLM like Gemini can become inefficient and may exceed usage quotas when working with massive datasets. Make sure your Google Cloud project has sufficient API quota, and consider batching requests to maximise performance and control costs.
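The batching itself is simple. This is a minimal sketch (the batch size is an illustrative assumption; real limits depend on your model's context window and your project's quota): group rows into fixed-size batches so many values travel in one request.

```python
# Batch many values into one request instead of one request per row.

BATCH_SIZE = 50  # assumption: tune to your quota and context window


def make_batches(values, batch_size=BATCH_SIZE):
    """Yield successive fixed-size batches from a list of values."""
    for i in range(0, len(values), batch_size):
        yield values[i:i + batch_size]


states = [f"state_{i}" for i in range(120)]
batches = list(make_batches(states))
print(len(batches), len(batches[0]), len(batches[-1]))  # → 3 50 20
```

Each batch would then be joined into a single prompt (one value per line, say), turning 120 API calls into 3.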

Data generation: Unlocking insights from unstructured data

Unstructured data such as images, videos, and PDFs contains important information that has historically been difficult to convert into structured form. Gemini’s multimodal capabilities and industry-leading context window of up to 2 million tokens make it possible to extract structured data from these sources for downstream use.

Some gen AI models are unreliable and prone to hallucination, which hampers consistent data processing. In practice, you can address this with Gemini’s system instructions, controlled generation, grounding, and the Vertex AI evaluation service. System instructions direct the model’s behaviour, while controlled generation enforces structured outputs that follow a predetermined schema, directing the model to output in a specific format such as JSON.

Evaluation enables you to automate the selection of the best response by providing quality metrics and explanations. Finally, grounding ties the output to current private or public facts, lowering the possibility of the model fabricating content. To help guarantee consistency and reliability in business applications, the model’s structured output can then be used in data pipelines and machine learning processes, or connected with BigQuery for downstream analysis.
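Controlled generation enforces a schema on the server side; the consuming pipeline still benefits from checking the contract it relies on. This is a local stand-in validator (the schema and field names are illustrative, echoing the address example above, and are not any official Vertex AI structure):

```python
# Local stand-in for the schema contract that controlled generation
# enforces server-side: check a parsed JSON response has exactly the
# expected keys with the expected types.

RESPONSE_SCHEMA = {
    "input": str,
    "standardized": str,
    "is_valid": bool,
}


def matches_schema(response, schema=RESPONSE_SCHEMA):
    return (set(response) == set(schema)
            and all(isinstance(response[k], t) for k, t in schema.items()))


good = {"input": "Texas", "standardized": "TX", "is_valid": True}
bad = {"input": "Texas", "standardized": 42}  # wrong type, missing key
print(matches_schema(good), matches_schema(bad))  # → True False
```

A check like this makes schema drift fail loudly in the pipeline instead of silently corrupting downstream tables.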

Selecting the appropriate model for a given job involves additional considerations. For instance, Gemini Pro’s 2M-token context window may be necessary for longer videos or large unstructured documents, whereas Gemini Flash’s 1M-token context window might be sufficient for other applications.
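That decision can be made mechanically. The sketch below is illustrative only: the context limits mirror the figures in the text, the model names are placeholders, and the four-characters-per-token heuristic is a rough assumption, not an official tokeniser.

```python
# Illustrative model routing by estimated input size.

FLASH_LIMIT = 1_000_000  # ~1M tokens, per the text
PRO_LIMIT = 2_000_000    # ~2M tokens, per the text


def pick_model(char_count, chars_per_token=4):
    """Route input to a model tier based on a rough token estimate."""
    tokens = char_count // chars_per_token
    if tokens <= FLASH_LIMIT:
        return "gemini-flash"
    if tokens <= PRO_LIMIT:
        return "gemini-pro"
    return "chunk-input"  # too large even for Pro: split the input


print(pick_model(2_000_000))  # ~500k tokens → gemini-flash
print(pick_model(6_000_000))  # ~1.5M tokens → gemini-pro
```

In production you would use the model API's own token-counting endpoint rather than a character heuristic.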

Gemini can also be used to create synthetic data that replicates real-world situations, augmenting your datasets and improving model performance. Synthetic data is artificially created data that statistically resembles real-world data while protecting privacy by excluding personally identifiable information (PII). With this approach, businesses can build reliable machine learning models and data-driven insights without the risks and constraints that come with using real-world data. Synthetic data is becoming increasingly popular because it can address privacy concerns, mitigate data shortages, and simplify test data generation across many sectors.
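For a sense of what "shape without PII" means, here is a deliberately naive sketch using only the standard library (all name pools and the phone format are made up). In practice you would prompt Gemini to produce far richer and more statistically realistic records.

```python
# Minimal synthetic-data sketch: records that mimic the shape of customer
# data while containing no real PII. Pools and formats are illustrative.
import random

FIRST = ["Alex", "Sam", "Jordan", "Taylor"]
LAST = ["Smith", "Jones", "Garcia", "Chen"]
STATES = ["TX", "PA", "CA", "NY"]


def synthetic_customer(rng):
    return {
        "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
        "state": rng.choice(STATES),
        "phone": f"555-{rng.randint(0, 9999):04d}",  # fictional 555 prefix
    }


rng = random.Random(42)  # seeded for reproducible test data
rows = [synthetic_customer(rng) for _ in range(3)]
print(len(rows), all(r["state"] in STATES for r in rows))  # → 3 True
```

Seeding the generator makes the synthetic dataset reproducible, which matters when it feeds automated tests.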

Going to production: DataOps and the LLM pipeline

Once your LLM-powered data engineering solutions have proven themselves, you are ready to move them into your production environment. The following are some issues you may need to address:

Scheduling and automation

To help guarantee continuous data processing and analysis, use tools like Cloud Composer or Vertex AI Pipelines to schedule and automate gen AI operations.

Model monitoring and evaluation

Put an evaluation pipeline in place to track the performance of your gen AI models, so you can check accuracy, spot biases, and trigger retraining when needed.
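The core of such a monitoring check fits in a few lines. This sketch (threshold and sample data are illustrative) compares model outputs against a labelled sample and raises a flag when accuracy drops:

```python
# Monitoring sketch: flag the pipeline when accuracy on a labelled sample
# falls below a threshold. Threshold and data are illustrative.

ACCURACY_THRESHOLD = 0.95


def needs_attention(predictions, labels, threshold=ACCURACY_THRESHOLD):
    """Return (accuracy, flag); flag is True when accuracy < threshold."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    accuracy = correct / len(labels)
    return accuracy, accuracy < threshold


preds = ["TX", "PA", "CA", "NY", "CA"]
gold = ["TX", "PA", "CA", "NY", "WA"]
acc, flag = needs_attention(preds, gold)
print(acc, flag)  # → 0.8 True
```

Wired into a scheduled job, the flag would page the team or trigger a prompt review rather than silently letting quality degrade.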

Version control

Treat Gemini prompts and configurations like code: use version control systems to track changes and guarantee repeatability.
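One lightweight way to make prompt versions traceable is to log a content hash next to every model output, so any result can be tied back to the exact prompt that produced it. The prompt text here is illustrative.

```python
# Treating prompts like code: a content hash gives each prompt version a
# stable identifier you can log alongside every model output.
import hashlib

PROMPT_V1 = "Standardize this US state to its two-letter abbreviation: {value}"


def prompt_fingerprint(prompt_template):
    """Short, stable identifier for a prompt version."""
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]


fp = prompt_fingerprint(PROMPT_V1)
print(len(fp))  # → 12
```

Editing even one character of the prompt changes the fingerprint, so outputs produced by different prompt versions can never be silently conflated.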

Transform your data engineering processes with gen AI

With strong capabilities for managing schemas, improving data quality, creating synthetic data, and extracting data from unstructured sources, gen AI is revolutionising the field of data engineering. By embracing these developments and applying DataOps principles, you can unlock new levels of accuracy, efficiency, and insight from your data. Start experimenting with Gemini in your own data pipelines to discover the possibilities for improved business outcomes, insights from new data sources, and more consistent data processing.

Drakshi
Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She is a postgraduate in business administration and an enthusiast of Artificial Intelligence.