BigQuery DataFrames And Gretel Verify Synthetic Data Privacy

November 5, 2024


This guide to synthetic data generation with Gretel and BigQuery DataFrames looks at how combining the two simplifies synthetic data production while maintaining data privacy. In summary, BigQuery DataFrames (BigFrames) is a Python client for BigQuery that offers pandas-compatible APIs, with the computation pushed down to BigQuery.

Gretel provides an extensive toolkit for creating synthetic data using state-of-the-art machine learning methods, including large language models (LLMs). The integration gives users a seamless workflow: data moves from BigQuery to Gretel, and the generated results are returned to BigQuery.

This tutorial covers the technical elements of creating synthetic data to drive AI/ML innovation, along with tips for maintaining high data quality, protecting privacy, and complying with privacy regulations. In Part 1, we de-identify the data from a BigQuery patient records table, and in Part 2, we create synthetic data to be saved back to BigQuery.

Setting the stage: Installation and configuration

You can begin by using BigQuery Studio as the notebook runtime, where BigFrames is already installed. We assume you are familiar with pandas and have a Google Cloud project set up.

Step 1: Install BigQuery DataFrames and the Gretel Python client.
Step 2: Initialize BigFrames and the Gretel SDK: To use Gretel's services, you will need a Gretel API key. One is available from the Gretel console.
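One way to keep the API key out of the notebook itself is to read it from the environment before initializing the SDK. The following is a minimal sketch under that assumption; the variable name `GRETEL_API_KEY` and the helper `load_gretel_api_key` are hypothetical and are not part of the Gretel SDK.

```python
import os


def load_gretel_api_key(env=None, var_name="GRETEL_API_KEY"):
    """Return the API key stored in `var_name`, or raise a clear error.

    `var_name` is a hypothetical name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set {var_name} to the API key from the Gretel console."
        )
    return key
```

The returned string can then be passed to whichever client initializer you use, so the key never appears in a saved notebook cell.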

Part 1: De-identifying and processing data with Gretel Transform v2

De-identifying personally identifiable information (PII) is an essential first step in data anonymization before creating synthetic data. For this and other data processing tasks, Gretel Transform v2 (Tv2) offers a robust and extensible framework.

Tv2 handles large datasets efficiently by combining named entity recognition (NER) capabilities with sophisticated transformation logic. Besides PII de-identification, it can be used for preprocessing, formatting, and data cleaning, which makes it a flexible tool in the data preparation pipeline. Learn more about Gretel Transform v2 in the Gretel documentation.

Step 1: Convert your BigQuery table into a BigFrames DataFrame.
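As a hedged sketch of this step, the helper below reads a table with `bigframes.pandas.read_gbq`. The function names and the project, dataset, and table placeholders are assumptions for illustration, and the sketch assumes credentials and a default project are already configured, as they are in BigQuery Studio.

```python
def qualified_table_id(project, dataset, table):
    """Build the `project.dataset.table` identifier BigQuery expects."""
    parts = (project, dataset, table)
    if not all(parts):
        raise ValueError("project, dataset, and table must be non-empty")
    return ".".join(parts)


def load_patient_table(project, dataset, table):
    """Read a BigQuery table into a BigFrames DataFrame.

    The import is deferred so this module loads even where the
    bigframes package is not installed.
    """
    import bigframes.pandas as bpd  # requires the bigframes package

    return bpd.read_gbq(qualified_table_id(project, dataset, table))
```

A call such as `df = load_patient_table("my-project", "my_dataset", "patient_records")` followed by `df.peek()` would show a few rows; all three names are placeholders.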

The table below shows a portion of the DataFrame we will transform. We hash the patient_id column and generate new first and last names based on the value of the sex column.

patient_id         first_name  last_name  sex     race
pmc-6545753-1      Antonio     Fernandez  Male    Hispanic
pmc-6192350-1      Ana         Silva      Female  Other
pmc-6332555-4      Lina        Chan       Female  Asian
pmc-6089485-1      Omar        Hassan     Male    Black or African American
pmc-6100673-1      Aisha       Khan       Female  Asian

Step 2: Work with Gretel to transform the data.
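The code that submits the transform job to Gretel is not reproduced here. To make the two operations concrete (hashing patient_id and drawing replacement names according to sex), here is a plain-Python sketch. It is not Gretel's implementation: the salt, the 10-character truncation, and the small name pools are assumptions chosen for illustration.

```python
import hashlib
import random

# Small illustrative name pools; a real pipeline would draw from a
# much larger source of fake names.
FIRST_NAMES = {
    "Male": ["John", "Russell"],
    "Female": ["Christine", "Sarah", "Stacy"],
}
ALL_FIRST_NAMES = sorted(n for pool in FIRST_NAMES.values() for n in pool)
LAST_NAMES = ["Hampton", "Carlson", "Moore", "Zhang", "Wilkinson"]


def hash_id(value, salt="", length=10):
    """Return a truncated SHA-256 hex digest of salt + value."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:length]


def deidentify(row, rng, salt=""):
    """Hash patient_id and replace names based on the sex column."""
    out = dict(row)  # copy so the input row is left untouched
    out["patient_id"] = hash_id(row["patient_id"], salt)
    pool = FIRST_NAMES.get(row["sex"], ALL_FIRST_NAMES)
    out["first_name"] = rng.choice(pool)
    out["last_name"] = rng.choice(LAST_NAMES)
    return out


rows = [
    {"patient_id": "pmc-6545753-1", "first_name": "Antonio",
     "last_name": "Fernandez", "sex": "Male", "race": "Hispanic"},
    {"patient_id": "pmc-6192350-1", "first_name": "Ana",
     "last_name": "Silva", "sex": "Female", "race": "Other"},
]
rng = random.Random(0)  # seeded only so the sketch is repeatable
deidentified = [deidentify(r, rng, salt="demo-salt") for r in rows]
```

Salting the hash matters because unsalted hashes of guessable identifiers can be reversed by hashing candidate values; the salt should be kept secret in a real pipeline.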

Step 3: Explore the de-identified data.

# Take a look at the newly transformed BigFrames DataFrame
transformed_df = transform_results.transformed_df
transformed_df.peek()

A comparison of the original and de-identified data may be seen below.

Original:

patient_id         first_name  last_name   sex     race
pmc-6545753-1      Antonio     Fernandez   Male    Hispanic
pmc-6192350-1      Ana         Silva       Female  Other
pmc-6332555-4      Lina        Chan        Female  Asian
pmc-6089485-1      Omar        Hassan      Male    Black or African American
pmc-6100673-1      Aisha       Khan        Female  Asian

De-identified:

patient_id         first_name  last_name  sex     race
389b63f369         John        Hampton    Male    Hispanic
eff31024e6         Christine   Carlson    Female  Other
8af37475b6         Sarah       Moore      Female  Asian
7bd5f08fb8         Russell     Zhang      Male    Black or African American
1628622e23         Stacy       Wilkinson  Female  Asian
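A comparison like this can also be checked programmatically. The sketch below uses the rows from the two tables and assumes they appear in the same order in both; the helper name and the split into identifying and non-identifying columns are choices made for this illustration.

```python
COLUMNS = ("patient_id", "first_name", "last_name", "sex", "race")
IDENTIFYING = {"patient_id", "first_name", "last_name"}

ORIGINAL = [
    ("pmc-6545753-1", "Antonio", "Fernandez", "Male", "Hispanic"),
    ("pmc-6192350-1", "Ana", "Silva", "Female", "Other"),
    ("pmc-6332555-4", "Lina", "Chan", "Female", "Asian"),
    ("pmc-6089485-1", "Omar", "Hassan", "Male", "Black or African American"),
    ("pmc-6100673-1", "Aisha", "Khan", "Female", "Asian"),
]
DEIDENTIFIED = [
    ("389b63f369", "John", "Hampton", "Male", "Hispanic"),
    ("eff31024e6", "Christine", "Carlson", "Female", "Other"),
    ("8af37475b6", "Sarah", "Moore", "Female", "Asian"),
    ("7bd5f08fb8", "Russell", "Zhang", "Male", "Black or African American"),
    ("1628622e23", "Stacy", "Wilkinson", "Female", "Asian"),
]


def check_deidentification(original, transformed):
    """Return a list of problems found when comparing rows pairwise."""
    problems = []
    for i, (before, after) in enumerate(zip(original, transformed)):
        for col, old, new in zip(COLUMNS, before, after):
            if col in IDENTIFYING and old == new:
                problems.append(f"row {i}: {col} was not changed")
            if col not in IDENTIFYING and old != new:
                problems.append(f"row {i}: {col} changed unexpectedly")
    return problems


problems = check_deidentification(ORIGINAL, DEIDENTIFIED)
```

An empty `problems` list means every identifying column changed while sex and race were carried over unchanged.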

Part 2: Generating synthetic data with Navigator Fine Tuning (LLM-based)

Gretel Navigator Fine Tuning (NavFT) fine-tunes pre-trained models on your datasets to generate high-quality, domain-specific synthetic data. Key features include:

Handles a variety of data formats, including time series, JSON, free text, categorical, and numerical.
Maintains complex relationships across rows and data types.
Can introduce meaningful new patterns, which may improve the performance of ML/AI tasks.
Balances data utility with privacy protection.

By utilizing the advantages of domain-specific pre-trained models, NavFT expands on Gretel Navigator’s capabilities and makes it possible to create synthetic data that captures the subtleties of your particular data, such as the distributions and correlations for numeric, categorical, and other column types.

In this example, we fine-tune a Gretel model on the de-identified data from Part 1.

Step 1: Fine-tune a model:

Step 2: Retrieve the Gretel Synthetic Data Quality Report:

# Display the full report within this notebook
train_results.report.display_in_notebook()
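Alongside the report, a simple independent check can show what "preserving distributions" means for a single categorical column. The sketch below computes the total variation distance between a real and a synthetic sample. It is an illustrative metric chosen for this example, not the scoring method behind Gretel's report.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Return the TVD between two categorical samples.

    0.0 means identical category frequencies; 1.0 means no overlap.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real = Counter(real_values)
    synth = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real) | set(synth)
    # Counter returns 0 for categories missing from one sample.
    return 0.5 * sum(
        abs(real[c] / n_real - synth[c] / n_synth) for c in categories
    )
```

Applied to a column such as sex or race, values close to 0 suggest the synthetic data reproduces the original category frequencies.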

Step 3: Generate synthetic data with the fine-tuned model, assess the quality and privacy of the data, and then write the results back to a BigQuery table.
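One basic privacy check that can be run before writing results back is to count how many synthetic rows replicate a training row exactly. The following is a minimal sketch of that idea, not a substitute for Gretel's privacy assessment; the function name is an assumption.

```python
def exact_match_rate(training_rows, synthetic_rows):
    """Fraction of synthetic rows that replicate a training row exactly."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must be non-empty")
    seen = {tuple(row) for row in training_rows}
    copies = sum(1 for row in synthetic_rows if tuple(row) in seen)
    return copies / len(synthetic_rows)
```

A rate of 0.0 means no synthetic row is a verbatim copy of a training row. It does not rule out near-duplicates, which need a distance-based check.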

A few things to note about the synthetic data:

The different modalities (free text, JSON structures) are fully synthetic, semantically accurate, and preserved in the output.
The data are grouped by patient during generation because of the group-by/order-by hyperparameters that were used during fine-tuning.
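To illustrate what grouping by patient with an ordering inside each group means, the sketch below arranges records in plain Python. It only mirrors the idea behind the group-by/order-by hyperparameters; the function and the `visit` column in the usage note are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_and_order(records, group_key, order_key):
    """Group records by `group_key`, ordering each group by `order_key`."""
    # groupby needs its input sorted by the grouping key first.
    ordered = sorted(records, key=itemgetter(group_key, order_key))
    return {
        key: list(rows)
        for key, rows in groupby(ordered, key=itemgetter(group_key))
    }
```

For example, `group_and_order(records, "patient_id", "visit")` returns one ordered list of records per patient, so each patient's rows stay together and in sequence.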

How to use BigQuery with Gretel

This technical guide offers a starting point for creating and using synthetic data with Gretel and BigQuery DataFrames. By exploring the Gretel documentation and working through these examples, you can use synthetic data to improve your data science, analytics, and AI development workflows while maintaining data privacy and compliance.
