BigQuery to Bigtable
AI and real-time data integration in today’s applications have brought data analytics platforms like BigQuery into operational systems, blurring the lines between databases and analytics. Customers prefer BigQuery for effortlessly integrating many data sources, enriching data with AI and ML, and directly manipulating warehouse data with Pandas. They also say they need to make BigQuery pre-processed data available for quick retrieval in an operational system that can handle big datasets with millisecond query performance.
The EXPORT DATA to Bigtable (reverse ETL) tool is now generally accessible to bridge analytics and operational systems and provide real-time query latency. Now, anyone who can write SQL can quickly translate their BigQuery analysis into Bigtable’s highly performant data format, access it with single-digit millisecond latency, high QPS, and replicate it globally to be closer to consumers.
Three architectures and use cases that benefit from automated on-demand BigQuery to Bigtable data exports are described in this blog:
- Real-time application serving
- Enriched streaming data for ML
- Backloading data sketches to build real-time metrics that rely on big data
Real-time application serving
Bigtable enhances BigQuery for real-time applications. BigQuery’s storage format optimizes counting and aggregation OLAP queries. BigQuery BI Engine intelligently caches your most frequently used data to speed up ad-hoc analysis for real-time applications. Text lookups using BigQuery search indexes can also find rows without keys that require text filtering, including JSON.
BigQuery, a diverse analytics platform, is not geared for real-time application serving like Bigtable. Multiple columns in a row or range of rows can be difficult to access with OLAP-based storage. Bigtable excels in data storage, making it ideal for operational applications.
If your application needs any of the following, use Bigtable as a serving layer:
- Row lookups with constant and predictable response times in single-digit milliseconds
- High query per second (linearly scales with nodes)
- Application writes with low latency
- Global installations (automatatic data replication near users)
Reverse ETL reduces query latency by effortlessly moving warehouse table data to real-time architecture.
Step 1: Set up Bigtable and service table
Follow the instructions to build a Bigtable instance, a container for Bigtable data. You must choose SSD or HDD storage while creating this instance. SSD is faster and best for production, while HDD can save money if you’re simply learning Bigtable. You create your first cluster when you create an instance. This cluster must be in the same region as the BigQuery dataset you’re loading. However, you can add clusters in other regions that automatically receive data from BigQuery’s writing cluster.
Create your Bigtable table, which is the BigQuery sink in the reverse ETL process, after your instance and cluster are ready. Choose Tables in the left navigation panel and Create Table from the top of the Tables screen from the console.
Simply name the Table ID BQ_SINK and hit create on the Create a Table page. The third step was to enable BigQuery Reverse ETL construct column families.
You can also connect to your instance via CLI and run cbt createtable BQ-SINK.
Step 2: Create a BigQuery Reverse ETL application profile
Bigtable app profiles manage request handling. Consider isolating BigQuery data export in its own app profile. Allow single-cluster routing in this profile to place your data in the same region as BigQuery. It should also be low priority to avoid disrupting your main Bigtable application flow.
This gcloud command creates a Bigtable App Profile with these settings:
gcloud bigtable app-profiles create BQ_APP_PROFILE \
–project=[PROJECT_ID] \
–instance=[INSTANCE_ID]\
–description=”Profile for BigQuery Reverse ETL” \
–route-to=[CLUSTER_IN_SAME_REGION_AS_BQ_DATASET] \
–transactional-writes \
–priority=PRIORITY_LOW
After running this command, Bigtable should show it under the Application profiles area.
Step 3: SQL-export application data
Let’s analyze BigQuery and format the results for its artwork application. BigQuery public datasets’ the_met.objects table will be used. This table contains structured metadata about each Met artwork. It want to create two main art application elements:
- Artist profile: A succinct, structured object with artist information for fast retrieval in our program.
- Gen AI artwork description: Gemini builds a narrative description of the artwork using metadata from the table and Google Search for context.
Gemini in BigQuery setup
For your first time utilizing Gemini with BigQuery, set up the integration. Start by connecting to Vertex AI using these steps. Use the following BigQuery statement to link a dataset model object to the distant Vertex connection:
CREATE MODEL [DATASET].model_cloud_ai_gemini_pro
REMOTE WITH CONNECTION us.bqml_llm_connection
OPTIONS(endpoint = ‘gemini-pro’);
Step 4: GoogleSQL query Bigtable’s low-latency serving table
Its mobile app can use pre-processed artwork data. The Bigtable console’s left-hand navigation menu offers Bigtable Studio and Editor. Use this SQL to test your application’s low-latency serving query.
select _key, artist_info,
generated_description[‘ml_generate_text_llm_result’] as generated_description
from BQ_SINK
This Bigtable SQL statement delivers an artist profile as a single object and a produced text description field, which your application needs. This serving table can be integrated using Bigtable client libraries for C++, C#, Go, Java, HBase, Node.js, PHP, Python, and Ruby.
Enriching streaming ML data using Dataflow and Bigtable
Another prominent use case for BigQuery-Bigtable Reverse ETL is feeding ML inference models historical data like consumer purchase history from Bigtable. BigQuery’s history data can be used to build models for recommendation systems, fraud detection, and more. Knowing a customer’s shopping cart or if they viewed similar items might add context to clickstream data used in a recommendation algorithm. Identification of a fraudulent in-store credit card transaction requires more information than the current transaction, such as the prior purchase’s location, recent transaction count, or travel notice status. Bigtable lets you add historical data to Kafka or PubSub event data in real time at high throughput.
Use Bigtable’s built-in Enrichment transform with Dataflow to do this. You can build these architectures with a few lines of code!
Data sketch backloading
A data sketch is a brief summary of a data aggregation that contains all the information needed to extract a result, continue it, or integrate it with another sketch for re-aggregate. Bigtable’s conflict-free replicated data types (CRDT) help count data across a distributed system in data drawings. This is essential for real-time event stream processing, analytics, and machine learning.
Traditional distributed system aggregations are difficult to manage since speed typically compromises accuracy and vice versa. Distributed counting is efficient and accurate with Bigtable aggregate data types. These customized column families allow each server to update its local counter independently without performance-hindering locks, employing mathematical features to ensure these updates converge to the correct final value regardless of order. These aggregation data types are necessary for fraud detection, personalization, and operational reporting.
These data types seamlessly connect with BigQuery’s EXPORT DATA capability and BigQuery Data Sketches (where the same sketch type is available in Bigtable). This is important if you wish to backload your first application with previous data or update a real-time counter with updates from a source other than streaming ingestion.
Just add an aggregate column family with a command and export the data to leverage this functionality. Sample code from app:
On Bigtable, you may add real-time updates to this batch update and execute the HLL_COUNT.EXTRACT SQL function on the data sketch to estimate artist counts using BigQuery’s historical data.
What next?
Reverse ETL between BigQuery and Bigtable reduces query latency in real-time systems, but more is needed! it is working on real-time architecture data freshness with continuous queries. Continuous queries enable you to duplicate BigQuery data into Bigtable and other sources while in preview. StreamingDataFrames can be used with Python transformations in BigFrames, ready for testing.