Thursday, December 19, 2024

BigQuery Tables For Apache Iceberg Optimize Open Lakehouse

- Advertisement -

BigQuery tables

Optimized storage for the open lakehouse using BigQuery tables for Apache Iceberg
BigQuery native tables have been supporting enterprise-level data management features including streaming ingestion, ACID transactions, and automated storage optimizations for a number of years. Open-source file formats like Apache Parquet and table formats like Apache Iceberg are used by many BigQuery clients to store data in data lakes.

Google Cloud introduced BigLake tables in 2022 so that users may take advantage of BigQuery’s security and speed while keeping a single copy of their data. BigQuery clients must manually arrange data maintenance and conduct data changes using external query engines since BigLake tables are presently read-only. The “small files problem” during ingestion presents another difficulty. Table writes must be micro-batched due to cloud object storage’ inability to enable appends, necessitating trade-offs between data integrity and efficiency.

- Advertisement -

Google Cloud provides the first look at BigQuery tables for Apache Iceberg, a fully managed storage engine from BigQuery that works with Apache Iceberg and offers capabilities like clustering, high-throughput streaming ingestion, and autonomous storage optimizations. It provide the same feature set and user experience as BigQuery native tables, but they store data in customer-owned cloud storage buckets using the Apache Iceberg format. Google’s are bringing ten years of BigQuery developments to the lakehouse using BigQuery tables for Apache Iceberg.

lakehouse using BigQuery tables for Apache Iceberg
Image Credit To Google Cloud

BigQuery’s Write API allows for high-throughput streaming ingestion from open-source engines like Apache Spark, and BigQuery tables for Apache Iceberg may be written from BigQuery using the GoogleSQL data manipulation language (DML). This is an example of how to use clustering to build a table:

CREATE TABLE mydataset.taxi_trips
CLUSTER BY vendor_id, pickup_datetime
WITH CONNECTION us.myconnection
OPTIONS (
storage_uri=’gs://mybucket/taxi_trips’,
table_format=’ICEBERG’,
file_format=’PARQUET’
)
AS SELECT * FROM bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2020;

Fully managed enterprise storage for the lakehouse

Drawbacks of BigQuery tables for Apache Iceberg

The drawbacks of open-source table formats are addressed by BigQuery tables for Apache Iceberg. BigQuery handles table-maintenance duties automatically without requiring client labor when using BigQuery tables for Apache Iceberg. BigQuery automatically re-clusters data, collects junk from files, and combines smaller files into appropriate file sizes to keep the table optimized.

- Advertisement -

For example, the size of the table is used to adaptively decide the ideal file sizes. BigQuery tables for Apache Iceberg take use of more than ten years of experience in successfully and economically managing automatic storage optimization for BigQuery native tables. OPTIMIZE and VACUUM do not need human execution.

BigQuery tables for Apache Iceberg use Vortex, an exabyte-scale structured storage system that drives the BigQuery storage write API, to provide high-throughput streaming ingestion. Recently ingested tuples are persistently stored in a row-oriented manner in BigQuery tables for Apache Iceberg, which regularly convert them to Parquet. The open-source Spark and Flink BigQuery connections provide parallel readings and high-throughput ingestion. You may avoid maintaining custom infrastructure by using Pub/Sub and Datastream to feed data into BigQuery tables for Apache Iceberg.

Advantages of using BigQuery tables for Apache Iceberg

Table metadata is stored in BigQuery’s scalable metadata management system for Apache Iceberg tables. BigQuery handles metadata via distributed query processing and data management strategies, and it saves fine-grained information. since of this, BigQuery tables for Apache Iceberg may have a greater rate of modifications than table formats since they are not limited by the need to commit the information to object storage. The table information is tamper-proof and has a trustworthy audit history since authors are unable to directly alter the transaction log.

While expanding support for governance policy management, data quality, and end-to-end lineage via Dataplex, BigQuery tables for Apache Iceberg still support the fine-grained security rules imposed by the storage APIs.

BigQuery metastore
Image Credit To Google Cloud

BigQuery tables for Apache Iceberg are used to export metadata into cloud storage Iceberg snapshots. BigQuery metastore, a serverless runtime metadata service that was revealed earlier this year, will shortly register the link to the most recent exported information. Any engine that can comprehend Iceberg may query the data straight from Cloud Storage with to Iceberg metadata outputs.

Find out more

Clients such as HCA Healthcare, one of the biggest healthcare organizations globally, recognize the benefits of using BigQuery tables for Apache Iceberg as their BigQuery storage layer that is compatible with Apache Iceberg, opening up new lakehouse use-cases. All Google Cloud regions now provide a preview of the BigQuery tables for Apache Iceberg.

Can other tools query data stored in BigQuery tables for Apache Iceberg?

Yes, metadata is exported from Apache Iceberg BigQuery tables into cloud storage Iceberg snapshots. This promotes interoperability within the open data ecosystem by enabling any engine that can comprehend the Iceberg format to query the data straight from Cloud Storage.

How secure are BigQuery tables for Apache Iceberg?

The strong security features of BigQuery, including as fine-grained security controls enforced by storage APIs, are carried over into BigQuery tables for Apache Iceberg. Additionally, end-to-end lineage tracking, data quality control, and extra governance policy management layers are made possible via interaction with Dataplex.

- Advertisement -
Thota nithya
Thota nithya
Thota Nithya has been writing Cloud Computing articles for govindhtech from APR 2023. She was a science graduate. She was an enthusiast of cloud computing.
RELATED ARTICLES

Recent Posts

Popular Post

Govindhtech.com Would you like to receive notifications on latest updates? No Yes