Dominate Structured and Semi-Structured Data Explosion

November 17, 2023

Semi-structured data like JSON

Enterprises are generating more data than ever, including structured transactional data, semi-structured data like JSON, and unstructured data like images and audio. Because of this diversity and volume, data processing, storage, and query engines have traditionally had to build custom transformation pipelines to handle semi-structured and unstructured data. BigQuery’s JSON type is meant to unlock the power of semi-structured data without those pipelines.

This post discusses the architectural concepts behind BigQuery’s support for semi-structured JSON data, which eliminates complex preprocessing and provides schema flexibility, intuitive querying, and the scalability of structured data. It also covers storage format optimizations, the performance benefits of the architecture, and how they affect billing for JSON paths.

Capacitor File Format integration

BigQuery’s storage architecture relies on Capacitor, a columnar storage format. After a decade of research and optimization, this format stores exabytes of data and serves millions of queries. Capacitor was designed for structured data. Encodings such as Dictionary, Run Length Encoding (RLE), and Delta help Capacitor store column values compactly. Because the order of rows in a table rarely matters, Capacitor can also permute rows to make RLE more effective. An embedded expression library takes advantage of the columnar layout for block-oriented vectorized processing.
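The encodings named above can be illustrated with a minimal Python sketch. This is not Capacitor's implementation, which the post does not describe in detail. The function names and the sample `country` column are hypothetical, and sorting stands in for whatever row permutation the real format chooses.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical column. Row order is assumed not to matter, so the rows
# can be permuted (here: sorted) to produce longer runs for RLE.
country = ["US", "DE", "US", "DE", "US", "US"]
runs_before = rle_encode(country)          # 5 runs
runs_after = rle_encode(sorted(country))   # 2 runs: [("DE", 2), ("US", 4)]
```

Permuting the rows cuts the number of runs from five to two in this toy column, which is the effect row reordering is meant to have on RLE.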

To support JSON natively, the BigQuery team created a next-generation Capacitor format for sparse semi-structured data.

During ingestion, JSON is shredded into virtual columns as much as possible. Most JSON keys are written once per column rather than once per row, and non-data characters such as colons and whitespace are not stored in the column data. Placing values in columns makes it possible to apply the same encodings used for structured data, such as Dictionary, Run Length, and Delta, to semi-structured data. This greatly reduces storage and query-time IO costs. The format also understands JSON nulls and arrays natively, which optimizes virtual column storage.
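As a rough illustration of shredding, the sketch below flattens JSON documents into one list per key path. It is a simplified model under stated assumptions, not BigQuery's ingestion code: the dotted path naming, the `MISSING` sentinel, and the `shred` function are hypothetical. Arrays are kept as single values, and empty objects produce no column.

```python
import json

# Hypothetical sentinel for "key absent in this row". It is distinct from
# None, which represents an explicit JSON null.
MISSING = object()


def shred(raw_rows):
    """Shred JSON documents into virtual columns keyed by dotted path."""

    def flatten(value, prefix, out):
        if isinstance(value, dict):
            for key, child in value.items():
                flatten(child, f"{prefix}.{key}" if prefix else key, out)
        else:
            out[prefix] = value  # scalars, nulls, and arrays are leaves here

    flat_rows = []
    for text in raw_rows:
        out = {}
        flatten(json.loads(text), "", out)
        flat_rows.append(out)

    # Each key path is stored once, as the column name, not once per row.
    paths = sorted({path for row in flat_rows for path in row})
    return {path: [row.get(path, MISSING) for row in flat_rows] for path in paths}


rows = [
    '{"id": 1, "user": {"name": "ann"}}',
    '{"id": 2, "note": null}',
]
columns = shred(rows)
# columns["id"] == [1, 2]
# columns["user.name"] == ["ann", MISSING]
# columns["note"] == [MISSING, None]
```

Once values sit in per-path lists like these, column encodings can be applied to each list independently.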

BigQuery’s native JSON data type understands JSON objects, arrays, scalar types, nulls (‘null’), and empty arrays, so the nested structure of the data is preserved.

Capacitor, the file format underlying the native JSON data type, reorders records to group similar data and types. Record reordering maximizes Run Length Encoding across rows, which reduces the size of virtual columns. For example, if a key holds integer values in some rows and string values in others, record reordering groups the rows with strings together and the rows with integers together. Each of the two virtual columns then contains run-length-encoded spans of missing values, which makes both columns smaller.
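The integer and string example can be made concrete with a small sketch. It assumes one virtual column per scalar type and uses `None` to mark a missing value. The helper names are hypothetical, and the stable sort is only a stand-in for the reordering Capacitor actually performs.

```python
from itertools import groupby


def split_by_type(values):
    """Build one virtual column per scalar type; None marks a missing value."""
    int_col = [v if isinstance(v, int) and not isinstance(v, bool) else None
               for v in values]
    str_col = [v if isinstance(v, str) else None for v in values]
    return int_col, str_col


def missing_runs(column):
    """Count the runs of consecutive missing values in a column."""
    runs = groupby(column, key=lambda v: v is None)
    return sum(1 for is_missing, _ in runs if is_missing)


# A key whose value alternates between integer and string across rows.
values = [7, "a", 8, "b", 9, "c"]
int_before, str_before = split_by_type(values)

# Stable sort that places integer rows first and string rows second.
reordered = sorted(values, key=lambda v: isinstance(v, str))
int_after, str_after = split_by_type(reordered)

# Before reordering each column has 3 separate missing spans;
# after reordering each column has a single span.
```

A single long span of missing values encodes as one run, whereas alternating values and gaps need a run per row.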

Capacitor was originally optimized for structured datasets, which made JSON data, with its many shapes and types, difficult to support. The next-generation Capacitor format that natively supports JSON had to overcome the following challenges.

Add/Remove keys
JSON keys are treated as optional elements, so they are marked as missing in rows that do not contain them.
Scalar type changes
Keys whose scalar type (string, int, bool, float) changes across rows are written into separate virtual columns, one per type.
Non-scalar type changes
Non-scalar values such as objects and arrays are stored in a binary format optimized for parsing.

After JSON data is shredded into virtual columns, the logical size of each column is calculated from the size of the data at ingestion.
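The three cases above can be sketched in a few lines of Python. This is an illustrative model only: the `key:kind` column naming, the `MISSING` sentinel, and the use of compact JSON bytes as a stand-in for the optimized binary format are all assumptions, since the post does not specify the real layout.

```python
import json

MISSING = object()  # hypothetical marker for "key absent in this row"


def classify(value):
    """Name the kind of virtual column a JSON value goes to in this sketch."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    return "nonscalar"  # objects and arrays


def build_columns(key, docs):
    """Place each row's value for `key` into a per-kind virtual column."""
    columns = {}
    for row, doc in enumerate(docs):
        if key not in doc:
            continue  # added/removed key: the row stays missing everywhere
        value = doc[key]
        kind = classify(value)
        if kind == "nonscalar":
            # Stand-in for the binary format: compact JSON bytes.
            value = json.dumps(value, separators=(",", ":")).encode("utf-8")
        column = columns.setdefault(f"{key}:{kind}", [MISSING] * len(docs))
        column[row] = value
    return columns


docs = [{"v": 1}, {"v": "x"}, {}, {"v": [1, 2]}]
cols = build_columns("v", docs)
# cols["v:int"]       == [1, MISSING, MISSING, MISSING]
# cols["v:string"]    == [MISSING, "x", MISSING, MISSING]
# cols["v:nonscalar"] == [MISSING, MISSING, MISSING, b"[1,2]"]
```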

Better query performance with JSON native data types

If JSON is stored as a STRING, filtering or projecting specific JSON paths requires loading the entire JSON STRING row from storage, decompressing it, and evaluating each filter and projection expression one row at a time.

With the native JSON type, by contrast, BigQuery processes only the necessary virtual columns. Compute and filter pushdown of JSON operations was added to make projections and filters more efficient. Pushing projection and filter operations down to the embedded evaluation layer allows vectorized operations over virtual columns, which is more efficient than processing the STRING type.
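The difference between the two access patterns can be sketched as follows. The sample rows, the `blob` key, and both function names are hypothetical, and plain Python lists stand in for BigQuery's vectorized evaluation layer. The byte counts only illustrate how much data each approach has to touch.

```python
import json


def filter_string_rows(raw_rows, key, wanted):
    """STRING storage: read and parse every full document, row by row."""
    matches, bytes_read = [], 0
    for row, text in enumerate(raw_rows):
        bytes_read += len(text.encode("utf-8"))  # the whole row is loaded
        if json.loads(text).get(key) == wanted:
            matches.append(row)
    return matches, bytes_read


def filter_virtual_column(column, wanted):
    """Native JSON storage (modelled): touch only the one column needed."""
    return [row for row, value in enumerate(column) if value == wanted]


raw_rows = [
    '{"id": 1, "blob": "xxxxxxxx"}',
    '{"id": 2, "blob": "yyyyyyyy"}',
    '{"id": 1, "blob": "zzzzzzzz"}',
]
id_column = [1, 2, 1]  # the virtual column for the "id" path

string_matches, string_bytes = filter_string_rows(raw_rows, "id", 1)
column_matches = filter_virtual_column(id_column, 1)
# Both approaches select rows 0 and 2, but only the STRING path had to
# read the unrelated "blob" values.
```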

Customers are charged only for the size of the virtual columns scanned to return the JSON paths requested in the SQL query. For example, if payload is a JSON column and a query references only the paths `payload.reference_id` and `payload.id`, only the virtual columns representing the JSON keys `reference_id` and `id` are scanned across the data.

On-demand billing for JSON reflects the number of logical bytes scanned. Each virtual column has a native data type (INT64, FLOAT, STRING, BOOL), so the data size is the sum of the sizes of the scanned `reference_id` and `id` columns, calculated according to the standard BigQuery data type sizes.
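The size calculation can be approximated with a short sketch. The per-type sizes follow BigQuery's published data type sizes as far as I know them (8 bytes for INT64 and FLOAT64, 1 byte for BOOL, and 2 bytes plus the UTF-8 length for STRING). Treating a missing value as 0 bytes, the sample column contents, and both function names are assumptions made for illustration.

```python
def logical_size(value):
    """Approximate logical bytes for one value in a typed virtual column."""
    if value is None:
        return 0  # assumption: missing/NULL values contribute no bytes
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return 1
    if isinstance(value, (int, float)):
        return 8
    if isinstance(value, str):
        return 2 + len(value.encode("utf-8"))
    raise TypeError(f"unsupported scalar type: {type(value).__name__}")


def scanned_bytes(columns, requested_paths):
    """Sum the sizes of only the virtual columns a query requests."""
    return sum(logical_size(v) for path in requested_paths for v in columns[path])


# Hypothetical virtual columns for a JSON column named payload.
payload_columns = {
    "reference_id": ["ab-1", "ab-2", None],
    "id": [10, 20, 30],
    "body": ["a long description", "another long description", "more text"],
}

billed = scanned_bytes(payload_columns, ["reference_id", "id"])
# reference_id: (2 + 4) + (2 + 4) + 0 = 12 bytes; id: 3 * 8 = 24 bytes
# billed == 36, and the "body" column is never counted
```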

With BigQuery Editions, the optimized virtual columns allow queries to use less IO and CPU than they would if the JSON string were stored unchanged, because the query scans specific columns instead of loading the entire blob and extracting the paths.

With the new type, BigQuery can process only the JSON paths requested by the SQL query, which can significantly lower query costs.
