Semi-structured data like JSON
Enterprises generate ever more data: structured transactional data, semi-structured data like JSON, and unstructured data like images and audio. Because of this diversity and volume, data processing, storage, and query engines have traditionally had to build custom transformation pipelines to handle semi-structured and unstructured data. BigQuery's JSON type unlocks the power of semi-structured data without that overhead.
This post discusses the architectural concepts behind BigQuery's native JSON type, which eliminates complex preprocessing and provides schema flexibility, intuitive querying, and the scalability of structured data. It covers storage format optimizations, the performance benefits of the architecture, and how they affect billing for JSON paths.
Capacitor File Format integration
BigQuery's storage architecture is built on the columnar Capacitor file format. After a decade of research and optimization, this format stores exabytes of data and serves millions of queries. Capacitor was originally designed for structured data: encodings such as dictionary, run-length (RLE), and delta help it store column values optimally, and it reorders records to maximize the effectiveness of RLE. Because table row order rarely matters, Capacitor is free to permute rows to improve RLE. An embedded expression library uses the columnar storage for block-oriented vectorized processing.
To natively support JSON, the BigQuery team created a next-generation Capacitor format designed for sparse, semi-structured data.
During ingestion, JSON is shredded into virtual columns as far as possible. Most JSON keys are written once per column rather than once per row, and column data excludes the colons and whitespace of the serialized text. Storing values in columns makes structured-data encodings such as dictionary, run-length, and delta applicable, which greatly reduces storage and I/O costs at query time. The format natively understands JSON nulls and arrays, optimizing virtual column storage.
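The shredding idea can be sketched in a few lines. This is a hypothetical illustration of the concept, not BigQuery's actual implementation; the `shred` function and `MISSING` marker are invented for the example:

```python
import json

# Hypothetical sketch of shredding a JSON column into virtual columns at
# ingestion (names and details are illustrative, not BigQuery's code).
MISSING = object()  # marker for "key absent in this row"

def shred(rows):
    """Shred parsed JSON objects into one virtual column per key."""
    keys = sorted({k for row in rows for k in row})
    # Each key name is stored once per column, not once per row, and the
    # colons and whitespace of the serialized text disappear entirely.
    return {key: [row.get(key, MISSING) for row in rows] for key in keys}

rows = [json.loads(s) for s in (
    '{"id": 1, "reference_id": "a"}',
    '{"id": 2}',  # reference_id is absent in this row
    '{"id": 3, "reference_id": "c"}',
)]
columns = shred(rows)
print(columns["id"])  # prints [1, 2, 3] -- a delta/RLE-friendly run
```

Once values sit in a typed column like `columns["id"]`, encodings that assume homogeneous data (dictionary, run-length, delta) apply directly.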
BigQuery’s native JSON data type understands JSON objects, arrays, scalar types, nulls (‘null’), and empty arrays to preserve its nested structure.
Capacitor, the file format behind the native JSON data type, reorders records to group similar data and types together. Record reordering maximizes run-length encoding across rows, shrinking the virtual columns. For a key that holds integer values in some rows and string values in others, reordering groups the string-typed rows together and the integer-typed rows together, so the missing values in each typed virtual column form long run-length-encoded spans and the columns stay small.
Capacitor is optimized for structured datasets, which made JSON data, with its many shapes and types, challenging. The next-generation Capacitor addresses these challenges to support JSON natively:
- Add/remove keys: JSON keys are treated as optional elements and are marked as missing in rows that do not contain them.
- Scalar type changes: keys whose scalar type varies across rows (string, int, bool, float) are stored in separate typed virtual columns.
- Non-scalar type changes: non-scalar values such as objects and arrays are stored in an optimized binary format for fast parsing.
- After JSON data is shredded into virtual columns, the logical size of each column is calculated based on the data size at ingestion.
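The rules above can be sketched together. This is an illustrative model, not BigQuery's actual storage layout; the `key:type` column names and `shred_value` helper are invented, and `json.dumps` stands in for the optimized binary format:

```python
import json

# Hypothetical sketch of the rules above (illustrative, not BigQuery's
# actual storage layout): scalar values go into one typed virtual column
# per (key, type) pair, while non-scalar values (objects, arrays) are
# kept serialized for on-demand parsing. Missing-value markers are
# omitted here for brevity.

def shred_value(key, value, columns):
    if isinstance(value, (dict, list)):
        # Non-scalar: stored serialized (binary format in the real system).
        columns.setdefault(f"{key}:binary", []).append(json.dumps(value))
    else:
        # Scalar: one virtual column per (key, type) pair.
        columns.setdefault(f"{key}:{type(value).__name__}", []).append(value)

columns = {}
for row in ('{"k": 1}', '{"k": "x"}', '{"k": {"nested": true}}'):
    for key, value in json.loads(row).items():
        shred_value(key, value, columns)

print(sorted(columns))  # prints ['k:binary', 'k:int', 'k:str']
```

The same key `k` lands in three different virtual columns because its type changes across rows; each column stays homogeneous and encodes well.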
Better query performance with the native JSON data type
With JSON stored as a STRING, filtering or projecting specific JSON paths requires loading the entire STRING row from storage, decompressing it, and evaluating each filter and projection expression one row at a time. With the native JSON type, by contrast, BigQuery processes only the necessary virtual columns. Compute and filter pushdown of JSON operations further improves projection and filter efficiency: pushing these operations down to the embedded evaluation layer enables vectorized execution over the virtual columns, which is far more efficient than row-at-a-time processing of the STRING type.
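The contrast can be seen in miniature. This sketch is conceptual only; the variable names and row layout are invented for the example:

```python
import json

# Conceptual contrast (illustrative only): projecting one path from a
# STRING column versus from shredded virtual columns.

string_rows = [
    '{"id": %d, "reference_id": "r%d", "blob": "%s"}' % (i, i, "x" * 100)
    for i in range(5)
]

# STRING approach: every row is loaded and fully parsed just to get "id",
# even though most of each row's bytes (the blob) are irrelevant.
ids_from_string = [json.loads(row)["id"] for row in string_rows]

# Virtual-column approach: "id" already sits in its own column, so the
# projection is a plain vectorized read with no per-row parsing.
virtual_columns = {
    "id": list(range(5)),
    "reference_id": [f"r{i}" for i in range(5)],
}
ids_from_columns = virtual_columns["id"]

assert ids_from_string == ids_from_columns
```

Both paths return the same values, but the columnar read never touches the unrelated `blob` bytes, which is where the I/O and CPU savings come from.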
Customers are charged only for the size of the virtual columns scanned to return the JSON paths requested in the SQL query. For example, if `payload` is a JSON column containing the keys `reference_id` and `id`, a query that selects only `payload.reference_id` and `payload.id` scans just the virtual columns for those two keys.
With on-demand billing, such a query reports the number of logical bytes scanned. Each virtual column has a native data type (INT64, FLOAT64, STRING, BOOL), so the data size calculation is the sum of the scanned `reference_id` and `id` column sizes, following standard BigQuery data type sizes.
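A back-of-envelope calculation shows how this adds up, using BigQuery's published logical data sizes (INT64: 8 bytes per value; STRING: 2 bytes plus the UTF-8 length of the value). The row values here are made up for illustration:

```python
# Back-of-envelope sketch of the billing rule above, using BigQuery's
# published logical data type sizes. The sample values are invented.

def int64_size(_value):
    return 8  # INT64 is always 8 logical bytes

def string_size(value):
    return 2 + len(value.encode("utf-8"))  # 2 bytes + UTF-8 length

# Only the two scanned virtual columns contribute to bytes billed.
ids = [101, 102, 103]                        # payload.id           -> INT64
reference_ids = ["ord-1", "ord-2", "ord-3"]  # payload.reference_id -> STRING

bytes_scanned = (sum(int64_size(v) for v in ids)
                 + sum(string_size(v) for v in reference_ids))
print(bytes_scanned)  # 3*8 + 3*(2+5) = 45
```

Any other keys in `payload`, however large, contribute nothing to the bill because their virtual columns are never scanned.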
For BigQuery Editions, the virtual-column optimization means queries use less I/O and CPU than scanning the JSON string unchanged, because specific columns are scanned instead of loading the entire blob and extracting the paths from it.
With the new type, BigQuery processes only the JSON paths requested in the SQL query, which can significantly lower query costs.