Tuesday, May 21, 2024

Google Cloud Introduces the Hive-BigQuery Connector

We are excited to announce the general availability (GA) release of the Hive-BigQuery Connector, an open-source solution that enables Apache Hive workloads to read from and write to BigQuery and BigLake tables. The connector is aimed at customers who want to migrate their data warehouse from Apache Hive to BigQuery but have run into obstacles along the way. Whether you plan a full migration or want both systems to coexist, the Hive-BigQuery Connector supports both approaches.

What is the Hive-BigQuery Connector?

If you have experience running Hadoop or Spark workloads on Google Cloud, you might already be familiar with the Cloud Storage Connector and the Apache Spark SQL connector for BigQuery. The first lets you store and access data files in Cloud Storage; the second reads and writes data between BigQuery and Spark DataFrames.

Similarly, the Hive-BigQuery Connector implements the Hive StorageHandler API, facilitating the integration of Hive workloads with BigQuery and BigLake tables. While Hive’s execution engine handles compute operations, such as aggregates and joins, the connector manages interactions with the data layer in BigQuery. This includes supporting data stored in either BigQuery’s native storage or open-source data formats in Cloud Storage buckets through a BigLake connection.
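
To make the StorageHandler idea concrete, here is a minimal sketch that builds the HiveQL statement declaring a Hive table whose data lives in BigQuery. The handler class name and the `bq.table` property reflect the connector's public README as best we understand it, so check them against the version you deploy. The table, project, and dataset names, and the helper `bigquery_backed_table_ddl`, are hypothetical.

```python
# Sketch: build the HiveQL DDL for a Hive table backed by a BigQuery table.
# Assumption: the handler class and the 'bq.table' property match the
# connector release you use; verify both before relying on this.

STORAGE_HANDLER = "com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler"


def bigquery_backed_table_ddl(hive_table, columns, bq_table):
    """Return a CREATE TABLE statement that delegates storage to BigQuery.

    columns is a list of (name, hive_type) pairs; bq_table is a
    'project.dataset.table' identifier. All names are hypothetical.
    """
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {hive_table} ({column_list})\n"
        f"STORED BY '{STORAGE_HANDLER}'\n"
        f"TBLPROPERTIES ('bq.table'='{bq_table}')"
    )


ddl = bigquery_backed_table_ddl(
    "word_counts",
    [("word", "STRING"), ("word_count", "BIGINT")],
    "my-project.my_dataset.word_counts",
)
print(ddl)
```

Once a statement like this has run in Hive, queries against `word_counts` are planned and executed by Hive, while reads and writes go through the connector to BigQuery.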

Apache Hive is a popular open-source data warehouse that provides an SQL-like interface for querying data stored in various databases and file systems integrated with Apache Hadoop. Over time, Hive has evolved to utilize cloud storage services, and the new connector simplifies migration by enabling Hive to integrate with native storage solutions like BigQuery.

Benefits of Cloud Data Warehouse Migration

Migrating a data warehouse to the cloud offers numerous benefits, including:

  1. Reduced costs: Pay for the resources you use, optimizing cost efficiency.
  2. Increased scalability: Easily scale up or down to meet changing needs.
  3. Improved reliability: Leverage redundant and highly available systems.
  4. Enhanced security: Implement encryption for data in transit and at rest, and enforce granular access control.
  5. Expanded capabilities: Integrate with a wide range of Google Cloud native tools and solutions, such as BigQuery’s materialized views and BI Engine for improved performance, Pub/Sub for low-latency data transport, Dataflow for scalable data processing, and Vertex AI for machine learning model development and deployment.

BigQuery Migration Service

To facilitate the migration process, Google Cloud offers the BigQuery Migration Service, a comprehensive solution designed to accelerate the migration from Hive data warehouses to BigQuery. The service includes free-to-use tools that assist with assessment, planning, data transfer, and data validation. Notably, the BigQuery batch SQL translator and interactive SQL translator convert Hive queries into BigQuery's ANSI-compliant SQL syntax, so the queries can run natively in BigQuery's execution engine.

Use Cases for the Hive-BigQuery Connector

The Hive-BigQuery Connector supports three core use cases:

  1. Wholesale migration with continuity of operations: When you migrate an entire Hive data warehouse to BigQuery, the connector keeps operations running throughout the process. Move the data to BigQuery first, let the original Hive queries access the migrated data through the connector, and translate those queries to BigQuery's SQL dialect gradually. Once the migration is complete, you can use BigQuery exclusively and retire Hive.
  2. Selective usage of BigQuery: If you prefer to continue using Hive for most workloads but want to leverage specific features of BigQuery, this use case allows for a unified environment. The Connector enables Hive to join its own tables with those managed by BigQuery, allowing selective usage of BigQuery for specific workloads that can benefit from its features, such as BI Engine or BigQuery ML.
  3. Full open-source software (OSS) stack: For those who want to maintain a full OSS stack for their data warehouse, the Connector supports the migration of data in its original OSS format (e.g., Avro, Parquet, or ORC) to Cloud Storage. Hive can continue to execute and process queries using its own SQL dialect, while the Connector enhances the OSS stack by utilizing BigLake and BigQuery features, such as metadata caching for query performance, Data Loss Prevention, column-level access control, and dynamic data masking for enhanced security and governance at scale.
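
As an illustration of the second use case, the sketch below assembles a HiveQL query that joins an ordinary Hive table with a table backed by BigQuery. It assumes `orders_bq` has already been declared in Hive through the connector's storage handler. The table names, column names, and the helper `join_query` are all hypothetical.

```python
# Sketch only: every table and column name here is hypothetical.
# `customers` is assumed to be a Hive-managed table; `orders_bq` is assumed
# to be a Hive table whose storage is delegated to BigQuery by the connector.


def join_query(hive_table, bq_backed_table, key):
    """Return a HiveQL query that counts matching rows per key across both tables."""
    return (
        f"SELECT h.{key}, COUNT(*) AS order_count\n"
        f"FROM {hive_table} h\n"
        f"JOIN {bq_backed_table} b ON h.{key} = b.{key}\n"
        f"GROUP BY h.{key}"
    )


query = join_query("customers", "orders_bq", "customer_id")
print(query)
```

From Hive's point of view this is a normal join; the connector only changes where the rows of `orders_bq` come from.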

Hive-BigQuery Connector Features

The Hive-BigQuery Connector offers several features, including:

  • Support for running queries with MapReduce and Tez execution engines
  • Creation and deletion of BigQuery tables from Hive
  • Joining BigQuery and BigLake tables with Hive tables
  • Fast reads from BigQuery tables using the Storage Read API streams and the Apache Arrow format
  • Two methods for writing data to BigQuery: direct writes using the BigQuery Storage Write API for low-latency workloads and indirect writes by staging temporary Avro files in Cloud Storage, then loading them into the destination table using the Load Job API for cost-efficient workloads
  • Access to BigQuery time-partitioned and clustered tables
  • Column pruning to retrieve only necessary columns from the data layer
  • Predicate pushdowns to pre-filter data rows at the BigQuery storage layer, improving query performance by reducing network data transfer
  • Automatic conversion of Hive data types to BigQuery data types
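
To show why column pruning and predicate pushdown matter, here is a toy model in plain Python. It is not the connector's code and does not call BigQuery; it only counts how many cells would cross the network when filtering and column selection happen at the storage layer instead of in the query engine. The data and the `scan` helper are invented for illustration.

```python
# Toy model of a storage layer. This is NOT the connector's implementation,
# only an illustration of why pruning and pushdown reduce transferred data.

ROWS = [
    {"id": 1, "country": "IN", "amount": 10.0, "note": "a"},
    {"id": 2, "country": "US", "amount": 25.0, "note": "b"},
    {"id": 3, "country": "IN", "amount": 40.0, "note": "c"},
    {"id": 4, "country": "DE", "amount": 5.0, "note": "d"},
]


def scan(rows, columns=None, predicate=None):
    """Simulate a storage read; return (result_rows, cells_transferred)."""
    selected = [r for r in rows if predicate is None or predicate(r)]
    if columns is not None:
        selected = [{c: r[c] for c in columns} for r in selected]
    cells = sum(len(r) for r in selected)
    return selected, cells


# No pruning or pushdown: every row and column is transferred,
# and the engine filters afterwards.
all_rows, cells_full = scan(ROWS)
engine_side = [r["amount"] for r in all_rows if r["country"] == "IN"]

# Pruning and pushdown: storage returns only matching rows and
# only the single column the query needs.
pushed, cells_pushed = scan(
    ROWS, columns=["amount"], predicate=lambda r: r["country"] == "IN"
)
storage_side = [r["amount"] for r in pushed]

print(cells_full, cells_pushed)  # 16 cells versus 2 cells
```

Both paths produce the same answer, but the second transfers a fraction of the data, which is the effect the two features above aim for.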

The Hive-BigQuery Connector has already proven its value in real-world scenarios, such as Flipkart’s data lake migration to Google Cloud. The flexibility provided by the connector allows queries on BigQuery data from Hive, providing the necessary interoperability while eliminating data duplication or silos across various data stores.

With the Hive-BigQuery Connector, users can integrate Hive workloads with BigQuery and BigLake tables, enabling migration, coexistence, and interaction between the two systems. This open-source solution brings the benefits of cloud data warehousing to Hive users and extends what Apache Hive can do in the cloud.


agarapuramesh