We are excited to announce the public preview of the Hive-BigQuery Connector, an open-source solution that enables Apache Hive workloads to read from and write to BigQuery and BigLake tables. The connector is aimed at customers who want to migrate their data warehouse from Apache Hive to BigQuery but have faced challenges along the way. Whether you plan to migrate fully or want both systems to coexist, the Hive-BigQuery Connector supports a range of use cases.
What is the Hive-BigQuery Connector?
If you have experience running Hadoop or Spark workloads on Google Cloud, you might already be familiar with the Cloud Storage Connector and the Apache Spark SQL connector for BigQuery. The first lets you store and access data files in Cloud Storage; the second lets you read and write data between BigQuery and Spark DataFrames.
Similarly, the Hive-BigQuery Connector implements the Hive StorageHandler API, facilitating the integration of Hive workloads with BigQuery and BigLake tables. While Hive’s execution engine handles compute operations, such as aggregates and joins, the connector manages interactions with the data layer in BigQuery. This includes supporting data stored in either BigQuery’s native storage or open-source data formats in Cloud Storage buckets through a BigLake connection.
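As a rough illustration of what declaring a storage-handler-backed table looks like, the sketch below builds a HiveQL `CREATE TABLE ... STORED BY ... TBLPROPERTIES` statement as a string. The handler class name and the `bq.table` property key are assumptions based on the connector's public README and should be checked against the version you install; the table, column, and project names are hypothetical placeholders.

```python
# Assumed handler class name; verify it against the connector's README.
HANDLER_CLASS = "com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler"


def build_create_table(hive_table, columns, bq_table, extra_props=None):
    """Return a HiveQL CREATE TABLE statement for a storage-handler-backed table.

    columns is a list of (name, hive_type) pairs. bq_table is the target
    BigQuery table, written here as 'project.dataset.table' (assumed format).
    """
    # 'bq.table' is an assumed property key, not confirmed by this article.
    props = {"bq.table": bq_table}
    props.update(extra_props or {})
    column_sql = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    props_sql = ",\n  ".join(f"'{key}'='{value}'" for key, value in props.items())
    return (
        f"CREATE TABLE {hive_table} (\n  {column_sql}\n)\n"
        f"STORED BY '{HANDLER_CLASS}'\n"
        f"TBLPROPERTIES (\n  {props_sql}\n);"
    )


# Hypothetical table and project names, for illustration only.
ddl = build_create_table(
    "orders",
    [("order_id", "BIGINT"), ("amount", "DOUBLE")],
    "my-project.sales.orders",
)
print(ddl)
```

Once a table is declared this way, Hive queries address it like any other Hive table, while reads and writes are delegated to the handler.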
Apache Hive is a popular open-source data warehouse that provides an SQL-like interface for querying data stored in various databases and file systems integrated with Apache Hadoop. Over time, Hive has evolved to utilize cloud storage services, and the new connector simplifies migration by enabling Hive to integrate with native storage solutions like BigQuery.
Benefits of Cloud Data Warehouse Migration
Migrating a data warehouse to the cloud offers numerous benefits, including:
- Reduced costs: Pay only for the resources you use.
- Increased scalability: Easily scale up or down to meet changing needs.
- Improved reliability: Leverage redundant and highly available systems.
- Enhanced security: Implement encryption for data in transit and at rest, and enforce granular access control.
- Expanded capabilities: Integrate with a wide range of Google Cloud native tools and solutions, such as BigQuery’s materialized views and BI Engine for improved performance, Pub/Sub for low-latency data transport, Dataflow for scalable data processing, and Vertex AI for machine learning model development and deployment.
BigQuery Migration Service
To facilitate the migration process, Google Cloud offers the BigQuery Migration Service—a comprehensive solution designed to accelerate the migration from Hive data warehouses to BigQuery. The service includes free-to-use tools that assist with assessment, planning, data transfer, and data validation. Notably, the BigQuery batch SQL translator and interactive SQL translator enable the translation of Hive queries into BigQuery’s ANSI-compliant SQL syntax, allowing queries to be executed natively within BigQuery’s execution engine.
Use Cases for the Hive-BigQuery Connector
The Hive-BigQuery Connector caters to various core use cases, including:
- Wholesale migration with continuity of operations: When you migrate an entire Hive data warehouse to BigQuery, the Connector keeps operations running throughout the migration. Move the data to BigQuery first, let the original Hive queries access the migrated data through the Connector, and translate those queries to BigQuery’s SQL dialect gradually. Once the migration is complete, you can use BigQuery exclusively and retire Hive.
- Selective usage of BigQuery: If you prefer to keep using Hive for most workloads but want specific BigQuery features, the Connector gives you a unified environment. Hive can join its own tables with tables managed by BigQuery, so you can move only the workloads that benefit from features such as BI Engine or BigQuery ML.
- Full open-source software (OSS) stack: For those who want to maintain a full OSS stack for their data warehouse, the Connector supports the migration of data in its original OSS format (e.g., Avro, Parquet, or ORC) to Cloud Storage. Hive can continue to execute and process queries using its own SQL dialect, while the Connector enhances the OSS stack by utilizing BigLake and BigQuery features, such as metadata caching for query performance, Data Loss Prevention, column-level access control, and dynamic data masking for enhanced security and governance at scale.
Hive-BigQuery Connector Features
The Hive-BigQuery Connector, in its public preview release, offers several features, including:
- Support for running queries with MapReduce and Tez execution engines
- Creation and deletion of BigQuery tables from Hive
- Joining BigQuery and BigLake tables with Hive tables
- Fast reads from BigQuery tables using the Storage Read API streams and the Apache Arrow format
- Two methods for writing data to BigQuery: direct writes using the BigQuery Storage Write API for low-latency workloads and indirect writes by staging temporary Avro files in Cloud Storage, then loading them into the destination table using the Load Job API for cost-efficient workloads
- Access to BigQuery time-partitioned and clustered tables
- Column pruning to retrieve only necessary columns from the data layer
- Predicate pushdowns to pre-filter data rows at the BigQuery storage layer, improving query performance by reducing network data transfer
- Automatic conversion of Hive data types to BigQuery data types
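To make column pruning and predicate pushdown concrete, here is a toy, in-memory model of the idea. It does not use the connector or the BigQuery Storage Read API; `STORAGE`, `scan`, and `cells` are hypothetical names invented for this sketch. It compares how many cells cross the "network" when the engine filters and projects after fetching everything versus when the storage layer does that work first.

```python
# Toy in-memory stand-in for a table held in the storage layer.
STORAGE = [
    {"order_id": 1, "country": "FR", "amount": 20.0, "notes": "gift"},
    {"order_id": 2, "country": "US", "amount": 35.5, "notes": ""},
    {"order_id": 3, "country": "FR", "amount": 12.5, "notes": "rush"},
]


def scan(columns=None, predicate=None):
    """Return the rows the storage layer would send to the query engine.

    predicate models a pushed-down row filter; columns models column pruning.
    With neither argument, every row and every column is transferred.
    """
    rows = STORAGE if predicate is None else [r for r in STORAGE if predicate(r)]
    if columns is None:
        return [dict(r) for r in rows]
    return [{c: r[c] for c in columns} for r in rows]


def cells(rows):
    """Count transferred cells as a crude proxy for network volume."""
    return sum(len(r) for r in rows)


# Without pushdown: fetch everything, then filter and project in the engine.
full = scan()
naive = [{"amount": r["amount"]} for r in full if r["country"] == "FR"]

# With pushdown and pruning: the storage layer filters rows and drops columns.
pushed = scan(columns=["amount"], predicate=lambda r: r["country"] == "FR")

print(naive == pushed)            # True: both approaches give the same result
print(cells(full), cells(pushed)) # 12 2
```

Both paths produce the same answer, but the second transfers 2 cells instead of 12, which is the saving the last two optimizations in the list aim for at real table sizes.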
The Hive-BigQuery Connector has already proven its value in real-world scenarios, such as Flipkart’s data lake migration to Google Cloud. The flexibility provided by the connector allows queries on BigQuery data from Hive, providing the necessary interoperability while eliminating data duplication or silos across various data stores.
With the Hive-BigQuery Connector, users can integrate Hive workloads with BigQuery and BigLake tables, enabling migration, coexistence, and interaction between the two systems. This open-source solution supports the use cases described above, brings the benefits of cloud data warehousing to Hive users, and extends what Apache Hive can do in the cloud.