Sunday, July 7, 2024

Apache Iceberg Integration in AWS Glue Data Catalog

AWS Glue Data Catalog with Apache Iceberg

AWS now offers automatic compaction of transactional tables in the Apache Iceberg format through a new feature of the AWS Glue Data Catalog. As a result, your transactional data lake tables can remain consistently performant.
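If you prefer to enable the feature programmatically rather than through the console, the sketch below shows one way to do it, assuming the create_table_optimizer call in the boto3 Glue client; the account ID, database, table, and IAM role ARN are placeholders you would replace with your own.

```python
import boto3

# Placeholders -- replace with your own account ID, database, table, and role.
ACCOUNT_ID = "123456789012"
DATABASE = "my_iceberg_db"
TABLE = "my_iceberg_table"
ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/GlueTableOptimizerRole"

glue = boto3.client("glue", region_name="us-east-1")

# Enable automatic compaction for one Iceberg table in the Glue Data Catalog.
# The role needs read/write access to the table's S3 location and permission
# to update the table in the catalog.
glue.create_table_optimizer(
    CatalogId=ACCOUNT_ID,
    DatabaseName=DATABASE,
    TableName=TABLE,
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": ROLE_ARN,
        "enabled": True,
    },
)
```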

Data lakes were first adopted for big data and analytics use cases because they let organizations store large volumes of unstructured, semi-structured, or raw data inexpensively. Organizations have since realized that data lakes can serve far more than reporting, which has steadily widened the range of use cases they support. To guarantee data consistency, data lakes must now include transactional capabilities.

Data lakes are also essential for data quality, governance, and compliance, because they hold ever-increasing amounts of critical corporate data that frequently needs to be updated or deleted. In addition, data-driven businesses must keep their back-end analytics systems in near-real-time sync with customer applications. This scenario requires transactional capabilities on your data lake so that concurrent writes and reads do not compromise data integrity. Finally, data lakes have become integration points that need transactions to move data safely and reliably between different sources.

To enable transactional semantics on data lake tables, businesses adopted an open table format (OTF) such as Apache Iceberg. Using OTF formats brings its own set of challenges, such as managing the large number of small files created on Amazon Simple Storage Service (Amazon S3) as each transaction writes a new file, or managing object and metadata versioning at scale.
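To make the small-files problem concrete, here is a minimal PySpark sketch of an Iceberg table backed by the Glue Data Catalog, where every committed write adds new data files. The catalog name, warehouse bucket, and table names are illustrative, and the Iceberg Spark runtime and AWS bundle jars are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Spark session wired to the Glue Data Catalog through Iceberg's GlueCatalog.
# Bucket, catalog, and table names are placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse/")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS glue.my_iceberg_db.events (
        event_id   BIGINT,
        event_time TIMESTAMP,
        payload    STRING
    ) USING iceberg
""")

# Each committed INSERT is an ACID transaction that writes new data files,
# which is how small files pile up on S3 over time.
spark.sql("INSERT INTO glue.my_iceberg_db.events VALUES (1, current_timestamp(), 'a')")
spark.sql("INSERT INTO glue.my_iceberg_db.events VALUES (2, current_timestamp(), 'b')")
```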

Organizations also need to convert existing data lake tables from Parquet or Avro formats to an OTF format. To overcome these issues, they usually build and manage their own data pipelines, which adds undifferentiated infrastructure work: writing code, deploying and scaling Spark clusters, handling failures, and so on.
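As a rough sketch of the conversion step, Iceberg ships Spark procedures for importing existing Parquet data. The example below assumes the add_files procedure and reuses the hypothetical session and table names from the previous sketch; the source table name is a placeholder, and the target Iceberg table must already exist with a matching schema.

```python
# Import the data files of an existing Parquet table into an Iceberg table
# using Iceberg's add_files procedure (run from the Spark session above).
# 'legacy_db.events_parquet' is a placeholder for the source table.
spark.sql("""
    CALL glue.system.add_files(
        table => 'my_iceberg_db.events',
        source_table => 'legacy_db.events_parquet'
    )
""").show()
```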

Speaking with AWS customers has shown that the hardest part is compacting the many small files produced by each transactional write into a small number of large files. Large files load and scan faster, which speeds up your analytics jobs and queries. Compaction optimizes table storage by replacing many small files with a few larger ones. It improves performance, reduces network round trips to S3, and lowers metadata overhead. And because queries need less compute to run, the performance gain also reduces usage cost on engines that charge for compute.
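For comparison, this is roughly what a self-managed compaction pass looks like with Iceberg's rewrite_data_files Spark procedure, again using the hypothetical session and table from the earlier sketches; the managed feature removes the need to schedule and operate jobs like this yourself.

```python
# Manual compaction pass: rewrite many small data files into larger ones
# (target size here is roughly 512 MB). This is the kind of job automatic
# compaction takes over for you.
spark.sql("""
    CALL glue.system.rewrite_data_files(
        table   => 'my_iceberg_db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""").show()
```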

However, building custom pipelines to compact and optimize Iceberg tables is costly and time-consuming. You are responsible for planning, provisioning infrastructure, scheduling, and monitoring the compaction jobs. That is why AWS is introducing automatic compaction.
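Once automatic compaction is enabled (see the boto3 sketch near the top), you can check its recent runs from the same API surface. This sketch assumes the list_table_optimizer_runs call in the boto3 Glue client; the names are placeholders and the response field names may differ slightly from what is shown.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List recent automatic compaction runs for one table. Names are placeholders,
# and .get() is used defensively in case the response shape differs.
resp = glue.list_table_optimizer_runs(
    CatalogId="123456789012",
    DatabaseName="my_iceberg_db",
    TableName="my_iceberg_table",
    Type="compaction",
)

for run in resp.get("TableOptimizerRuns", []):
    print(run.get("eventType"), run.get("startTimestamp"), run.get("endTimestamp"))
```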

Things to consider

Here are a few additional things to know as AWS launches this new feature:

  • Compaction does not merge or remove Iceberg delete files. Tables that contain deleted data are still compacted, but data files that have delete files associated with them are skipped.
  • S3 buckets configured for exclusive access from a VPC through VPC endpoints are not supported.
  • Compaction supports Apache Iceberg tables whose data is stored in Apache Parquet format.
  • Compaction works with buckets encrypted using the default server-side encryption (SSE-S3) or KMS-managed keys (SSE-KMS).

Availability and pricing

This new capability is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland).

Pricing is based on the data processing unit (DPU), a relative measure of processing power consisting of 4 vCPUs of compute capacity and 16 GB of memory. You are charged per DPU-hour, billed per second with a one-minute minimum.
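As an illustration only, here is how per-second DPU billing with a one-minute minimum works out; the hourly rate below is a placeholder, so check the current AWS Glue pricing page for your Region.

```python
# Illustrative cost estimate for a run billed per DPU-hour, metered per second
# with a one-minute minimum. The rate is hypothetical, not an official price.
HYPOTHETICAL_RATE_PER_DPU_HOUR = 0.44  # USD; check current AWS pricing

def estimate_cost(dpus: int, runtime_seconds: float) -> float:
    billed_seconds = max(runtime_seconds, 60.0)  # one-minute minimum
    return dpus * (billed_seconds / 3600.0) * HYPOTHETICAL_RATE_PER_DPU_HOUR

# Example: 2 DPUs running for 5 minutes.
print(f"${estimate_cost(2, 300):.4f}")
```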

You can adopt this fully managed capability today and decommission your existing compaction data pipelines.
