Amazon recently announced a new capability of Amazon Managed Streaming for Apache Kafka (Amazon MSK) that enables continuous data loading from an Apache Kafka cluster to Amazon Simple Storage Service (Amazon S3). Using the extract, transform, and load (ETL) capability of Amazon Kinesis Data Firehose, you can read data from a Kafka topic, transform the records, and write them to an Amazon S3 destination. Kinesis Data Firehose is fully managed, and you can configure it in the console with just a few clicks; no infrastructure or code is required.
Kafka is frequently used to build real-time data pipelines that reliably move large volumes of data between applications or systems. It provides a highly scalable, fault-tolerant publish-subscribe messaging system. Many AWS customers have adopted Kafka to capture streaming data such as click-stream events, transactions, IoT events, and application and machine logs. These customers also run applications that perform real-time analytics, apply continuous transformations, and distribute this data in real time to data lakes and databases.
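For readers less familiar with Kafka's publish-subscribe model, here is a minimal producer sketch using the kafka-python library. The broker address, topic name, and event fields are placeholders for this example; in practice, an MSK cluster would also require the appropriate TLS or IAM authentication settings.

```python
import json
from kafka import KafkaProducer

# Placeholder broker address; for Amazon MSK you would use the cluster's
# bootstrap brokers and configure TLS or IAM authentication as appropriate.
producer = KafkaProducer(
    bootstrap_servers="b-1.example.kafka.us-east-1.amazonaws.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a click-stream event to a topic; any number of independent
# consumers can subscribe to the same topic and read the same records.
producer.send("clickstream-events", {"user_id": "42", "action": "page_view"})
producer.flush()
```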
Deploying Kafka clusters, though, is not without difficulties.
The first difficulty is deploying, configuring, and maintaining the Kafka cluster itself. Amazon addressed this in May 2019 with the release of Amazon MSK, which makes it easier to set up, scale, and manage Apache Kafka. Because AWS takes care of the infrastructure, you are free to concentrate on your data and applications. The second difficulty is writing, deploying, and managing application code that consumes data from Kafka. This typically means writing connectors with the Kafka Connect framework and then deploying, managing, and maintaining them. Beyond the connectors themselves, you must code the data transformation and compression logic, the error handling, and the retry logic, and manage the infrastructure they run on, to ensure that no data is lost during the transfer out of Kafka.
Amazon now offers a fully managed way to deliver data from Amazon MSK to Amazon S3 using Amazon Kinesis Data Firehose. The solution is serverless and code-free, so there is no infrastructure to maintain, and the data transformation and error-handling behavior can be configured in just a few clicks in the console.
The data source is Amazon MSK, the data destination is Amazon S3, and the data transfer logic is handled by Amazon Kinesis Data Firehose.
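To illustrate how these pieces fit together, the following is a minimal sketch that creates a Firehose delivery stream with an MSK cluster as the source and an S3 bucket as the destination, using the AWS SDK for Python (boto3). The cluster ARN, topic name, role ARNs, and bucket ARN are placeholders you would replace with your own resources.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Create a delivery stream that reads from an MSK topic and writes to S3.
# All ARNs and names below are placeholders for your own resources.
firehose.create_delivery_stream(
    DeliveryStreamName="msk-to-s3-example",
    DeliveryStreamType="MSKAsSource",
    MSKSourceConfiguration={
        "MSKClusterARN": "arn:aws:kafka:us-east-1:111122223333:cluster/my-cluster/abc",
        "TopicName": "clickstream-events",
        "AuthenticationConfiguration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-msk-source-role",
            "Connectivity": "PRIVATE",
        },
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-s3-delivery-role",
        "BucketARN": "arn:aws:s3:::my-delivery-bucket",
        "Prefix": "msk-data/",
    },
)
```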
With this new feature, you can read data from Amazon MSK, transform it, and write the transformed data to Amazon S3 without writing any code. Kinesis Data Firehose manages the process of reading the data, transforming and compressing it, and writing it to Amazon S3, and it handles the error and retry logic if something goes wrong. When a record cannot be processed, it is delivered to an S3 bucket of your choice for manual review. The service likewise manages the infrastructure needed to handle the data stream, automatically scaling out and in with the amount of data being transferred, so there are no provisioning or maintenance tasks for you to perform.
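As a sketch of the failed-record handling described above, the fragment below (building on the destination configuration from the earlier example) gives the S3 destination a separate error prefix so that records Firehose cannot process land in a distinct location for manual review. The prefix expressions are illustrative values chosen for this example.

```python
# A fragment of ExtendedS3DestinationConfiguration showing separate prefixes
# for successfully delivered records and for records that failed processing.
s3_destination = {
    "RoleARN": "arn:aws:iam::111122223333:role/firehose-s3-delivery-role",
    "BucketARN": "arn:aws:s3:::my-delivery-bucket",
    # Successfully delivered records, partitioned by delivery date.
    "Prefix": "msk-data/!{timestamp:yyyy/MM/dd}/",
    # Records that could not be processed are written here for manual review.
    "ErrorOutputPrefix": "msk-errors/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/",
}
```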
Kinesis Data Firehose delivery streams support both public and private connectivity to Amazon MSK provisioned and serverless clusters. They also support cross-account connections, both for reading from an MSK cluster and for writing to S3 buckets in different AWS accounts. The delivery stream reads data from your MSK cluster, buffers it until a size or time threshold that you specify is reached, and then writes the buffered data to Amazon S3 as a single file. Data Firehose can deliver data to S3 buckets in other AWS Regions, but the MSK cluster and Data Firehose must be in the same AWS Region.
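For the buffering thresholds mentioned above, the S3 destination configuration accepts a BufferingHints block. The values below are illustrative, not defaults.

```python
# Buffering thresholds for the S3 destination: Firehose accumulates records
# until either the size or the interval threshold is reached, whichever comes
# first, then writes the buffered data to S3 as a single object.
buffering_hints = {
    "SizeInMBs": 128,          # flush after ~128 MiB of buffered data
    "IntervalInSeconds": 300,  # or after 5 minutes, whichever comes first
}

# Merged into the destination configuration passed to create_delivery_stream:
# ExtendedS3DestinationConfiguration={..., "BufferingHints": buffering_hints}
```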
Kinesis Data Firehose delivery streams can also convert data formats. Built-in transformations convert JSON to Apache Parquet and Apache ORC, columnar formats that save space and enable faster queries on Amazon S3. For non-JSON input such as CSV, XML, or structured text, you can use AWS Lambda to convert the records to JSON first and then let Data Firehose convert them to Apache Parquet or ORC. Alternatively, you can have Data Firehose deliver the data to Amazon S3 in its raw form or apply a compression type such as GZIP, ZIP, or Snappy.
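As a sketch of the Lambda-based input conversion mentioned above, the handler below decodes the records Firehose passes to it, parses each one as a CSV row with an assumed, hypothetical column layout, and returns JSON so that the built-in Parquet or ORC conversion can take over. Records that cannot be parsed are marked as failed so Firehose can route them to the error location.

```python
import base64
import csv
import io
import json

# Hypothetical column layout of the incoming CSV records.
COLUMNS = ["event_time", "user_id", "action"]

def lambda_handler(event, context):
    """Firehose data-transformation handler: convert CSV records to JSON."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        try:
            row = next(csv.reader(io.StringIO(raw)))
            payload = json.dumps(dict(zip(COLUMNS, row))) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
            })
        except (csv.Error, StopIteration):
            # Records that cannot be parsed are marked as failed; Firehose
            # writes them to the error location of the destination bucket.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```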