Stream real-time knowledge into Apache Iceberg tables in Amazon S3 utilizing Amazon Information Firehose

November 6, 2024

27

As companies generate extra knowledge from quite a lot of sources, they want techniques to successfully handle that knowledge and use it for enterprise outcomes—resembling offering higher buyer experiences or decreasing prices. We see these developments throughout many industries—on-line media and gaming firms offering suggestions and customised promoting, factories monitoring gear for upkeep and failures, theme parks offering wait instances for well-liked sights, and lots of others.

To construct such purposes, engineering groups are more and more adopting two developments. First, they’re changing batch knowledge processing pipelines with real-time streaming, so purposes can derive perception and take motion inside seconds as an alternative of ready for each day or hourly batch change, remodel, and cargo (ETL) jobs. Second, as a result of conventional knowledge warehousing approaches are unable to maintain up with the quantity, velocity, and number of knowledge, engineering groups are constructing knowledge lakes and adopting open knowledge codecs resembling Parquet and Apache Iceberg to retailer their knowledge. Iceberg brings the reliability and ease of SQL tables to Amazon Easy Storage Service (Amazon S3) knowledge lakes. By utilizing Iceberg for storage, engineers can construct purposes utilizing completely different analytics and machine studying frameworks resembling Apache Spark, Apache Flink, Presto, Hive, or Impala, or AWS providers resembling Amazon SageMaker, Amazon Athena, AWS Glue, Amazon EMR, Amazon Managed Service for Apache Flink, or Amazon Redshift.

Iceberg is well-liked as a result of first, it’s extensively supported by completely different open-source frameworks and distributors. Second, it permits prospects to learn and write knowledge concurrently utilizing completely different frameworks. For instance, you may write some information utilizing a batch ETL Spark job and different knowledge from a Flink software on the identical time and into the identical desk. Third, it permits situations resembling time journey and rollback, so you may run SQL queries on a point-in-time snapshot of your knowledge, or rollback knowledge to a beforehand recognized good model. Fourth, it helps schema evolution, so when your purposes evolve, you may add new columns to your tables with out having to rewrite knowledge or change present purposes. To be taught extra, see Apache Iceberg.

On this put up, we talk about how one can ship real-time knowledge streams into Iceberg tables on Amazon S3 through the use of Amazon Information Firehose. Amazon Information Firehose simplifies the method of streaming knowledge by permitting customers to configure a supply stream, choose an information supply, and set Iceberg tables because the vacation spot. As soon as arrange, the Firehose stream is able to ship knowledge. Firehose is built-in with over 20 AWS providers, so you may ship real-time knowledge from Amazon Kinesis Information Streams, Amazon Managed Streaming for Apache Kafka, Amazon CloudWatch Logs, AWS Web of Issues (AWS IoT), AWS WAF, Amazon Community Firewall Logs, or out of your customized purposes (by invoking the Firehose API) into Iceberg tables. It’s price efficient as a result of Firehose is serverless, you solely pay for the info despatched and written to your Iceberg tables. You don’t should provision something or pay something when your streams are idle throughout nights, weekends, or different non-use hours.

Firehose additionally simplifies establishing and working superior situations. For instance, if you wish to route knowledge to completely different Iceberg tables to have knowledge isolation or higher question efficiency, you may arrange a stream to robotically route information into completely different tables based mostly on what’s in your incoming knowledge and distribute information from a single stream into dozens of Iceberg tables. Firehose robotically scales—so that you don’t should plan for a way a lot knowledge goes into which desk—and has built-in mechanisms to deal with supply failures and assure precisely as soon as supply. Firehose helps updating and deleting information in a desk based mostly on the incoming knowledge stream, so you may assist situations resembling GDPR and right-to-forget laws. As a result of Firehose is totally suitable with Iceberg, you may write knowledge utilizing it and concurrently use different purposes to learn and write to the identical tables. Firehose integrates with the AWS Glue Information Catalog, so you should utilize options in AWS Glue resembling managed compaction for Iceberg tables.

Within the following sections, you’ll discover ways to arrange Firehose to ship real-time streams into Iceberg tables to deal with 4 completely different situations:

Ship knowledge from a stream right into a single Iceberg desk and insert all incoming information.
Ship knowledge from a stream right into a single Iceberg desk and carry out document inserts, updates, and deletes.
Route information to completely different tables based mostly on the content material of the incoming knowledge by specifying a JSON Question expression.
Route information to completely different tables based mostly on the content material of the incoming knowledge through the use of a Lambda operate.

Additionally, you will discover ways to question the info you’ve gotten delivered to Iceberg tables utilizing a regular SQL question in Amazon Athena. All the AWS providers utilized in these examples are serverless, so that you don’t should provision and handle any infrastructure.

Resolution overview

The next diagram illustrates the structure.

In our examples, we use Kinesis Information Generator, a pattern software to generate and publish knowledge streams to Firehose. You can even arrange Firehose to make use of different knowledge sources to your real-time streams. We arrange Firehose to ship the stream into Iceberg tables within the Information Catalog.

Walkthrough

This put up contains an AWS CloudFormation template for a fast setup. You’ll be able to assessment and customise it to fit your wants. The template performs the next operations:

Creates a Information Catalog database for the vacation spot Iceberg tables
Creates 4 tables within the AWS Glue database which might be configured to make use of the Apache Iceberg format
Specifies the S3 bucket areas for the vacation spot tables
Creates a Lambda operate (non-obligatory)
Units up an AWS Identification and Entry Administration (IAM) function for Firehose
Creates sources for Kinesis Information Generator

Stipulations

For this walkthrough, it’s best to have the next conditions:

An AWS account. Should you don’t have an account, you may create one.

Deploy the answer

Step one is to deploy the required sources into your AWS surroundings through the use of a CloudFormation template.

Check in to the AWS Administration Console for CloudFormation.
Select Launch Stack.
Select Subsequent.
Go away the stack identify as Firehose-Iceberg-Stack, and within the parameters, enter the username and password that you just need to use for accessing Kinesis Information Generator.
Go to the underside of the web page and choose I acknowledge that AWS CloudFormation may create IAM sources and select Subsequent.
Evaluation the deployment and select Submit.

The stack can take 5–10 minutes to finish, after which you’ll be able to view the deployed stack on the CloudFormation console. The next determine reveals the deployed Firehose-Iceberg-stack particulars.

Earlier than you arrange Firehose to ship streams, you could create the vacation spot tables within the Information Catalog. For the examples mentioned right here, we use the CloudFormation template to robotically create the tables used within the examples. On your customized purposes, you may create your tables utilizing CloudFormation, or through the use of DDL instructions in Athena or Glue. The next is the DDL command for making a desk utilized in our instance:

CREATE TABLE firehose_iceberg_db.firehose_events_1 (
sort struct<gadget: string, occasion: string, motion: string>,
customer_id string,
event_timestamp timestamp,
area string)
LOCATION 's3://firehose-demo-iceberg-<account_id>-<area>/iceberg/events_1'
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='zstd'
);

Additionally be aware that the 4 tables that we use within the examples have the identical schema, however you may have tables with completely different schemas in your software.

Use case 1: Ship knowledge from a stream right into a single Iceberg desk and insert all incoming information

Now that you’ve got arrange the supply to your knowledge stream and the vacation spot tables, you’re able to arrange Firehose to ship streams into the Iceberg tables.

Create a Firehose stream:

Go to the Information Firehose console and select Create Firehose stream.
Choose Direct PUT because the Supply and Apache Iceberg Tables because the Vacation spot.

This instance makes use of Direct PUT because the supply, however the identical steps will be utilized for different Firehose sources resembling Kinesis Information Streams, and Amazon MSK.

For the Firehose stream identify, enter firehose-iceberg-events-1.
In Vacation spot settings, allow Inline parsing for routing data. As a result of all information from the stream are inserted right into a single vacation spot desk, you specify a vacation spot database and desk. By default, Firehose inserts all incoming information into the desired vacation spot desk.
1. Database expression: “firehose_iceberg_db”
2. Desk expression: “firehose_events_1”

Embody double citation marks to make use of the literal worth for the database and desk identify. If you don’t use double quotations marks, Firehose assumes that it is a JSON Question expression and can try and parse the expression when processing your stream and fail.

Go to Buffer hints and scale back the Buffer dimension to 1 MiB and the Buffer interval to 60 You’ll be able to tremendous tune these settings to your software.
For Backup settings:
- Choose the S3 bucket created by the CloudFormation template. It has the next construction: s3://firehose-demo-iceberg-<account_id>-<area>
- For error output prefix enter: error/events-1/
In Superior settings, allow CloudWatch error logging, and in Current IAM roles, choose the function that begins with Firehose-Iceberg-Stack-FirehoseIamRole-*, created by the CloudFormation template.
Select Create Firehose stream.

Generate streaming knowledge:

Use Kinesis Information Generator to publish knowledge information into your Firehose stream.

Go to the CloudFormation stack, choose the Nested stack for the generator, and select Outputs.
Select the KinesisDataGenerator URL and enter the credentials that you just outlined when deploying the CloudFormation stack.
Choose the AWS Area the place you deployed the CloudFormation stack and choose your Firehose stream.

For template, substitute the values on the display screen with the next:

{
"sort": {
"gadget": "{{random.arrayElement(["mobile", "desktop", "tablet"])}}",
"occasion": "{{random.arrayElement(["firehose_events_1", "firehose_events_2"])}}",
"motion": "replace"
},
"customer_id": "{{random.quantity({ "min": 1, "max": 1500})}}",
"event_timestamp": "{{date.now("YYYY-MM-DDTHH:mm:ss.SSS")}}",
"area": "{{random.arrayElement(["pdx", "nyc"])}}"
}

Earlier than sending knowledge, select Take a look at template to see an instance payload.
Select Ship knowledge.

Querying with Athena:

You’ll be able to question the info you’ve written to your Iceberg tables utilizing completely different processing engines resembling Apache Spark, Apache Flink, or Trino. On this instance, we’ll present you the way you should utilize Athena to question knowledge that you just’ve written to Iceberg tables.

Go to the Athena console.
Configure a Location of question outcome. You should use the identical S3 bucket for this however add a suffix on the finish.
```
s3://firehose-demo-iceberg-<account_id>-<area>/athena/
```
Within the question editor, in Tables and views, choose the choices button subsequent to firehose_events_1 and choose Preview Desk.

It’s best to have the ability to see knowledge within the Apache Iceberg tables through the use of Athena.

With that, you ‘ve delivered knowledge streams into an Apache Iceberg desk utilizing Firehose and run a SQL question in opposition to your knowledge.

Now let’s discover the opposite situations. We are going to observe the identical process as earlier than for creating the Firehose stream and querying Iceberg tables with Amazon Athena.

Use case 2: Ship knowledge from a stream right into a single Iceberg desk and carry out document inserts, updates, and deletes

One of many benefits of utilizing Apache Iceberg is that it means that you can carry out row-level operations resembling updates and deletes on tables in an information lake. Firehose will be set as much as robotically apply document replace and delete operations in your Iceberg tables.

Issues to know:

If you apply an replace or delete operation by way of Firehose, the info in Amazon S3 isn’t truly deleted. As an alternative, a marker document is written based on the Apache Iceberg format specification to point that the document is up to date or deleted, so subsequent learn and write operations get the newest document. If you wish to purge (take away the underlying knowledge from Amazon S3) the deleted information, you should utilize instruments developed for purging information in Apache Iceberg.
Should you try and replace a document utilizing Firehose and the underlying document doesn’t exist already within the vacation spot desk, Firehose will insert the document as a brand new row.

Create a Firehose stream:

Go to the Amazon Information Firehose console.
Select Create Firehose stream.
For Supply, choose Direct PUT. For Vacation spot choose Apache Iceberg Tables.
For the Firehose stream identify, enter firehose-iceberg-events-2.
Within the e, allow inline parsing for routing data and supply the required values as static values for Database expression and Desk expression. Since you need to have the ability to replace information, you additionally have to specify the Operation expression.
1. Database expression: “firehose_iceberg_db”
2. Desk expression: “firehose_events_2”
3. Operation expression: “replace”

Embody double citation marks to make use of the literal worth for the database and desk identify. If you don’t use double quotations marks, Firehose assumes that it is a JSON Question expression and can try and parse the expression when processing your stream and fail.

Since you need to carry out replace and delete operations, you’ll want to present the columns within the vacation spot desk that will likely be used as distinctive keys to determine the document within the vacation spot to be up to date or deleted.
- For DestinationDatabaseName: “firehose_iceberg_db“
- For DestinationTableName: “firehose_events_2”
- In UniqueKeys, substitute the present worth with: “customer_id”
Change the Buffer hints to 1 MiB and 60
In Backup settings, choose the identical bucket from the stack, however enter the next within the error output prefix:
In Superior settings, allow CloudWatch Error logging and choose the present function of the stack and create the brand new Firehose stream.

Use Kinesis Information Generator to publish information into your Firehose stream. You may have to refresh the web page or change areas in order that it refreshes and reveals the newly created supply stream.

Don’t make any adjustments to the template and begin sending knowledge to the firehose-iceberg-events-2 stream.

Run the next question in Athena to see knowledge within the firehose_events_2 desk. Word that you would be able to ship up to date information for a similar distinctive key (identical customer_id worth) into your Firehose stream, and Firehose robotically applies document updates within the vacation spot desk. Thus, once you question knowledge in Athena, you will notice just one document for every distinctive worth of customer_id, even you probably have despatched a number of updates into your stream.

SELECT customer_id, depend(*) 
FROM "firehose_iceberg_db"."firehose_events_2" 
GROUP BY customer_id LIMIT 10;

Use case 3: Route information to completely different tables based mostly on the content material of the incoming knowledge by specifying a JSON Question expression

Till now, you supplied the routing and operation data as static values to carry out operations on a single desk. Nevertheless, you may specify JSON Question expressions to outline how Firehose ought to retrieve the vacation spot database, vacation spot desk, and operation out of your incoming knowledge stream, and accordingly route the document and carry out the corresponding operation. Based mostly in your specification, Firehose robotically routes and delivers every document into the suitable vacation spot desk and applies the corresponding operation.

Create a Firehose stream:

Return to the Amazon Information Firehose console.
Select Create Firehose Stream.
For Supply, choose Direct PUT. For Vacation spot, choose Apache Iceberg Tables.
For the Firehose stream identify, enter firehose-iceberg-events-3.
In Vacation spot settings, allow Inline parsing for routing data.
- For Database expression, present the identical worth as earlier than as a static string: “firehose_iceberg_db”
- For Desk expression, retrieve this worth from the nested incoming document utilizing JSON Question.
- For Operation expression, we may even retrieve this worth from our nested document utilizing JSON Question.

When you’ve got the next incoming occasions with completely different occasion values, With the previous JSON Question expressions, Firehose will parse and get “firehose_event_3” or “firehose_event_4” because the desk names, and “replace” because the supposed operation from the incoming information.

{ "sort": {   "gadget": "pill",  
"occasion": "firehose_events_3",   "motion": "replace" },
"customer_id": "112", "event_timestamp": "2024-10-02T15:46:52.901",
"area": "pdx"}
{ "sort": {   "gadget": "pill",  
"occasion": "firehose_events_4",   "motion": "replace" },
"customer_id": "112", "event_timestamp": "2024-10-02T15:46:52.901",
"area": "pdx"}

As a result of that is an replace operation, you’ll want to configure distinctive keys for every desk. Additionally, since you need to ship information to a number of Iceberg tables, you’ll want to present configurations for every of the 2 vacation spot tables that information will be written to.
Change the Buffer hints to 1 MiB and 60
In Backup settings, choose the identical bucket from the stack, however within the error output prefix enter the next:
In Superior settings, choose the present IAM function created by the CloudFormation stack and create the brand new Firehose stream.
In Kinesis Information Generator, refresh the web page and choose the newly created Firehose stream: firehose-iceberg-events-3

Should you question the firehose_events_3 and firehose_events_4 tables utilizing Athena, it’s best to discover the info routed to proper tables by Firehose utilizing the routing data retrieved utilizing JSON Question expressions.

Desk beneath displaying occasions with occasion “firehose_events_3”

The next determine reveals Firehose Occasions Desk 4.

Use Case 4: Route information to completely different tables based mostly on the content material of the incoming knowledge through the use of a Lambda operate

There could be situations the place routing data isn’t available within the enter document. You may need to parse and course of incoming information or carry out a lookup to find out the place to ship the document and whether or not to carry out an replace or delete operation. For such situations, you should utilize a Lambda operate to generate the routing data and operation specification. Firehose robotically invokes your Lambda operate for a batch of information (with a configurable batch dimension). You’ll be able to course of incoming information in your Lambda operate and supply the routing data and operation within the output of the operate. To be taught extra about the right way to course of Firehose information utilizing Lambda, see Remodel supply knowledge in Amazon Information Firehose. After executing your Lambda operate, Firehose appears for routing data and operations within the metadata fields (within the following format) supplied by your Lambda operate.

    "metadata":{
        "otfMetadata":{
            "destinationTableName":"firehose_iceberg_db",
            "destinationDatabaseName":"firehose_events_*",
            "operation":"insert"
        }

So, on this use case, you’ll discover how one can create customized routing guidelines based mostly on different values of your information. Particularly, for this use case, you’ll route all information with a price for Area of ‘pdx’ to desk 3 and all information with a area worth of ‘nyc’ to desk 4.

The CloudFormation template has already created the customized processing Lambda operate for you, which has the next code:

import base64
import json
print('Loading operate')

def lambda_handler(occasion, context):
    firehose_records_output = {'information': []}

    for document in occasion['records']:
        payload = base64.b64decode(document['data']).decode('utf-8')
        # Course of the payload based mostly on area
        payload_json = json.masses(payload)
        area = payload_json.get('area', '')
        firehose_record_output = {}
        if area == 'pdx':
            firehose_record_output['metadata'] = {
                'otfMetadata': {
                    'destinationDatabaseName': 'firehose_iceberg_db',
                    'destinationTableName': 'firehose_events_3',
                    'operation': 'insert'
                }
            }
        elif area == 'nyc':
            firehose_record_output['metadata'] = {
                'otfMetadata': {
                    'destinationDatabaseName': 'firehose_iceberg_db',
                    'destinationTableName': 'firehose_events_4',
                    'operation': 'insert'
                }
            }

        # Create output with correct document ID, output knowledge, outcome, and metadata
        firehose_record_output['recordId'] = document['recordId']
        firehose_record_output['result'] = 'Okay'
        firehose_record_output['data'] = base64.b64encode(payload.encode('utf-8'))
        firehose_records_output['records'].append(firehose_record_output)

    return firehose_records_output

Configure the Firehose stream:

Return to the Information Firehose console.
Select Create Firehose stream.
For Supply, choose Direct PUT. For Vacation spot, choose Apache Iceberg Tables.
For the Firehose stream identify, enter firehose-iceberg-events-4.
In Remodel information, choose Activate knowledge transformation.
Browse and choose the operate created by the CloudFormation stack:
- Firehose-Iceberg-Stack-FirehoseProcessingLambda-*.
- For Model choose $LATEST.
You’ll be able to depart the Vacation spot Settings as default as a result of the Lambda operate will present the required metadata for routing.
Change the Buffer hints to 1 MiB and 60 seconds.
In Backup settings, choose the identical S3 bucket from the stack, however within the error output prefix, enter the next:
In Superior settings, choose the present function of the stack and create the brand new Firehose stream.
In Kinesis Information Generator, refresh the web page and choose the newly created firehose stream: firehose-iceberg-events-4.

Should you run the next question, you will notice that the final information that had been inserted into the desk are solely within the Area of ‘nyc’.

SELECT * FROM "firehose_iceberg_db"."firehose_events_4" 
order by event_timestamp desc 
restrict 10;

Issues and limitations

Earlier than utilizing Information Firehose with Apache Iceberg, it’s essential to pay attention to issues and limitations. For extra data, see Issues and limitations.

Clear up

To keep away from future fees, delete the sources you created in AWS Glue, Information Catalog, and the S3 bucket used for storage.

Conclusion

It’s simple to arrange Firehose streams to ship streaming information into Apache Iceberg tables in Amazon S3. We hope that this put up helps you get began with constructing some wonderful purposes with out having to fret about writing and managing complicated software code or having to handle infrastructure.

To be taught extra about utilizing Amazon Information Firehose with Apache Iceberg, see the Firehose Developer Information or attempt the Immersion day workshop.

Concerning the authors

Diego Garcia Garcia is a Specialist SA Supervisor for Analytics at AWS. His experience spans throughout Amazon’s analytics providers, with a selected deal with real-time knowledge processing and superior analytics architectures. Diego leads a group of specialist options architects throughout EMEA, collaborating carefully with prospects spanning throughout a number of industries and geographies to design and implement options to their knowledge analytics challenges.

Francisco Morillo is a Streaming Options Architect at AWS. Francisco works with AWS prospects, serving to them design real-time analytics architectures utilizing AWS providers, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.

Phaneendra Vuliyaragoli is a Product Administration Lead for Amazon Information Firehose at AWS. On this function, Phaneendra leads the product and go-to-market technique for Amazon Information Firehose.

Stream real-time knowledge into Apache Iceberg tables in Amazon S3 utilizing Amazon Information Firehose

Resolution overview

Walkthrough

Stipulations

Deploy the answer

Use case 1: Ship knowledge from a stream right into a single Iceberg desk and insert all incoming information

Use case 2: Ship knowledge from a stream right into a single Iceberg desk and carry out document inserts, updates, and deletes

Use case 3: Route information to completely different tables based mostly on the content material of the incoming knowledge by specifying a JSON Question expression

Use Case 4: Route information to completely different tables based mostly on the content material of the incoming knowledge through the use of a Lambda operate

Issues and limitations

Clear up

Conclusion

Concerning the authors

Related Articles

Amazon Redshift publicizes historical past mode for zero-ETL integrations to simplify historic knowledge monitoring and evaluation

Manned-Unmanned Teaming UAVs Primoco UAV SE

Robert Pierce, Co-Founder & Chief Science Officer at DecisionNext – Interview Collection

LEAVE A REPLY Cancel reply

Latest Articles

Amazon Redshift publicizes historical past mode for zero-ETL integrations to simplify historic knowledge monitoring and evaluation

Manned-Unmanned Teaming UAVs Primoco UAV SE

Robert Pierce, Co-Founder & Chief Science Officer at DecisionNext – Interview Collection

Molecular-scale CO spillover on a dual-site electrocatalyst enhances methanol manufacturing from CO2 discount

Building 2025: Up or Down