Organizations are rapidly expanding their digital presence, creating opportunities to serve customers better through web applications. AWS WAF logs play an important role in this expansion by enabling organizations to proactively monitor security, enforce compliance, and strengthen application protection. AWS WAF log analysis is essential across many industries, including banking, retail, and healthcare, each needing to deliver secure digital experiences.
To optimize their security operations, organizations are adopting modern approaches that combine real-time monitoring with scalable data analytics. They’re using data lake architectures and Apache Iceberg to efficiently process large volumes of security data while minimizing operational overhead. Apache Iceberg combines enterprise reliability with SQL simplicity when working with security data stored in Amazon Simple Storage Service (Amazon S3), enabling organizations to focus on security insights rather than infrastructure management.
Apache Iceberg enhances security analytics through several key capabilities. It seamlessly integrates with various AWS services and analysis tools while supporting concurrent read-write operations for simultaneous log ingestion and analysis. Its time travel feature enables thorough security forensics and incident investigation, and its schema evolution support allows teams to adapt to emerging security patterns without disrupting existing workflows. These capabilities make Apache Iceberg an ideal choice for building robust security analytics solutions. However, organizations often struggle when building their own solutions to deliver data to Apache Iceberg tables. These challenges include managing complex extract, transform, and load (ETL) processes, handling schema validation, ensuring reliable delivery, and maintaining custom code for data transformations. Teams must also build resilient error handling, implement retry logic, and manage scaling infrastructure, all while maintaining data consistency and high availability. These challenges take valuable time away from analyzing security data and deriving insights.
To address these challenges, Amazon Data Firehose provides real-time data delivery to Apache Iceberg tables within seconds. Firehose delivers high reliability across multiple Availability Zones while automatically scaling to match throughput requirements. It’s fully managed and requires no infrastructure management or custom code development. Firehose delivers streaming data with configurable buffering options that can be optimized for near-zero latency. It also provides built-in data transformation, compression, and encryption capabilities, along with automatic retry mechanisms to ensure reliable data delivery. This makes it an ideal choice for streaming AWS WAF logs directly into a data lake while minimizing operational overhead.
In this post, we demonstrate how to build a scalable AWS WAF log analysis solution using Firehose and Apache Iceberg. Firehose simplifies the entire process, from log ingestion to storage, by allowing you to configure a delivery stream that delivers AWS WAF logs directly to Apache Iceberg tables in Amazon S3. The solution requires no infrastructure setup, and you pay only for the data you process.
Solution overview
To implement this solution, you first configure AWS WAF logging to capture web traffic information. This captures detailed information about the traffic analyzed by the web access control lists (ACLs). Each log entry includes the request timestamp, detailed request information, and the rule matches that were triggered. These logs are continuously streamed to Firehose in real time.
Firehose writes these logs into an Apache Iceberg table, which is stored in Amazon S3. When Firehose delivers data to the table in Amazon S3, it uses the AWS Glue Data Catalog to store and manage table metadata. This metadata includes schema information, partition details, and file locations, enabling seamless data discovery and querying across AWS analytics services.
Finally, security teams can analyze the data in the Apache Iceberg tables using various AWS services, including Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker. For this demonstration, we use Athena to run SQL queries against the security logs.
The following diagram illustrates the solution architecture.
The implementation consists of four steps:
- Deploy the base infrastructure using AWS CloudFormation.
- Create an Apache Iceberg table using an AWS Glue notebook.
- Create a Firehose stream to handle the log data.
- Configure AWS WAF logging to send data to the Apache Iceberg table through the Firehose stream.
You can deploy the required resources into your AWS environment in the US East (N. Virginia) AWS Region using a CloudFormation template. This template creates an S3 bucket for storing AWS WAF logs, an AWS Glue database for the Apache Iceberg tables, and the AWS Identity and Access Management (IAM) roles and policies needed for the solution.
Prerequisites
Before you get started, make sure you have the following prerequisites:
- An AWS account with access to the US East (N. Virginia) Region
- AWS WAF configured with a web ACL in the US East (N. Virginia) Region
If you don’t have AWS WAF set up, refer to the AWS WAF Workshop to create a sample web application with AWS WAF.
AWS WAF logs use case-sensitive field names (like httpRequest and webaclId). For successful log ingestion, this solution uses the Apache Iceberg API through an AWS Glue job to create tables; this is a reliable approach that preserves the exact field names from the AWS WAF logs. Although AWS Glue crawlers and Athena DDL statements offer convenient ways to create Apache Iceberg tables, they convert mixed-case column names to lowercase, which can affect AWS WAF log processing. By using an AWS Glue job with the Apache Iceberg API, the case sensitivity of column names is preserved, providing accurate mapping between AWS WAF log fields and table columns.
Deploy the CloudFormation stack
Complete the following steps to deploy the solution resources with AWS CloudFormation:
- Sign in to the AWS CloudFormation console.
- Choose Launch Stack.
- Choose Next.
- For Stack name, leave the default WAF-Firehose-Iceberg-Stack.
- Under Parameters, specify whether AWS Lake Formation permissions are to be used for the AWS Glue tables.
- Choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Next.
- Review the deployment and choose Submit.
The stack takes a few minutes to deploy. After the deployment is complete, you can review the created resources by navigating to the Resources tab of the CloudFormation stack.
Create an Apache Iceberg table
Before setting up the Firehose delivery stream, you must create the destination Apache Iceberg table in the Data Catalog. This is done using an AWS Glue job and the Apache Iceberg API, as discussed earlier. Complete the following steps to create an Apache Iceberg table:
- On the AWS Glue console, choose Notebooks under ETL jobs in the navigation pane.
- Choose the Notebook option under Create job.
- Under Options, select Start fresh.
- For IAM role, choose WAF-Firehose-Iceberg-Stack-GlueServiceRole-*.
- Choose Create notebook.
- Enter a configuration command in the notebook to configure the Spark session with Apache Iceberg extensions (a sample configuration cell is shown after this list). Be sure to update the sql.catalog.glue_catalog.warehouse configuration to the S3 bucket created by the CloudFormation template.
- Enter SQL in the AWS Glue notebook to create the Apache Iceberg table (example DDL is shown after this list).
- Navigate to the Data Catalog and the waf_logs_db database to confirm that the table firehose_waf_logs is created.
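The following is a minimal sketch of the notebook configuration cell, assuming an AWS Glue interactive session with the Iceberg data lake format enabled; replace <S3BucketName> with the bucket created by the CloudFormation template:

```
%%configure
{
  "--datalake-formats": "iceberg",
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=s3://<S3BucketName>/"
}
```

The table DDL below is a hedged example that you can run in the same notebook (for example, with the %%sql cell magic or spark.sql()). The column list is a representative subset of the AWS WAF log schema rather than the full schema; note that mixed-case column names such as webaclId and httpRequest are preserved:

```sql
CREATE TABLE glue_catalog.waf_logs_db.firehose_waf_logs (
    `timestamp`         bigint,
    formatVersion       int,
    webaclId            string,
    terminatingRuleId   string,
    terminatingRuleType string,
    action              string,
    httpSourceName      string,
    httpSourceId        string,
    httpRequest         struct<
        clientIp: string,
        country: string,
        uri: string,
        args: string,
        httpVersion: string,
        httpMethod: string,
        requestId: string
    >
)
USING iceberg;
```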
Create a Firehose stream
Complete the following steps to create a Firehose stream:
- On the Amazon Data Firehose console, choose Create Firehose stream.
- Choose Direct PUT for Source and Apache Iceberg Tables for Destination.
- For Firehose stream name, enter aws-waf-logs-firehose-iceberg-1.
- In the Destination settings section, enable Inline parsing for routing information. Because we’re sending all records to one table, specify the destination database and table names:
- For Database expression, enter "waf_logs_db".
- For Table expression, enter "firehose_waf_logs".
Make sure to include the double quotation marks so that the literal value is used for the database and table name. If you don’t use double quotation marks, Firehose assumes that it is a JSON Query expression and will attempt to parse the expression when processing your stream and fail. Firehose can also route to different Apache Iceberg tables based on the content of the data. For more information, refer to Route incoming records to different Iceberg Tables.
- For S3 backup bucket, enter the S3 bucket created by the CloudFormation template.
- For S3 backup bucket error output prefix, enter error/events-1/.
- Under Advanced settings, select Enable server-side encryption for source records in Firehose stream.
- For Existing IAM roles, choose the role that begins with WAF-Firehose-Iceberg-stack-FirehoseIAMRole-*, created by the CloudFormation template.
- Choose Create Firehose stream.
Configure AWS WAF logging to the Firehose stream
Complete the following steps to configure AWS WAF to send its logs to the Firehose stream:
- On the AWS WAF console, choose Web ACLs in the navigation pane.
- Choose your web ACL.
- On the Logging and metrics tab, choose Enable.
- For Amazon Data Firehose stream, choose the stream aws-waf-logs-firehose-iceberg-1.
- Choose Save.
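If you prefer to script this step instead of using the console, the following boto3 sketch makes the equivalent call; the account ID and web ACL ARN are placeholders that you need to replace with your own values:

```python
import boto3

# AWS WAF (WAFV2) client in the same Region as the web ACL and the Firehose stream
wafv2 = boto3.client("wafv2", region_name="us-east-1")

# Placeholder ARNs: replace with your web ACL ARN and the ARN of the
# aws-waf-logs-firehose-iceberg-1 stream created earlier
web_acl_arn = "arn:aws:wafv2:us-east-1:111122223333:regional/webacl/my-web-acl/EXAMPLE-ID"
firehose_arn = "arn:aws:firehose:us-east-1:111122223333:deliverystream/aws-waf-logs-firehose-iceberg-1"

# Send AWS WAF logs for the web ACL to the Firehose stream
wafv2.put_logging_configuration(
    LoggingConfiguration={
        "ResourceArn": web_acl_arn,
        "LogDestinationConfigs": [firehose_arn],
    }
)
```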
Query and analyze the logs
You can query the data you’ve written to your Apache Iceberg tables using different processing engines, such as Apache Spark, Apache Flink, or Trino. In this example, we use Athena to query the AWS WAF log data stored in Apache Iceberg tables. Complete the following steps:
- On the Athena console, choose Settings in the top right corner.
- For Location of query result, enter the S3 bucket created by the CloudFormation template: s3://<S3BucketName>/athena/.
- Enter the AWS account ID for Expected bucket owner and choose Save.
- In the query editor, under Tables and views, choose the options menu next to firehose_waf_logs and choose Preview Table (the equivalent SQL is shown after this list).
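Preview Table runs a query equivalent to the following, which you can also enter directly in the query editor:

```sql
SELECT * FROM "waf_logs_db"."firehose_waf_logs" LIMIT 10;
```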
You should be able to see the AWS WAF logs in the Apache Iceberg tables by using Athena.
The following are some additional useful example queries; sample SQL for both is shown after this list:
- Identify potential attack sources by analyzing blocked IP addresses
- Monitor attack patterns and trends over time
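The SQL below is one hedged way to write these queries against the table created earlier; the timestamp field in AWS WAF logs is in epoch milliseconds, and your field names may differ if you changed the table schema. To identify potential attack sources by counting blocked requests per client IP:

```sql
SELECT httprequest.clientip AS client_ip,
       count(*) AS blocked_requests
FROM waf_logs_db.firehose_waf_logs
WHERE action = 'BLOCK'
GROUP BY httprequest.clientip
ORDER BY blocked_requests DESC
LIMIT 10;
```

To monitor attack patterns and trends over time, aggregated by hour, terminating rule, and action:

```sql
SELECT date_trunc('hour', from_unixtime("timestamp" / 1000)) AS event_hour,
       terminatingruleid,
       action,
       count(*) AS request_count
FROM waf_logs_db.firehose_waf_logs
GROUP BY 1, 2, 3
ORDER BY event_hour DESC, request_count DESC;
```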
Apache Iceberg table optimization
Although Firehose enables efficient streaming of AWS WAF logs into Apache Iceberg tables, the nature of streaming writes can result in many small files being created, because Firehose delivers data based on its buffering configuration. This can lead to suboptimal query performance. To address this, regular table optimization is recommended.
There are two recommended table optimization approaches:
- Compaction – Data compaction merges small data files to reduce storage usage and improve read performance. Data files are merged and rewritten to remove obsolete data and consolidate fragmented data into larger, more efficient files.
- Storage optimization – You can manage storage overhead by removing older, unnecessary snapshots and their associated underlying files. Additionally, this includes periodically deleting orphan files to maintain efficient storage usage and optimal query performance.
These optimizations can be implemented using either the Data Catalog or Athena.
Table optimization using the Data Catalog
The Data Catalog provides automated table optimization features. Within the table optimization feature, you can configure specific optimizers for compaction, snapshot retention, and orphan file deletion. The table optimization schedule can be managed, and its status monitored, from the AWS Glue console.
Table optimization using Athena
Athena supports manual optimization through SQL commands. The OPTIMIZE command rewrites small files into larger files and applies file compaction:
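For example, for the table created in this post (you can add a WHERE clause to limit compaction to specific partitions or time ranges):

```sql
OPTIMIZE waf_logs_db.firehose_waf_logs REWRITE DATA USING BIN_PACK;
```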
The VACUUM command removes old snapshots and cleans up expired data files:
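For example, for the same table:

```sql
VACUUM waf_logs_db.firehose_waf_logs;
```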
You can monitor the table’s optimization status by querying the Iceberg metadata tables that Athena exposes for the table.
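For example, the following sketch uses the $files metadata table to report the number and average size of data files; after compaction, the file count should drop and the average file size should grow:

```sql
SELECT count(*) AS data_files,
       round(avg(file_size_in_bytes) / 1048576.0, 2) AS avg_file_size_mb,
       round(sum(file_size_in_bytes) / 1073741824.0, 2) AS total_size_gb
FROM "waf_logs_db"."firehose_waf_logs$files";
```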
Clean up
To avoid future charges, complete the following steps:
- Empty the S3 bucket.
- Delete the CloudFormation stack.
- Delete the Firehose stream.
- Disable AWS WAF logging.
Conclusion
In this post, we demonstrated how to build an AWS WAF log analytics pipeline using Firehose to deliver AWS WAF logs to Apache Iceberg tables on Amazon S3. The solution handles large-scale AWS WAF log processing without requiring complex code or infrastructure management. Although this post focused on Apache Iceberg tables as the destination, Data Firehose also seamlessly integrates with Amazon S3 Tables. To optimize your tables for querying, Amazon S3 Tables continually performs automatic maintenance operations, such as compaction, snapshot management, and unreferenced file removal. These operations enhance table performance by compacting smaller objects into fewer, larger files.
To get started with your own implementation, try the solution in your AWS account and explore the following resources for more features and best practices:
About the Authors
Charishma Makineni is a Senior Technical Account Manager at AWS. She provides strategic technical guidance to Independent Software Vendors (ISVs) to build and optimize solutions on AWS. She specializes in Big Data and Analytics technologies, helping organizations optimize their data-driven initiatives on AWS.
Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.