This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue. We show how to build data pipelines using AWS Glue jobs, optimize them for both cost and performance, and implement schema evolution to automate manual tasks. To review the first part of the series, where we load SQL Server data into Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS), see Modernize your legacy databases with AWS data lakes, Part 1: Migrate SQL Server using AWS DMS.
Solution overview
In this post, we go over the process of building the data lake, provide the rationale behind the different decisions, and share best practices for building such a solution.
The following diagram illustrates the different layers of the data lake.
To load data into the data lake, AWS Step Functions can define the workflow, Amazon Simple Queue Service (Amazon SQS) can track the order of incoming files, and AWS Glue jobs and the Data Catalog can be used to create the silver layer of the data lake. AWS DMS produces files and writes them to the bronze bucket (as we explained in Part 1).
We can turn on Amazon S3 notifications and push the newly arriving file names to an SQS first-in-first-out (FIFO) queue. A Step Functions state machine can consume messages from this queue to process the files in the order they arrive.
To process the files, we need to create two types of AWS Glue jobs:
- Full load – This job loads the complete table data dump into an Iceberg table. Data types from the source are mapped to Iceberg data types. After the data is loaded, the job updates the Data Catalog with the table schemas.
- CDC – This job loads the change data capture (CDC) files into the respective Iceberg tables. The AWS Glue job uses the Iceberg schema evolution feature to handle schema changes such as the addition or deletion of columns.
As in Part 1, the AWS DMS tasks place the full load and CDC files from the source database (SQL Server) in the raw S3 bucket. We then process this data using AWS Glue and save it to the silver bucket in Iceberg format. AWS Glue has a plugin for Iceberg; for details, see Using the Iceberg framework in AWS Glue.
Along with moving data from the bronze to the silver bucket, we also create and update the Data Catalog for further processing of the data for the gold bucket.
The following diagram illustrates how the full load and CDC jobs are defined inside the Step Functions workflow.
In this post, we focus on the AWS Glue jobs for defining the workflow. We recommend using AWS Step Functions Workflow Studio, and setting up Amazon S3 event notifications and an SQS FIFO queue to receive the file names as messages.
Prerequisites
To follow along with the solution, you need the following prerequisites set up, as well as certain access rights and AWS Identity and Access Management (IAM) privileges:
- An IAM role to run AWS Glue jobs
- IAM privileges to create AWS DMS resources (this role was created in Part 1 of this series; you can use the same role here)
- The AWS DMS task from Part 1 running and producing files from the source database on Amazon S3
Create an AWS Glue connection for the source database
We need to create a connection between AWS Glue and the source SQL Server database so the AWS Glue job can query the source for the latest schema while loading the data files. To create the connection, complete the following steps:
- On the AWS Glue console, choose Connections in the navigation pane.
- Choose Create custom connector.
- Give the connection a name and choose JDBC as the connection type.
- In the JDBC URL section, enter the following string, replacing the name of your source database endpoint and the database that was set up in Part 1:
jdbc:sqlserver://{Your RDS Endpoint Name}:1433;databaseName={Your Database Name}
- Select Require SSL connection, then choose Create connector.
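If you prefer to script this step, a connection with equivalent properties could be created with boto3 along the lines of the following sketch; the connection name, JDBC URL, secret ID, and VPC settings are placeholders.

```python
# Minimal sketch: create the JDBC connection programmatically with boto3.
# All names, IDs, and the endpoint below are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "sqlserver-source-connection",  # placeholder connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:sqlserver://my-rds-endpoint:1433;databaseName=mydb",
            "JDBC_ENFORCE_SSL": "true",
            # Credentials can come from an AWS Secrets Manager secret
            "SECRET_ID": "my-sqlserver-secret",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",           # placeholder
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
            "AvailabilityZone": "us-east-1a",                 # placeholder
        },
    }
)
```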
Create and configure the full load AWS Glue job
Complete the following steps to create the full load job:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Choose Script editor and select Spark.
- Choose Start fresh and select Create script.
- Enter a name for the full load job and choose the IAM role (mentioned in the prerequisites) for running the job.
- Finish creating the job.
- On the Job details tab, expand Advanced properties.
- In the Connections section, add the connection you created.
- Under Job parameters, pass the following arguments to the job (a sketch of how the job reads these parameters follows the list):
- target_s3_bucket – The silver S3 bucket name.
- source_s3_bucket – The raw S3 bucket name.
- secret_id – The ID of the AWS Secrets Manager secret for the source database credentials.
- dbname – The source database name.
- datalake-formats – This sets the data format to iceberg.
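As a minimal sketch, the job can read these parameters with getResolvedOptions; note that datalake-formats is consumed by AWS Glue itself to load the Iceberg libraries and doesn't need to be read in the script.

```python
# Sketch: read the job parameters listed above inside the Glue job script.
import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(
    sys.argv,
    ["JOB_NAME", "target_s3_bucket", "source_s3_bucket", "secret_id", "dbname"],
)

target_bucket = args["target_s3_bucket"]  # silver bucket
source_bucket = args["source_s3_bucket"]  # raw (bronze) bucket
secret_id = args["secret_id"]             # Secrets Manager secret for the source DB
dbname = args["dbname"]                   # source database name
```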
The full load AWS Glue job starts after the AWS DMS task reaches 100%. The job loops over the files located in the raw S3 bucket and processes them one at a time. For each file, the job infers the table name from the file name and gets the source table schema, including column names and primary keys.
If the table has one or more primary keys, the job creates an equivalent Iceberg table. If the table has no primary key, the file is not processed. In our use case, all the tables have primary keys, so we enforce this check. Depending on your data, you might need to handle this scenario differently.
You can use code along the following lines to process the full load files. To start the job, choose Run.
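This is a minimal sketch, assuming the AWS DMS full load files are CSV files with column headers laid out under a schema/table/ prefix in the raw bucket; the catalog configuration, file-listing logic, and primary key handling are illustrative and would need to be adapted to your environment.

```python
# Minimal full load sketch: read each table's full load CSV dump from the raw
# bucket and write it to the silver bucket as an Iceberg table registered in
# the Data Catalog. Bucket layout and naming are assumptions.
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "target_s3_bucket", "source_s3_bucket", "dbname"]
)

# Register an Iceberg catalog backed by the Glue Data Catalog, with the
# warehouse pointing at the silver bucket.
conf = SparkConf()
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", f"s3://{args['target_s3_bucket']}/iceberg/")
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")

spark = GlueContext(SparkContext(conf=conf)).spark_session

# Assumed layout of the AWS DMS full load files: s3://<raw>/<schema>/<table>/LOAD*.csv
# (assumes the DMS S3 endpoint writes column headers, for example AddColumnName=true).
s3 = boto3.client("s3")
tables = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=args["source_s3_bucket"]):
    for obj in page.get("Contents", []):
        parts = obj["Key"].split("/")
        if len(parts) >= 3 and parts[-1].startswith("LOAD"):
            tables.add((parts[0], parts[1]))

for schema, table in tables:
    # A real job would query the source via the JDBC connection for column
    # names and primary keys, and skip tables without a primary key.
    df = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(f"s3://{args['source_s3_bucket']}/{schema}/{table}/LOAD*.csv")
    )
    # Writing through the Glue catalog creates the table in the Data Catalog.
    df.writeTo(f"glue_catalog.{args['dbname']}.{table.lower()}").using("iceberg").createOrReplace()
```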
When the job is complete, it creates the database and tables in the Data Catalog, as shown in the following screenshot.
Create and configure the CDC AWS Glue job
The CDC AWS Glue job is created in a similar way to the full load job. As with the full load AWS Glue job, you need to use the source database connection and pass the job parameters, with one additional parameter, cdc_file, which contains the location of the CDC file to be processed. Because a CDC file can contain data for multiple tables, the job loops over the tables in a file and loads the table metadata from the source table (RDS column names).
If the CDC operation is DELETE, the job deletes the records from the Iceberg table. If the CDC operation is INSERT or UPDATE, the job merges the data into the Iceberg table.
You can use code along the following lines to process the CDC files. To start the job, choose Run.
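This is a minimal sketch, assuming the AWS DMS CDC CSV files carry the operation code (I/U/D) in an Op column, and using a hypothetical split_by_table() helper that groups a file's rows per table and looks up the primary key columns from the source database; adapt the layout and column handling to your environment.

```python
# Minimal CDC sketch: apply one AWS DMS CDC file to the corresponding Iceberg
# tables, using DELETE for "D" rows and MERGE INTO for "I"/"U" rows.
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "dbname", "cdc_file"])

conf = SparkConf()
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")

spark = GlueContext(SparkContext(conf=conf)).spark_session

# Assumed row layout: a DMS operation code column "Op" (I/U/D) followed by the
# table's columns. split_by_table() is a hypothetical helper that groups the
# file's rows per table and looks up primary key columns from the source.
cdc_df = spark.read.option("header", "true").csv(args["cdc_file"])

for table_name, table_df, pk_cols in split_by_table(cdc_df):
    target = f"glue_catalog.{args['dbname']}.{table_name}"
    table_df.createOrReplaceTempView("changes")
    join_cond = " AND ".join(f"t.{c} = s.{c}" for c in pk_cols)
    col_list = ", ".join(c for c in table_df.columns if c != "Op")

    # Deletes: remove rows whose keys appear with a "D" operation.
    spark.sql(f"""
        DELETE FROM {target} t
        WHERE EXISTS (SELECT 1 FROM changes s WHERE s.Op = 'D' AND {join_cond})
    """)

    # Inserts and updates: upsert the remaining changes with Iceberg MERGE INTO.
    # A real job would also deduplicate multiple changes per key, keeping the latest.
    spark.sql(f"""
        MERGE INTO {target} t
        USING (SELECT {col_list} FROM changes WHERE Op IN ('I', 'U')) s
        ON {join_cond}
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
```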
The Iceberg MERGE INTO syntax can handle cases where a new column is added. For more details on this feature, see the Iceberg MERGE INTO syntax documentation. If the CDC job needs to process many tables in a CDC file, the job can be multi-threaded to process the file in parallel.
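As an illustration, one way to multi-thread the per-table work is with Python's ThreadPoolExecutor; apply_table_changes() here is a hypothetical helper wrapping the per-table DELETE and MERGE logic from the sketch above, and Spark schedules actions submitted from separate threads concurrently on the same cluster.

```python
# Sketch: process the tables found in one CDC file in parallel threads.
# apply_table_changes() is the hypothetical per-table DELETE/MERGE routine
# from the previous sketch; tune max_workers to the cluster size.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_cdc_file(tables):
    """tables: iterable of (table_name, table_df, pk_cols) tuples."""
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {
            executor.submit(apply_table_changes, name, df, pks): name
            for name, df, pks in tables
        }
        for future in as_completed(futures):
            table = futures[future]
            future.result()  # re-raise any per-table failure
            print(f"Finished applying CDC changes for {table}")
```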
Configure EventBridge notifications, SQS queue, and Step Functions state machine
You can configure Amazon S3 to send notifications to EventBridge when certain events occur in a bucket, such as when new objects are created or deleted. For this post, we're interested in the events generated when new CDC files from AWS DMS arrive in the bronze S3 bucket. You can create event notifications for new objects and insert the file names into an SQS queue. A Lambda function inside Step Functions consumes messages from the queue, extracts the file name, starts a CDC AWS Glue job, and passes the file name as a parameter to the job.
AWS DMS CDC files contain database insert, update, and delete statements. We need to process these in order, so we use an SQS FIFO queue, which preserves the order in which messages arrive. You can also configure Amazon SQS to set a time to live (TTL); this parameter defines how long a message stays in the queue before it expires.
Another important parameter to consider when configuring an SQS queue is the message visibility timeout. While a message is being processed, it disappears from the queue to make sure it isn't consumed by multiple consumers (AWS Glue jobs in our case). If the message is consumed successfully, it should be deleted from the queue before the visibility timeout. However, if the visibility timeout expires and the message isn't deleted, the message reappears in the queue. In our solution, this timeout must be greater than the time it takes the CDC job to process a file.
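As an illustration, such a FIFO queue could be created with boto3 as follows; the queue name, retention period, and visibility timeout are placeholder values to tune to your CDC job duration.

```python
# Sketch: create the SQS FIFO queue with a message retention period (TTL) and
# a visibility timeout longer than the expected CDC job runtime.
import boto3

sqs = boto3.client("sqs")

response = sqs.create_queue(
    QueueName="dms-cdc-files.fifo",  # FIFO queue names must end in .fifo
    Attributes={
        "FifoQueue": "true",
        "ContentBasedDeduplication": "true",
        "MessageRetentionPeriod": str(14 * 24 * 3600),  # 14 days, the maximum
        "VisibilityTimeout": str(30 * 60),  # 30 minutes > expected CDC job time
    },
)
print(response["QueueUrl"])
```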
Finally, we recommend using Step Functions to define the workflow for handling the full load and CDC files. Step Functions has built-in integrations with other AWS services such as Amazon SQS, AWS Glue, and Lambda, which makes it a good candidate for this use case.
The Step Functions state machine starts by checking the status of the AWS DMS task. The AWS DMS task can be queried to check the status of the full load, and we check the value of the parameter FullLoadProgressPercent. When this value reaches 100%, we can start processing the full load files. After the AWS Glue job processes the full load files, we start polling the SQS queue to check its size. If the queue size is greater than 0, new CDC files have arrived and we can start the AWS Glue CDC job to process them. The AWS Glue job processes the CDC files and deletes the messages from the queue. When the queue size reaches 0, the AWS Glue job exits and we loop back in the Step Functions workflow to check the SQS queue size again.
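The two checks this loop relies on map to a couple of API calls. The following sketch shows how a Lambda task inside the state machine might perform them with boto3; the replication task ARN and queue URL are placeholders.

```python
# Sketch: the two checks the workflow relies on, shown as plain boto3 calls.
# Replace the task ARN and queue URL with your own values.
import boto3

dms = boto3.client("dms")
sqs = boto3.client("sqs")

def full_load_complete(replication_task_arn: str) -> bool:
    """True when the AWS DMS full load has reached 100%."""
    task = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [replication_task_arn]}]
    )["ReplicationTasks"][0]
    return task.get("ReplicationTaskStats", {}).get("FullLoadProgressPercent", 0) == 100

def pending_cdc_files(queue_url: str) -> int:
    """Approximate number of CDC file messages waiting in the FIFO queue."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])
```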
Because the Step Functions state machine is intended to run indefinitely, keep in mind that there are service limits you need to adhere to: specifically, the maximum runtime of 1 year and the maximum run history size of 25,000 events (state transitions) per state machine run. We recommend adding an additional step at the end that checks whether either of these limits is being approached, stops the current state machine run, and starts a new one.
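One way to implement that final step is a small Lambda task that tracks an iteration counter carried in the workflow state and starts a fresh execution when a threshold is reached; the counter name, threshold, and environment variable below are assumptions for illustration.

```python
# Sketch of a Lambda task that restarts the state machine well before the
# 25,000-event history limit or 1-year runtime limit is reached.
# MAX_ITERATIONS and STATE_MACHINE_ARN are placeholders; the "iterations"
# counter is assumed to be incremented in the workflow state on each loop.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")
MAX_ITERATIONS = 2000  # restart comfortably below the history limit

def lambda_handler(event, context):
    iterations = event.get("iterations", 0)
    if iterations < MAX_ITERATIONS:
        return {"restart": False, "iterations": iterations}

    # Start a fresh run with a reset counter; the current execution can then
    # end through its final Succeed state.
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps({"iterations": 0}),
    )
    return {"restart": True, "iterations": 0}
```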
The following diagram illustrates how you can use the Step Functions state machine history size to monitor runs and start a new state machine run.
Configure the pipeline
The pipeline needs to be configured to meet cost, performance, and resilience goals. You might want a pipeline that loads fresh data into the data lake and makes it available quickly, and you might also want to optimize costs by loading large chunks of data into the data lake. At the same time, you should make the pipeline resilient and able to recover in case of failures. In this section, we cover the different parameters and recommended settings to achieve these goals.
Step Functions is designed to process the incoming AWS DMS CDC files by running AWS Glue jobs. AWS Glue jobs can take a couple of minutes to start up, and when they're running, it's efficient to process large chunks of data. You can configure how AWS DMS writes CSV files to Amazon S3 with the following AWS DMS task parameters:
- CdcMaxBatchInterval – Defines the maximum time AWS DMS waits before writing a batch to Amazon S3
- CdcMinFileSize – Defines the minimum file size AWS DMS writes to Amazon S3
Whichever condition is met first invokes the write operation. If you want to prioritize data freshness, use a short CdcMaxBatchInterval (10 seconds) and a small CdcMinFileSize (1–5 MB). This results in many small CSV files being written to Amazon S3 and starts a large number of AWS Glue jobs to process the data, making the extract, transform, and load (ETL) process faster. If you want to optimize costs, use a moderate CdcMaxBatchInterval (minutes) and a large CdcMinFileSize (100–500 MB). In this scenario, only a few AWS Glue jobs are started, each processing large chunks of data, which makes the ETL flow more efficient. In a real-world use case, the right values for these parameters usually fall somewhere in between, as a compromise between throughput and cost. You can configure these parameters when creating a target endpoint using the AWS DMS console, or by using the create-endpoint command in the AWS Command Line Interface (AWS CLI).
For the full list of parameters, see Using Amazon S3 as a target for AWS Database Migration Service.
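For example, a target endpoint with the two batching settings above could be created with boto3 as follows; note that CdcMinFileSize is specified in kilobytes, and the identifiers, role ARN, and values shown are placeholders.

```python
# Sketch: create the AWS DMS S3 target endpoint with the two batching settings
# discussed above. All names, ARNs, and values are placeholders.
import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="silver-lake-s3-target",  # placeholder
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-bronze-bucket",  # placeholder raw bucket
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",  # placeholder
        "CdcMaxBatchInterval": 60,  # seconds to wait before writing a batch
        "CdcMinFileSize": 64000,    # in KB, so roughly 64 MB per CDC file
    },
)
```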
Choosing the right AWS Glue worker types for the full load and CDC jobs is also critical for performance and cost optimization. The AWS Glue (Spark) workers range from G.1X to G.8X, with an increasing number of data processing units (DPUs). Full load files are usually much larger than CDC files, so it's more cost- and performance-effective to select a larger worker for the full load job. For CDC files, a smaller worker is often more cost-effective because the file sizes are smaller.
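For example, the two jobs could be created with different worker types using boto3; the job names, script locations, role, and worker counts below are placeholders.

```python
# Sketch: create the full load and CDC jobs with different worker types.
# Names, script paths, role, and worker counts are placeholders.
import boto3

glue = boto3.client("glue")

common = {
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
    "GlueVersion": "4.0",
    "DefaultArguments": {"--datalake-formats": "iceberg"},
}

# Full load files are large, so use a bigger worker.
glue.create_job(
    Name="full-load-job",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-scripts/full_load.py", "PythonVersion": "3"},
    WorkerType="G.4X",
    NumberOfWorkers=10,
    **common,
)

# CDC files are small, so a smaller worker is usually enough.
glue.create_job(
    Name="cdc-job",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-scripts/cdc.py", "PythonVersion": "3"},
    WorkerType="G.1X",
    NumberOfWorkers=2,
    **common,
)
```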
It’s best to design the Step Features state machine in such a manner that if something fails, the pipeline could be redeployed after restore and resume processing from the place it left off. One vital parameter right here is TTL for the messages within the SQS queue. This parameter defines how lengthy a message stays within the queue earlier than expiring. In case of failures, we would like this parameter to be lengthy sufficient for us to deploy a repair. Amazon SQS has a most of 14 days for a message’s TTL. We suggest setting this to a big sufficient worth to reduce messages being expired in case of pipeline failures.
Clean up
Complete the following steps to clean up the resources you created in this post:
- Delete the AWS Glue jobs:
- On the AWS Glue console, choose ETL jobs in the navigation pane.
- Select the full load and CDC jobs, and on the Actions menu, choose Delete.
- Choose Delete to confirm.
- Delete the Iceberg tables:
- On the AWS Glue console, under Data Catalog in the navigation pane, choose Databases.
- Choose the database in which the Iceberg tables reside.
- Select the tables to delete, choose Delete, and confirm the deletion.
- Delete the S3 bucket:
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Select the silver bucket and empty the files in the bucket.
- Delete the bucket.
Conclusion
In this post, we showed how to use AWS Glue jobs to load AWS DMS files into a transactional data lake framework such as Iceberg. In our setup, AWS Glue provided highly scalable and easy-to-maintain ETL jobs. Additionally, we shared a proposed solution using Step Functions to create an ETL pipeline workflow, with Amazon S3 notifications and an SQS queue to capture newly arriving files. We shared how to design this system to be resilient to failures and how to automate one of the most time-consuming tasks in maintaining a data lake: schema evolution.
In Part 3, we will share how to process the data lake to create data marts.
About the Authors
Shaheer Mansoor is a Senior Machine Learning Engineer at AWS, where he focuses on developing cutting-edge machine learning platforms. His expertise lies in creating scalable infrastructure to support advanced AI solutions. His focus areas are MLOps, feature stores, data lakes, model hosting, and generative AI.
Anoop Kumar K M is a Data Architect at AWS with a focus on the data and analytics area. He helps customers build scalable data platforms and shape their enterprise data strategy. His areas of interest are data platforms, data analytics, security, file systems, and operating systems. Anoop loves to travel and enjoys reading books in the crime fiction and financial domains.
Sreenivas Nettem is a Lead Database Consultant at AWS Professional Services. He has experience working with Microsoft technologies, with a specialization in SQL Server. He works closely with customers to help migrate and modernize their databases to AWS.