Monday, January 20, 2025

Simplify data integration with AWS Glue and zero-ETL to Amazon SageMaker Lakehouse


With the growing emphasis on data, organizations are constantly seeking more efficient and agile ways to integrate their data, especially from a wide variety of applications. Traditional extract, transform, and load (ETL) processes have long been a staple of data integration because of their flexibility. For common use cases such as replication and ingestion, however, they often prove time-consuming, complex, and less adaptable to the fast-changing demands of modern data architectures.

In addition, as organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To harness the full potential of their data, businesses must enable seamless access and consolidation from these varied sources. However, this task is complicated by the unique characteristics of modern systems, such as differing API protocols, implementations, and rate limits. To address these challenges and accelerate innovation, AWS Glue has recently expanded its third-party application support by introducing native connectors for 19 applications.

To take advantage of these new application connectors for well-defined use cases such as replication and ingestion, AWS Glue is also launching zero-ETL integration support from external applications. With this new functionality, customers can create up-to-date replicas of their data from applications such as Salesforce, ServiceNow, and Zendesk in Amazon SageMaker Lakehouse and Amazon Redshift.

Amazon SageMaker Lakehouse unifies all your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg compatible tools and engines. Because zero-ETL integrates directly with Lakehouse, all the data is automatically cataloged and can be secured through fine-grained permissions in Lake Formation.

What is zero-ETL?

Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL data pipelines. It makes data available in Amazon SageMaker Lakehouse and Amazon Redshift from multiple operational, transactional, and enterprise sources. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to prepare it for analytics, artificial intelligence (AI), and machine learning (ML) workloads. With zero-ETL, you don't need to maintain complex ETL pipelines. We handle the ETL for you by automating the creation and management of data replication.

What is the difference between zero-ETL and Glue ETL?

AWS Glue now offers multiple ways for you to build data integration pipelines, depending on your integration needs.

  • Zero-ETL provides service-managed replication. It's designed for scenarios where customers need a fully managed, efficient way to replicate data from a source to AWS with minimal configuration. Zero-ETL handles the entire replication process, including schema discovery and evolution, without requiring customers to write or manage any custom logic. This approach is ideal for creating up-to-date replicas of source data in near real time, with AWS managing the underlying infrastructure and replication process.
  • Glue ETL offers customer-managed data ingestion. It's the preferred choice when customers need more control and customization over the data integration process or require complex transformations. With Glue ETL, customers can write custom transformation logic, combine data from multiple sources, apply data quality rules, add calculated fields, and perform advanced data cleansing or aggregation. This flexibility makes Glue ETL suitable for scenarios where data must be transformed or enriched before analysis.

It's worth mentioning that source connections are reusable between Glue ETL and Glue zero-ETL, so you can easily support both patterns. After you create a connection once, you can use the same connection across various AWS Glue components, including Glue ETL, Glue Visual ETL, and zero-ETL. For example, you might start by creating a connection and a zero-ETL integration, but decide later to use the same connection to create a custom Glue ETL pipeline.

This blog post explores how zero-ETL capabilities, combined with the new application connectors, are transforming the way businesses integrate and analyze their data from popular platforms such as ServiceNow, Salesforce, Zendesk, SAP, and others.

Use case

Consider a large company that relies heavily on data-driven insights to optimize its customer support processes. The company stores vast amounts of transactional data in ServiceNow. To gain a comprehensive understanding of its business and make informed decisions, the company needs to integrate and analyze data from ServiceNow seamlessly: identifying and addressing problems and root causes, managing service level agreements and compliance, and proactively planning for incident prevention.

The company is looking for an efficient, scalable, and cost-effective solution for collecting and ingesting data from ServiceNow. The solution must provide continuous near real-time replication, automatic availability of new data attributes, robust monitoring capabilities to track data load statistics, and a reliable data lake foundation that supports data versioning. This allows data analysts, data engineers, and data scientists to quickly explore ingested data and develop data products that meet the needs of business teams.

Solution overview

The following architecture illustrates an efficient and scalable solution for collecting and ingesting replicated data from ServiceNow with zero-ETL integration. In this example we use ServiceNow as a source, but the same can be done with any supported source such as Salesforce, Zendesk, SAP, or others. The AWS Glue managed connectors act as a bridge between ServiceNow and the target Amazon SageMaker Lakehouse, enabling seamless, near real-time data flow without the need for custom ETL and scheduling.

The following are the key components and steps in the integration process:

  1. Zero-ETL extracts and loads the data into Amazon S3, a highly scalable object storage service. The data is also registered in the Glue Data Catalog, a metadata repository. Additionally, zero-ETL keeps the data synchronized by capturing changes that occur in ServiceNow and maintains data consistency by automatically performing schema evolution.
  2. Amazon CloudWatch, a monitoring and observability service, collects logs and metrics from the data integration process.
  3. Amazon EventBridge, a serverless event bus service, triggers a downstream process, which lets you build an event-driven architecture that reacts as soon as new data arrives in your target. Through EventBridge, customers can build on top of zero-ETL for a diverse set of use cases.

Prerequisites

Complete the following prerequisites before setting up the solution:

  1. Create a bucket in Amazon S3 called zero-etl-demo-<your AWS account number>-<AWS Region> (for example, zero-etl-demo-012345678901-us-east-1). The bucket will be used to store the data ingested by zero-ETL in Apache Iceberg, an open table format (OTF) that supports ACID transactions (atomicity, consistency, isolation, and durability), seamless schema evolution, and data versioning using time travel.
  2. Create an AWS Glue database <your database name>, such as zero_etl_demo_db, and associate the S3 bucket zero-etl-demo-<your AWS account number>-<AWS Region> as the location of the database. The database will be used to store the metadata related to the data integrations performed by zero-ETL.
  3. Update the AWS Glue Data Catalog settings with an IAM policy that provides fine-grained access control of the Data Catalog for zero-ETL.
  4. Create an AWS Identity and Access Management (IAM) role named zero_etl_demo_role. The IAM role will be used by zero-ETL to access the Glue connector, read from ServiceNow, and write the data into the target. Optionally, you can create two separate IAM roles (one associated with your source data and another associated with your target).
  5. Make sure you have a ServiceNow instance named ServiceNowInstance, a user named ServiceNowUser, and a password ServiceNowPassword with the required permissions to read from ServiceNow. The instance name, user, and password are used in the AWS Glue connection to authenticate with ServiceNow using the BASIC authentication type. Optionally, you can choose OAUTH2 if your ServiceNow instance supports it.
  6. Create the secret zero_etl_demo_secret in AWS Secrets Manager to store the ServiceNow credentials.
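Some of the prerequisites above can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is installed and credentials are configured. It covers only the bucket, the database, and the secret; the IAM role and the Data Catalog settings are not shown. The function names are hypothetical, and because this post does not state which key names the connector expects inside the secret, the secret payload is passed in by the caller rather than assumed.

```python
import json


def demo_bucket_name(account_id: str, region: str) -> str:
    """Compose the bucket name following the naming pattern used in this post."""
    return f"zero-etl-demo-{account_id}-{region}"


def create_prerequisites(account_id: str, region: str, secret_payload: dict,
                         database_name: str = "zero_etl_demo_db",
                         secret_name: str = "zero_etl_demo_secret") -> str:
    """Create the S3 bucket, Glue database, and secret (hypothetical helper)."""
    import boto3  # imported lazily so the naming helper works without the SDK

    bucket = demo_bucket_name(account_id, region)

    # us-east-1 is the one Region where no LocationConstraint is passed.
    s3 = boto3.client("s3", region_name=region)
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

    # Associate the bucket with the database as its location.
    glue = boto3.client("glue", region_name=region)
    glue.create_database(
        DatabaseInput={"Name": database_name, "LocationUri": f"s3://{bucket}/"}
    )

    # Store the ServiceNow credentials; the payload keys are the caller's choice.
    secrets = boto3.client("secretsmanager", region_name=region)
    secrets.create_secret(Name=secret_name, SecretString=json.dumps(secret_payload))
    return bucket
```

Keeping the naming helper separate from the AWS calls makes it easy to reuse the same bucket name later, for example during clean up.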

Build and verify the zero-ETL integration

Complete the following steps to create and validate the zero-ETL integration:

Step 1: Set up a connector

Zero-ETL integration, when used with AWS Glue natively supported application connectors, provides a straightforward way to bring third-party data into an Amazon S3 transactional data lake or Amazon Redshift. Use the following steps to create a ServiceNow data connection:

  1. Open the AWS Glue console.
  2. In the navigation pane, under Data catalog, choose Connections.
  3. Choose Create Connection.
  4. In the Create Connection pane, enter ServiceNow in Data Sources.
  5. Choose ServiceNow.
  6. Choose Next.
  7. For Instance Name, enter ServiceNowInstance (created as part of the prerequisites).
  8. For IAM service role, choose zero_etl_demo_role (created as part of the prerequisites).
  9. For Authentication Type, choose the authentication type that you're using for ServiceNow. In this example, we chose OAUTH2, which requires setting up Application Registries in ServiceNow.
  10. For AWS Secret, choose the secret zero_etl_demo_secret (created as part of the prerequisites).
  11. Choose Next.
  12. In the Connection Properties section, for Name, enter zero_etl_demo_conn.
  13. Choose Next.
  14. Choose Create connection.

  15. A popup from ServiceNow appears after you choose Create connection. Choose Allow.

Step 2: Set up zero-ETL integration

After creating the data connection to ServiceNow, use the following steps to create the zero-ETL integration:

  1. Open the AWS Glue console.
  2. In the navigation pane, under Data catalog, choose Zero-ETL integrations.
  3. Choose Create zero-ETL integration.
  4. In the Create integration pane, enter ServiceNow in Data Sources.
  5. Choose ServiceNow.
  6. Choose Next.
  7. For ServiceNow connection, choose the data connection created in Step 1 (zero_etl_demo_conn).
  8. For Source IAM role, choose zero_etl_demo_role (from the prerequisites).
  9. For ServiceNow objects, choose the objects you want the zero-ETL integration to ingest. For this post, choose the problem and incident objects.
  10. For Namespace or Database, choose <your database name>. In this example, we use zero_etl_demo_db (from the prerequisites).
  11. For Target IAM role, choose zero_etl_demo_role (from the prerequisites).
  12. Choose Next.
  13. For Security and data encryption, you can choose either an AWS managed KMS key or a customer managed key in AWS Key Management Service. For this post, choose Use AWS managed KMS key.
  14. In the Integration details section, for Name, enter zero-etl-demo-integration.
  15. Choose Next.
  16. Review the details and choose Create and launch integration.
  17. The newly created integration will show as Active in about a minute.

Step 3: Verify the initial SEED load

The SEED load refers to the initial loading of the tables that you want to ingest into Amazon SageMaker Lakehouse using zero-ETL integration. The status and statistics of the SEED load are published to CloudWatch, and the data ingested by the zero-ETL integration can be accessed in AWS using services such as Amazon SageMaker Unified Studio, Amazon QuickSight, and others. Use the following steps to access the zero-ETL integration logs and query the data:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Zero-ETL integrations.
  3. In the Zero-ETL integrations section, choose zero-etl-demo-integration.
  4. In the Activity summary (all time) section, choose CloudWatch logs.
  5. Check the CloudWatch log events for the SEED load. For each table ingested by the zero-ETL integration, two groups of logs are created: status and statistics. The statistics appear in IngestionTableStatistics, where insertCount represents how many rows were extracted and loaded by the zero-ETL integration. For the SEED load, you'll always see only insertCount because it's the initial load. In addition, IngestionCompleted contains information about the zero-ETL integration such as status, load type, and message.

To validate the SEED load, query the data using Amazon SageMaker Unified Studio.

  1. Access Amazon SageMaker Unified Studio for your specific Region through the AWS Management Console.
  2. Open the Amazon SageMaker Unified Studio URL.
  3. Sign in with SSO or as an AWS IAM user.
  4. Select your project.
  5. Go to Data in the left menu, expand the Lakehouse AWSDataCatalog, expand your database, and select the incident table. Choose the icon and select Query with Athena.
  6. For Query, enter the following statement:
    SELECT count(*) AS incidents_count
    FROM "zero_etl_demo_db"."incident"

  7. Choose Run.
  8. Next, check an existing incident in ServiceNow. This is the incident whose description you will update in ServiceNow to validate change data capture (CDC). In the query editor pane, for Query, enter the following statement:
    SELECT number
    , short_description
    , description
    FROM "zero_etl_demo_db"."incident"
    WHERE number = 'INC0000003' -- update to your incident number

  9. Choose Run.
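The same validation queries can be run outside the console. The following is a minimal sketch, assuming boto3 is installed and that you have an S3 location where Athena can write query results (the output_location parameter is an assumption and is not part of this post's setup). The helper names are hypothetical. The sketch uses the Athena API directly instead of Amazon SageMaker Unified Studio.

```python
import time


def incident_lookup_sql(incident_number: str,
                        database: str = "zero_etl_demo_db") -> str:
    """Build the single-incident lookup query shown in the steps above."""
    # Accept only alphanumeric incident numbers to keep the literal safe.
    if not incident_number.isalnum():
        raise ValueError(f"unexpected incident number: {incident_number!r}")
    return (
        "SELECT number, short_description, description "
        f'FROM "{database}"."incident" '
        f"WHERE number = '{incident_number}'"
    )


def run_athena_query(sql: str, output_location: str, region: str) -> list:
    """Run a query with Athena and return its rows (the first row is the header)."""
    import boto3  # imported lazily so the SQL helper works without the SDK

    athena = boto3.client("athena", region_name=region)
    query_id = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} finished in state {state}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[col.get("VarCharValue") for col in row["Data"]] for row in rows]
```

Saving the rows returned for your incident before you change it in ServiceNow gives you a baseline to compare against after the CDC load.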

Step 4: Validate CDC

CDC is a technique used to identify and process only the data that has changed in a source system since the last extraction. Instead of reloading an entire dataset, CDC captures and transfers only the new, updated, or deleted records into the target system, making data processing more efficient and reducing load times. The status and statistics of the CDC load are published to CloudWatch. For this post, you'll use Amazon SageMaker Unified Studio to query the ingested data. First, you'll select an incident and update it in ServiceNow, changing its short_description and description. Then use the following steps to access the zero-ETL integration logs and query the ingested data.

  1. To demonstrate a CDC event, in this post we edit one incident and delete one incident in ServiceNow.
  2. Open the AWS Glue console.
  3. In the navigation pane, under Data catalog, choose Zero-ETL integrations.
  4. In the Zero-ETL integrations section, choose zero-etl-demo-integration.
  5. In the Activity summary (all time) section, choose CloudWatch logs.
  6. Zero-ETL integration replicates the changes to the Amazon S3 transactional data lake every 60 minutes by default. Check the CloudWatch log events for the CDC load. In IngestionTableStatistics, review updateCount and deleteCount for each object managed by the zero-ETL integration. These counts reflect the updates and deletes from ServiceNow being applied to the transactional data lake.

To validate the CDC load, query the data using Amazon SageMaker Unified Studio.
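If you export the statistics log events for further checks, the counts can be totalled with a few lines of code. The following is a minimal sketch under a stated assumption: each log message is a JSON object that carries the insertCount, updateCount, and deleteCount fields named above at its top level. This post does not show the actual message layout, so adjust the parsing to match what you see in CloudWatch. The function name is hypothetical.

```python
import json

# Counter fields named in this post for IngestionTableStatistics.
COUNT_FIELDS = ("insertCount", "updateCount", "deleteCount")


def summarize_statistics(messages) -> dict:
    """Total the insert, update, and delete counts across log messages.

    Assumes each message is a JSON object with the counters at the top
    level; messages that do not parse as JSON objects are skipped.
    """
    totals = {field: 0 for field in COUNT_FIELDS}
    for raw in messages:
        try:
            event = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            continue
        if not isinstance(event, dict):
            continue
        for field in COUNT_FIELDS:
            value = event.get(field)
            # Ignore missing or non-integer values (bool is excluded explicitly).
            if isinstance(value, int) and not isinstance(value, bool):
                totals[field] += value
    return totals
```

For the scenario in this post (one incident edited and one deleted), you would expect the CDC load to contribute one update and one delete for the incident object.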

  1. Return to Amazon SageMaker Unified Studio.
  2. For Query, enter the following statement:
    SELECT count(*) AS incidents_count
    FROM "zero_etl_demo_db"."incident"

  3. For Query, enter the following statement and compare its results with the snapshot you recorded before CDC:
    SELECT number
        , short_description
        , description
    FROM "zero_etl_demo_db"."incident"
    WHERE number = 'INC0000003' -- update to your incident number

  4. Choose Run and confirm that the short_description and description attributes of the record were updated.

By following these steps, you can effectively set up, build, and verify a zero-ETL integration using the new AWS Glue application connector for ServiceNow. This process demonstrates the simplicity and efficiency of the zero-ETL approach in integrating application data into your AWS environment.

Apache Iceberg Time Travel: Enhancing data versioning in zero-ETL

One of the benefits of using Apache Iceberg in zero-ETL integration is the ability to perform time travel. This feature lets you access and query historical versions of your data. With Iceberg time travel, you can roll back to previous data states, compare data across different points in time, or recover from accidental data changes. In the context of zero-ETL integrations, this capability becomes particularly valuable when dealing with rapidly changing application data.

To demonstrate this feature, consider a scenario where you're analyzing ServiceNow incident data ingested through zero-ETL integration using Amazon SageMaker Unified Studio. Here's an example that showcases Iceberg time travel:

-- Query incident data as of a particular timestamp before CDC
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
FOR TIMESTAMP AS OF TIMESTAMP '2024-11-06 05:10:00 UTC'
-- update this timestamp value to a time before your CDC update
WHERE number = 'INC0000003'; -- update to your incident number

-- Compare with current data
SELECT number,
    short_description,
    description
FROM "zero_etl_demo_db"."incident"
WHERE number = 'INC0000003'; -- update to your incident number

In this example:

  1. The first query uses the FOR TIMESTAMP AS OF clause for time travel queries on Iceberg tables. It retrieves the incident data as it existed before the CDC update for the specific incident number INC0000003.
  2. The second query fetches the current state of the data for the same incident number.

This capability lets you track the evolution of incidents, identify trends in resolution times, or recover information that may have been inadvertently altered.
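When you run time travel queries repeatedly, it can help to generate the timestamp literal from code rather than typing it by hand. The following is a minimal sketch with a hypothetical helper name. It only builds the query text in the same form as the example above and does not execute it. It requires a timezone-aware datetime and converts it to UTC so that the literal is unambiguous.

```python
from datetime import datetime, timezone


def time_travel_sql(incident_number: str, as_of: datetime,
                    database: str = "zero_etl_demo_db") -> str:
    """Build a FOR TIMESTAMP AS OF query for one incident (hypothetical helper)."""
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    if not incident_number.isalnum():
        raise ValueError(f"unexpected incident number: {incident_number!r}")

    # Normalize to UTC and format like the literal used in this post.
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "SELECT number, short_description, description\n"
        f'FROM "{database}"."incident"\n'
        f"FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'\n"
        f"WHERE number = '{incident_number}'"
    )
```

For example, calling time_travel_sql("INC0000003", datetime(2024, 11, 6, 5, 10, tzinfo=timezone.utc)) produces the first query shown above with the same timestamp literal.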

Clean up

To avoid incurring future charges, clean up the resources used in this post from your AWS account by completing the following steps:

  1. Delete the zero-ETL integration zero-etl-demo-integration.
  2. Delete the content from the S3 bucket zero-etl-demo-<your AWS account number>-<AWS Region>.
  3. Delete the Data Catalog database zero_etl_demo_db.
  4. Delete the Data Catalog connection zero_etl_demo_conn.
  5. Delete the AWS Secrets Manager secret zero_etl_demo_secret.
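Steps 2 through 5 above can be scripted. The following is a minimal sketch with a hypothetical function name, assuming you have already deleted the integration in the console (step 1). The boto3 S3 resource and the Glue and Secrets Manager clients are passed in as arguments, which keeps the function easy to try against stand-in objects before pointing it at a real account.

```python
def clean_up(s3_resource, glue, secrets, bucket_name: str,
             database_name: str = "zero_etl_demo_db",
             connection_name: str = "zero_etl_demo_conn",
             secret_name: str = "zero_etl_demo_secret") -> None:
    """Remove the demo resources in the same order as the steps above."""
    # Step 2: delete the bucket content (current objects only; a versioned
    # bucket would also need its object versions removed).
    s3_resource.Bucket(bucket_name).objects.all().delete()

    # Step 3: delete the Data Catalog database.
    glue.delete_database(Name=database_name)

    # Step 4: delete the Data Catalog connection.
    glue.delete_connection(ConnectionName=connection_name)

    # Step 5: delete the secret. By default, Secrets Manager schedules the
    # deletion with a recovery window instead of removing it immediately.
    secrets.delete_secret(SecretId=secret_name)
```

With boto3, the arguments would typically be boto3.resource("s3"), boto3.client("glue"), and boto3.client("secretsmanager"), created for the Region you used in this post.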

Conclusion

As the pace of business continues to accelerate, the ability to quickly and efficiently integrate data from various applications and enterprise platforms has become a critical competitive advantage. By adopting a zero-ETL integration powered by AWS Glue and its new set of managed connectors, your organization can unlock the full potential of its data across multiple platforms faster and stay ahead of the curve.

To learn more about how Amazon SageMaker Lakehouse can help your organization streamline its data integration efforts, visit Amazon SageMaker Lakehouse.

Get started with zero-ETL on AWS by creating a free account today!


About the authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure, and high-performance data solutions in the cloud.

Vivek Pinyani is a Data Architect at AWS Professional Services with expertise in big data technologies. He focuses on helping customers build robust and performant data analytics solutions and data lake migrations. In his free time, he loves to spend time with his family and enjoys playing cricket and running.

Kartikay Khator is a Solutions Architect within Global Life Sciences at AWS, where he dedicates his efforts to developing innovative and scalable solutions that cater to the evolving needs of customers. His expertise lies in harnessing the capabilities of AWS analytics services. Beyond his professional pursuits, he finds joy and fulfillment in running and hiking. Having already completed several marathons, he is currently preparing for his next marathon challenge.

Caio Sgaraboto Montovani is a Sr. Specialist Solutions Architect, Data Lake and AI/ML within AWS Professional Services, developing scalable solutions according to customer needs. His vast experience has helped customers in different industries such as life sciences and healthcare, retail, banking, and aviation build solutions in data analytics, machine learning, and generative AI. He is passionate about rock and roll and cooking and loves to spend time with his family.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect, and an Amazon MWAA and AWS Glue ETL expert. He's on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on LinkedIn to keep up to date with the latest Amazon MWAA and AWS Glue features and news!
