
Implement data parity at scale for data modernization programs using AWS Glue Data Quality


Today, customers are embarking on data modernization programs by migrating on-premises data warehouses and data lakes to the AWS Cloud to take advantage of the scale and advanced analytical capabilities of the cloud. Customers are migrating their on-premises data warehouse solutions built on databases like Netezza, PostgreSQL, Greenplum, and Teradata to AWS-based modern data platforms using services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. AWS-based modern data platforms help you break down data silos and enable analytics and machine learning (ML) use cases at scale.

During migration, you may want to establish data parity checks between the on-premises databases and the AWS data platform services. Data parity is a process to validate that data was migrated successfully from source to target without any errors or failures. A successful data parity check means that the data in the target platform has the equivalent content, values, and completeness as the source platform.

Data parity can help build confidence and trust with business users in the quality of the migrated data. Additionally, it can help you identify errors in the new cloud-based extract, transform, and load (ETL) process.

Some customers build custom in-house data parity frameworks to validate data during migration. Others use open source data quality products for data parity use cases. These options involve a lot of custom code, configuration, and setup, and have scalability challenges. This diverts important person-hours from the actual migration effort into building and maintaining a data parity framework.

In this post, we show you how to use AWS Glue Data Quality, a feature of AWS Glue, to establish data parity during data modernization and migration programs with minimal configuration and infrastructure setup. AWS Glue Data Quality enables you to automatically measure and monitor the quality of your data in data repositories and AWS Glue ETL pipelines.

Overview of solution

In large data modernization projects that migrate from an on-premises database to an Amazon S3 based data lake, it's common to have the following requirements for data parity:

  • Compare one-time historical data from the source on-premises database to the target S3 data lake.
  • Compare ongoing data that is replicated from the source on-premises database to the target S3 data lake.
  • Compare the output of the new cloud-based ETL process with the existing on-premises ETL process. You can plan a period of parallel runs, where the legacy and new systems run in parallel and the data is compared daily.
  • Use functional queries to compare high-level aggregated business metrics between the source on-premises database and the target data lake.

In this post, we use an example of a PostgreSQL migration from an on-premises database to an S3 data lake using AWS Glue Data Quality.

The following diagram illustrates this use case's historical data migration architecture.

The architecture shows a common pattern for migrating on-premises databases (like PostgreSQL) to an Amazon S3 based data lake. The workflow includes the following steps:

  1. Schemas and tables are stored in an on-premises database (PostgreSQL), and you want to migrate to Amazon S3 for storage and AWS Glue for compute.
  2. Use AWS Database Migration Service (AWS DMS) to migrate historical data from the source database to an S3 staging bucket.
  3. Use AWS Glue ETL to curate data from the S3 staging bucket to an S3 curated bucket. In the curated bucket, AWS Glue tables are created using AWS Glue crawlers or an AWS Glue ETL job.
  4. Use an AWS Glue connection to connect AWS Glue with the on-premises PostgreSQL database.
  5. Use AWS Glue Data Quality to compare historical data from the source database to the target S3 bucket and write results to a separate S3 bucket.

The following diagram illustrates the incremental data migration architecture.

After historical data is migrated and validated, the workflow proceeds to the following steps:

  1. Ingest incremental data from the source systems to the S3 staging bucket. This is done using an ETL ingestion tool like AWS Glue.
  2. Curate incremental data from the S3 staging bucket to the S3 curated bucket using AWS Glue ETL.
  3. Compare the incremental data using AWS Glue Data Quality.

In the following sections, we demonstrate how to use AWS Glue Data Quality to establish data parity between the source (PostgreSQL) and the target (Amazon S3). We cover the following scenarios:

  • Establish data parity for historical data migration – Historical data migration is defined as a one-time bulk migration of historical data from legacy on-premises databases to the AWS Cloud. The data parity process confirms the validity of the migrated historical data.
  • Establish data parity for incremental data – After the historical data migration, incremental data is loaded to Amazon S3 using the new cloud-based ETL process. The incremental data is compared between the legacy on-premises database and the AWS Cloud.
  • Establish data parity using functional queries – We perform business- and functional-level checks using SQL queries on the migrated data.

Prerequisites

You need to set up the following prerequisite resources:

Establish data parity for historical data migration

For historical data migration and parity, we assume that setting up a PostgreSQL database, migrating data to Amazon S3 using AWS DMS, and data curation have been completed as prerequisites to performing data parity using AWS Glue Data Quality. For this use case, we use an on-premises PostgreSQL database with historical data loaded on Amazon S3 and AWS Glue. Our goal is to compare historical data between the on-premises database and the AWS Cloud.

We use the following tables in the on-premises PostgreSQL database. These have been migrated to the AWS Cloud using AWS DMS. As part of data curation, the following three additional columns have been added to the test_schema.sample_data table in the curated layer: id, etl_create_ts, and etl_user_id.

  1. Create sample_data with the following code:
create table test_schema.sample_data
(
    job              text,
    company          text,
    ssn              text,
    residence        text,
    current_location text,
    website          text,
    username         text,
    name             text,
    gender_id        integer,
    blood_group_id   integer,
    address          text,
    mail             text,
    birthdate        date,
    insert_ts        timestamp with time zone,
    update_ts        timestamp with time zone
);

  2. Create gender with the following code (contains gender details for lookup):
create table test_schema.gender
(
    gender_id integer,
    value     varchar(1)
);

  3. Create blood_group with the following code (contains blood group information for lookup):
create table test_schema.blood_group
(
    blood_group_id integer,
    value          varchar(10)
);

We assume the preceding tables have been migrated to the S3 staging bucket using AWS DMS and curated using AWS Glue. For detailed instructions on how to set up AWS DMS to replicate data, refer to the appendix at the end of this post.

In the following sections, we showcase how to configure an AWS Glue Data Quality job for the comparison.

Create an AWS Glue connection

AWS Glue Data Quality uses an AWS Glue connection to connect to the source PostgreSQL database. Complete the following steps to create the connection:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Connections.
  2. Choose Create connection.
  3. Set Connector type as JDBC.
  4. Add connection details like the connection URL, credentials, and networking details.

Refer to AWS Glue connection properties for more details.

AWS Glue 4.0 uses PostgreSQL JDBC driver 42.3.6. If your PostgreSQL database requires a different JDBC driver version, download the JDBC driver corresponding to your PostgreSQL version.
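If you prefer to script the connection instead of using the console, the following is a minimal boto3 sketch of the same step. The connection name, JDBC URL, credentials, and networking values are placeholders (assumptions), not the values used in this post.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumption: adjust the Region

# Create a JDBC connection to the on-premises PostgreSQL database.
# All names, the URL, the credentials, and the networking values are placeholders.
glue.create_connection(
    ConnectionInput={
        "Name": "postgres-onprem-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://onprem-host:5432/sourcedb",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-with-secret",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)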

Create an AWS Glue data parity job for historical data comparison

As part of the preceding steps, you used AWS DMS to pull historical data from PostgreSQL to the S3 staging bucket. You then used an AWS Glue notebook to curate data from the staging bucket to the curated bucket and created AWS Glue tables. As part of this step, you use AWS Glue Data Quality to compare data between PostgreSQL and Amazon S3 to confirm the data is valid. Complete the following steps to create an AWS Glue job using the AWS Glue visual editor that compares data between PostgreSQL and Amazon S3:

  1. Set the source as the PostgreSQL table sample_data.
  2. Set the target as the AWS Glue table sample_data.

  3. In the curated layer, we added a few additional columns: id, etl_create_ts, and etl_user_id. Because these columns are newly created, we use a transform to drop them for the comparison.
  4. Additionally, the birthdate column is a timestamp in AWS Glue, so we convert it to date format prior to the comparison.

  5. Choose Evaluate Data Quality in Transforms.
  6. Specify the AWS Glue Data Quality rule as DatasetMatch, which checks whether the data in the primary dataset matches the data in a reference dataset.
  7. Provide the unique key (primary key) information for the source and target. In this example, the primary key is a combination of the columns job and username. A minimal sketch of the resulting ruleset follows this step.
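Behind the visual editor, the Evaluate Data Quality transform is driven by a Data Quality Definition Language (DQDL) ruleset. The following is a minimal sketch of what the DatasetMatch rule could look like for this composite key; the "reference" alias is an assumption and must match the alias of the reference node in your job.

# Minimal sketch of the DQDL ruleset used by the Evaluate Data Quality transform.
# "reference" is an assumed alias for the reference dataset node; the other input
# is treated as the primary dataset. Adjust the alias and threshold to your job.
dataset_match_ruleset = """
Rules = [
    DatasetMatch "reference" "job,username" = 1.0
]
"""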

  8. For Data quality transform output, specify the data to output:
    1. Original data – This output includes all rows and columns of the original data. In addition, you can select Add new columns to indicate data quality errors. This option adds metadata columns to each row that can be used to identify valid and invalid rows and the rules that failed validation. You can further customize the row-level output to select only valid rows or convert the table format based on the use case.
    2. Data quality results – This is a summary output grouped by rule. For our data parity example, this output will have one row with a summary of the match percentage.

  9. Configure the Amazon S3 targets for ruleOutcomes and rowLevelOutcomes to write the AWS Glue Data Quality output to the Amazon S3 location in Parquet format.

  10. Save and run the AWS Glue job.
  11. When the AWS Glue job is complete, you can run an AWS Glue crawler to automatically create the rulesummary and row_level_output tables and view the output in Amazon Athena.
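Steps 10 and 11 can also be scripted. The following is a minimal boto3 sketch; the job and crawler names are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

# Start the data parity job (the job name is a placeholder).
run = glue.start_job_run(JobName="historical-data-parity-job")
print("Started job run:", run["JobRunId"])

# After the job completes, crawl the Data Quality output location so the
# rulesummary and row_level_output tables become queryable in Athena.
glue.start_crawler(Name="gluedq-blog-output-crawler")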

The following is an example of the rule-level output. The screenshot shows the DatasetMatch value as 1.0, which means all rows between the source PostgreSQL database and the target data lake matched.

The following is an example of the row-level output. The screenshot shows all source rows along with additional columns that indicate whether a row passed or failed validation.

Let's update a few records in PostgreSQL to simulate a data issue during the data migration process:

update test_schema.sample_data set residence = null where blood_group_id = 8
Records updated: 1,272

You can rerun the AWS Glue job and observe the output in Athena. In the following screenshot, the new match percentage is 87.27%. With this example, you were able to catch the simulated data issue with AWS Glue Data Quality.

If you run the following query, the output will match the failed record count in the preceding screenshot:

SELECT count(*) FROM "gluedqblog"."rowleveloutput" WHERE dataqualityevaluationresult = 'Failed'
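The same check can be run programmatically. The following is a minimal boto3 Athena sketch; the database and table names are the ones used in this post, while the query results bucket is a placeholder.

import boto3

athena = boto3.client("athena")

# Count the rows that failed the DatasetMatch rule.
# The query results bucket below is a placeholder.
query = """
SELECT count(*)
FROM "gluedqblog"."rowleveloutput"
WHERE dataqualityevaluationresult = 'Failed'
"""
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "gluedqblog"},
    ResultConfiguration={"OutputLocation": "s3://gluedq-blog-athena-results/"},
)
print("Query execution ID:", response["QueryExecutionId"])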

Establish data parity for incremental data

After the initial historical migration, the next step is to implement a process to validate incremental data between the legacy on-premises database and the AWS Cloud. For incremental data validation, the data output from the existing ETL process and the new cloud-based ETL process is compared daily. You can add a filter to the preceding AWS Glue data parity job to select the data that changed on a given day using a timestamp column.
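For example, if both datasets carry an update_ts column, the daily comparison could filter each side to the run date before the DatasetMatch evaluation. The following is a minimal PySpark sketch under that assumption; the column name, the run date, and the DataFrames it is applied to are placeholders.

from pyspark.sql import DataFrame, functions as F

def filter_to_run_date(df: DataFrame, run_date: str) -> DataFrame:
    """Keep only records that changed on the day being compared.

    Assumption: both the source and the curated datasets expose an update_ts
    timestamp column that marks when a record last changed.
    """
    return df.filter(F.to_date("update_ts") == F.lit(run_date))

# Usage sketch: apply to both sides before the DatasetMatch evaluation, e.g.
# source_daily_df = filter_to_run_date(source_df, "2024-11-22")
# target_daily_df = filter_to_run_date(target_df, "2024-11-22")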

Establish data parity using functional queries

Functional queries are SQL statements that business analysts can run in the legacy system (for this post, an on-premises database) and the new AWS Cloud based data lake to compare data metrics and output. To confirm that consumer applications work correctly with the migrated data, it's important to validate the data functionally. The preceding examples are primarily technical validation to confirm there is no data loss in the target data lake after data ingestion, in both the historical migration and change data capture (CDC) contexts. In a typical data warehouse migration use case, the historical migration pipeline usually pulls data from the data warehouse, whereas the incremental or CDC pipeline integrates the actual source systems that feed the data warehouse.

Functional data parity is the third step in the overall data validation framework, where you have the flexibility to continue similar business metrics validation driven by an aggregated SQL query. You can construct your own business metrics validation query, ideally working with subject matter experts (SMEs) from the business side. We have seen that agility and precision matter for a successful data warehouse migration; therefore, reusing the time-tested, business SME-approved aggregated SQL queries from the legacy data warehouse system with minimal changes can fast-track the implementation and maintain business confidence. In this section, we demonstrate how to implement a sample functional parity check for a given dataset.

In this example, we use a set of source PostgreSQL tables and target S3 data lake tables for comparison. We use an AWS Glue crawler to create Data Catalog tables for the source tables, as described in the first example.

The sample functional validation compares the distribution count of gender and blood group for each company. This could be any functional query that joins fact and dimension tables and performs aggregations.

You can use a SQL transform to generate an aggregated dataset for both the source and target queries. In this example, the source query uses multiple tables. Apply SQL functions on the columns and the required filter criteria.
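As an illustration, an aggregation along the following lines could back the SQL transforms on both branches. This is a hypothetical sketch based on the sample schema in this post; the actual query in your job may differ.

# Hypothetical aggregation for the SQL transform on both the source and target
# branches: distribution count of gender and blood group per company.
functional_query = """
SELECT s.company,
       g.value  AS gender,
       b.value  AS blood_group,
       COUNT(*) AS record_count
FROM   sample_data s
JOIN   gender g      ON s.gender_id = g.gender_id
JOIN   blood_group b ON s.blood_group_id = b.blood_group_id
GROUP  BY s.company, g.value, b.value
"""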

The following screenshot illustrates the Source Functional Query transform.

The following screenshot illustrates the Target Functional Query transform.

The following screenshot illustrates the Evaluate Data Quality transform. You can apply the DatasetMatch rule to require a 100% match.

After the job runs, you can find the job run status on the AWS Glue console.

The Data quality tab displays the data quality results.

AWS Glue Data Quality provides row-level and rule-level outputs, as described in the earlier examples.

Check the rule-level output in the Athena table. The outcome of the DatasetMatch rule shows a 100% match between the PostgreSQL source dataset and the target data lake.

Check the row-level output in the Athena table. The following screenshot displays the row-level output with the data quality evaluation results and rule status.

Let's change the company value for Spencer LLC to Spencer LLC – New to simulate the impact on the data quality rule and the overall results. This creates a gap in the record count for the given company name when comparing source and target.

By rerunning the job and checking the AWS Glue Data Quality results, you will find that the data quality rule has failed. This is due to the difference in company name between the source and target datasets, because the data quality rule evaluation requires a 100% match. You can reduce the match percentage in the data quality expression based on the required threshold.
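For example, if small, explainable differences are acceptable, the rule could be relaxed from an exact match to a threshold. The 0.9 value below is only an illustration, not a recommendation from this post.

# Relaxed version of the rule sketch: pass as long as at least 90% of records match.
relaxed_ruleset = """
Rules = [
    DatasetMatch "reference" "job,username" >= 0.9
]
"""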

Next, revert the changes made for the data quality rule failure simulation.


If you rerun the job and validate the AWS Glue Data Quality results, you will find that the data quality score is back to 100%.


Clean up

If you no longer want to keep the resources you created as part of this post in your AWS account, complete the following steps:

  1. Delete the AWS Glue notebook and visual ETL jobs.
  2. Remove all data and delete the staging and curated S3 buckets.
  3. Delete the AWS Glue connection to the PostgreSQL database.
  4. Delete the AWS DMS replication task and replication instance.
  5. Delete the Data Catalog databases and tables created for this post.

Conclusion

In this post, we discussed how you can use AWS Glue Data Quality to build a scalable data parity pipeline for data modernization programs. AWS Glue Data Quality enables you to maintain the quality of your data by automating many of the manual tasks involved in data quality monitoring and management. This helps prevent bad data from entering your data lakes and data warehouses. The examples in this post provided an overview of how to set up historical, incremental, and functional data parity jobs using AWS Glue Data Quality.

To learn more about AWS Glue Data Quality, refer to Evaluating data quality with AWS Glue Studio and AWS Glue Data Quality. To dive into the AWS Glue Data Quality APIs, see the Data Quality API.

Appendix

In this section, we demonstrate how to set up AWS DMS and replicate data. You can use AWS DMS to copy one-time historical data from the PostgreSQL database to the S3 staging bucket. Complete the following steps:

  1. On the AWS DMS console, under Migrate data in the navigation pane, choose Replication instances.
  2. Choose Create replication instance.
  3. Choose a VPC that has connectivity to the PostgreSQL instance.

After the instance is created, it should appear with the status Available on the AWS DMS console.

  4. Based on our solution architecture, you now create an S3 staging bucket for AWS DMS to write the replicated output. For this post, the staging bucket name is gluedq-blog-dms-staging-bucket.
  5. Under Migrate data in the navigation pane, choose Endpoints.
  6. Create a source endpoint for the PostgreSQL connection.

  7. After you create the source endpoint, choose Test endpoint to confirm it connects successfully to the PostgreSQL instance.
  8. Similarly, create a target endpoint with the S3 staging bucket as the target and test the target endpoint.

  9. We'll be writing the replicated output from PostgreSQL in CSV format. Set the addColumnName=true property in the AWS DMS target endpoint configuration to make sure the schema information is written as headers in the CSV output. A sketch of this endpoint configuration follows this step.
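If you script the endpoints with boto3 instead of the console, the S3 target endpoint could be configured along these lines. The service role ARN and bucket folder below are placeholders (assumptions).

import boto3

dms = boto3.client("dms")

# Target endpoint that writes CSV output with column headers (AddColumnName).
# The service role ARN and the bucket folder are placeholders.
dms.create_endpoint(
    EndpointIdentifier="gluedq-blog-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "BucketName": "gluedq-blog-dms-staging-bucket",
        "BucketFolder": "staging_layer",
        "AddColumnName": True,
    },
)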

Now you're ready to create the migration task.

  10. Under Migrate data in the navigation pane, choose Database migration tasks.
  11. Create a new replication task.
  12. Specify the source and target endpoints you created and choose the tables that need to be replicated, as shown in the sketch following this list.
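The same task can be created with boto3. The following is a minimal sketch; the ARNs are placeholders for the endpoints and replication instance created earlier, and the full-load migration type matches the one-time historical copy described in this post.

import json
import boto3

dms = boto3.client("dms")

# Full-load task that copies every table in test_schema to the S3 target.
# All ARNs are placeholders for the resources created in the previous steps.
dms.create_replication_task(
    ReplicationTaskIdentifier="gluedq-blog-historical-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-test-schema",
            "object-locator": {"schema-name": "test_schema", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)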

After the replication task is created, it will start replicating data automatically.

When the status shows as Load complete, data should appear in the following S3 locations (the bucket name in this example is a placeholder):

  • s3://<gluedq-blog-dms-staging-bucket>/staging_layer/test_schema/sample_data/
  • s3://<gluedq-blog-dms-staging-bucket>/staging_layer/test_schema/gender/
  • s3://<gluedq-blog-dms-staging-bucket>/staging_layer/test_schema/blood_group/

About the Authors

Himanshu Sahni is a Senior Data Architect in AWS Professional Services. Himanshu specializes in building data and analytics solutions for enterprise customers using AWS tools and services. He is an expert in AI/ML and big data tools like Spark, AWS Glue, and Amazon EMR. Outside of work, Himanshu likes playing chess and tennis.

Arunabha Datta is a Senior Data Architect at AWS Professional Services. He collaborates with customers and partners to design and implement modern data architectures using AWS analytics services. Arunabha's passion lies in assisting customers with digital transformation, particularly in the areas of data lakes, databases, and AI/ML technologies. Besides work, his hobbies include photography, and he likes to spend quality time with his family.

Charishma Ravoori is an Associate Data & ML Engineer at AWS Professional Services. She focuses on developing solutions for customers that include building data pipelines, developing predictive models, and generating AI chatbots using AWS/Amazon tools. Outside of work, Charishma likes to experiment with new recipes and play the guitar.
