Streamlining AWS Glue Studio visual jobs: Building an integrated CI/CD pipeline for seamless environment synchronization

November 12, 2024

Many Amazon Web Services (AWS) customers have integrated their data across multiple sources using AWS Glue, a serverless data integration service. By providing seamless integration throughout the development lifecycle, AWS Glue enables organizations to make data-driven business decisions.

AWS Glue Studio visual jobs provide a graphical interface called the visual editor that you can use to author extract, transform, and load (ETL) jobs in AWS Glue visually. The visual editor maintains a visual representation of a job built from a variety of data sources, transformations, and data sinks. With its intuitive interface, you can create large-scale data integration jobs without coding expertise, simplifying workflows and eliminating the need for manual ETL script programming.

As data engineers increasingly rely on the AWS Glue Studio visual editor to create data integration jobs, the need for a streamlined development lifecycle and seamless synchronization between environments has become paramount. Additionally, managing versions of visual directed acyclic graphs (DAGs) is crucial for tracking changes, collaboration, and maintaining consistency across environments.

This post introduces an end-to-end solution that addresses these needs by combining the power of the AWS Glue Visual Job API, a custom AWS Glue Resource Sync Utility, and an AWS CDK based continuous integration and continuous deployment (CI/CD) pipeline.

Some common questions from our customers include:

What are the best practices for moving our workloads from a pre-production environment to production?
What are the recommended best practices for provisioning data integration components?
How can I build AWS Glue visual jobs in the development environment and automatically propagate them to the production account using the CI/CD pipeline?
How can I version control and track changes to my AWS Glue Studio visual jobs?

End-to-end development lifecycle for data integration pipeline

The software development lifecycle on AWS has six phases: plan, design, implement, test, deploy, and maintain, as shown in the following diagram.

SDLC

For more information regarding each component, check out End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue.

AWS Glue Resource Sync Utility

As part of synchronizing AWS Glue visual jobs across different environments, requirements include:

Manage version control of visual DAGs by tracking changes to AWS Glue Studio visual jobs using version control systems such as Git
Promote AWS Glue visual jobs from a pre-production environment to a production environment
Transfer ownership of AWS Glue visual jobs between different AWS accounts
Replicate AWS Glue visual jobs from one AWS Region to another as part of a disaster recovery strategy

The AWS Glue Resource Sync Utility is a Python application developed on top of the AWS Glue Visual Job API, designed to synchronize AWS Glue Studio visual jobs across different accounts without losing the visual representation. It operates by using source and target AWS environment profiles. Optionally, a list of jobs for synchronization can be provided along with a mapping file to replace environment-specific resources.
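
As a rough illustration of what a mapping file accomplishes, the following sketch replaces environment-specific strings in a serialized job definition. The function name apply_mapping, the job fragment, and the account numbers are hypothetical; this is not the utility's actual implementation, only a minimal model of the idea.

```python
import json


def apply_mapping(serialized: str, mapping: dict) -> str:
    """Replace every source-environment string with its target counterpart."""
    # Replace longer keys first so a short key never rewrites part of a longer one.
    for source in sorted(mapping, key=len, reverse=True):
        serialized = serialized.replace(source, mapping[source])
    return serialized


# Hypothetical job fragment and mapping, for illustration only.
job = {
    "Role": "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole",
    "ScriptLocation": "s3://aws-glue-assets-111111111111-us-east-1/scripts/job.py",
}
mapping = {
    "arn:aws:iam::111111111111:role/": "arn:aws:iam::222222222222:role/",
    "s3://aws-glue-assets-111111111111-us-east-1": "s3://aws-glue-assets-222222222222-us-east-1",
}

migrated = json.loads(apply_mapping(json.dumps(job), mapping))
print(migrated["Role"])
print(migrated["ScriptLocation"])
```

Plain string replacement keeps the sketch short; it works here because the mapped values are unambiguous prefixes such as bucket names and role ARNs.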

For more information on the AWS Glue Resource Sync Utility, refer to Synchronize your AWS Glue Studio Visual Jobs to different environments.

Solution overview

As shown in the following diagram, this solution uses three separate AWS accounts. One account is designated for the development environment, another for the production environment, and a third to host the CI/CD infrastructure and pipeline.

Solution Overview

The solution emphasizes version controlling AWS Glue Studio visual jobs by serializing them into JSON files and storing them in a Git repository. As a result, you can:

Track changes to your visual DAGs over time.
Collaborate with team members.
Restore and deploy visual DAGs in different environments seamlessly.
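
To make the version control benefit concrete, here is a minimal sketch of diff-friendly JSON serialization. The DAG below is a hypothetical, heavily simplified stand-in for a real visual job definition, and serialize_dag and deserialize_dag are illustrative names, not functions of the AWS Glue Resource Sync Utility.

```python
import json


def serialize_dag(dag: dict) -> str:
    """Serialize a job definition to stable, diff-friendly JSON."""
    # Sorted keys and a fixed indent make equal inputs produce identical text,
    # so Git diffs show only real changes.
    return json.dumps(dag, indent=2, sort_keys=True) + "\n"


def deserialize_dag(text: str) -> dict:
    """Restore the job definition from its JSON text."""
    return json.loads(text)


# Hypothetical, simplified DAG: a source node feeding a transform node.
dag = {
    "name": "orders-etl",
    "nodes": {
        "node-2": {"type": "ApplyMapping", "inputs": ["node-1"]},
        "node-1": {"type": "S3Source", "inputs": []},
    },
}

text = serialize_dag(dag)
restored = deserialize_dag(text)
print(restored == dag)
```

Because the output is deterministic, re-serializing an unchanged job produces no diff, which keeps the commit history limited to genuine edits.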

The AWS account responsible for hosting the CI/CD pipeline consists of three key components:

Managing AWS Glue Job updates – Provides smooth updates and maintenance of AWS Glue jobs.
Cross-Account Access Management – Allows secure promotion of updates from the development environment to the production environment.
Version Control Integration – Incorporates serialized visual DAGs into the CI/CD pipeline for deployment to target environments.

You can create AWS Glue Studio visual jobs using the intuitive visual editor in your development account. After these jobs are configured, you serialize the visual DAGs into JSON files and commit them to a Git repository. The CI/CD pipeline detects changes to the repository and automatically triggers the deployment process.

The pipeline includes a step where the AWS Glue Resource Sync Utility deserializes the visual DAGs from the committed JSON files and deploys them to the production environment. This approach promotes consistent deployment of jobs while maintaining their visual representation.

The solution uses the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and AWS CDK to streamline deployment across environments. It enables seamless synchronization and consistent versioning of AWS Glue jobs between development and production, preserving visual workflows and reducing manual tasks. The solution consists of two main parts:

Initial steps (one-time setup) – Setting up the development environment, bootstrapping AWS environments, deploying the CI/CD pipeline, and integrating the AWS Glue Resource Sync Utility
Day-to-day development (repeated) – Ongoing activities such as creating visual jobs, serializing them, committing changes to the repository, deploying to production through the pipeline, and verifying the jobs

The solution follows these high-level steps for the initial setup:

Set up the development environment
Bootstrap your AWS environments
Deploy the CI/CD pipeline
Configure AWS developer tools connection to GitHub
Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

The solution follows these high-level steps for the day-to-day development:

Create visual jobs in the development account
Serialize visual jobs
Commit changes to Git repository
Deploy visual jobs to production
Verify visual jobs in production

Prerequisites

Before you begin, make sure you have the following:

GitHub account
Git (git command)
Python 3.9 or later
Package installer for Python (pip command)
AWS CDK Toolkit (cdk command) 2.155.0 or later
AWS CLI configured with appropriate credentials for your accounts
Three AWS accounts:
- Development account
- Production account
- Pipeline account (for hosting the CI/CD pipeline)
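
If you want a quick local check of the command-line prerequisites, a sketch along these lines may help. It only verifies that the commands are on your PATH and that the interpreter is recent enough; it does not check the AWS CDK Toolkit version or your AWS credentials. The helper names are illustrative.

```python
import shutil
import sys

# Command names from the prerequisites list; "aws" is the AWS CLI executable.
REQUIRED_TOOLS = ["git", "pip", "cdk", "aws"]


def missing_tools(tools, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 9)):
    """Check the interpreter against the minimum version from the list."""
    return tuple(version_info[:2]) >= minimum


print("Python 3.9 or later:", python_ok())
print("Missing tools:", missing_tools(REQUIRED_TOOLS))
```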

Technical solution walkthrough

This section provides a detailed guide to setting up and using an automated CI/CD pipeline for AWS Glue Studio visual jobs.

Initial steps (one-time setup)

In this section, we walk through the foundational steps required to establish the CI/CD pipeline for AWS Glue Studio visual jobs. These initial steps set up the required infrastructure and configurations, providing a smooth and automated deployment process across your development and production environments.

Set up the development environment

To set up the development environment, follow these steps:

Fork the aws-glue-cdk-baseline repository
Clone the forked repository:

git clone https://github.com/<YOUR-GITHUB-USERNAME>/aws-glue-cdk-baseline.git

cd aws-glue-cdk-baseline

Create and activate a Python virtual environment:

python3 -m venv .venv

# On Windows, use .venv\Scripts\activate.bat
source .venv/bin/activate

Install required dependencies:

pip install -r requirements.txt

pip install -r requirements-dev.txt

To configure the default settings, edit the default-config.yaml file and replace the placeholders with your AWS account details:
Pipeline account: awsAccountId and awsRegion.
Development account: awsAccountId and awsRegion.
Production account: awsAccountId and awsRegion.
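
A small sanity check on the edited configuration can catch missing values before you deploy. In the sketch below, the key names (pipelineAccount, devAccount, prodAccount, awsAccountId, awsRegion) are the ones used by the scripts in this post; the validation rules and the function name config_problems are illustrative assumptions, and the account numbers are placeholders. The check runs on an already-loaded dictionary, so it does not depend on a YAML library.

```python
REQUIRED_ACCOUNTS = ("pipelineAccount", "devAccount", "prodAccount")
REQUIRED_FIELDS = ("awsAccountId", "awsRegion")


def config_problems(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means it looks fine."""
    problems = []
    for account in REQUIRED_ACCOUNTS:
        section = config.get(account) or {}
        for field in REQUIRED_FIELDS:
            if not section.get(field):
                problems.append(f"{account}.{field} is missing")
        # AWS account IDs are 12 digits; YAML may load them as integers.
        account_id = str(section.get("awsAccountId") or "")
        if account_id and not (account_id.isdigit() and len(account_id) == 12):
            problems.append(f"{account}.awsAccountId must be 12 digits")
    return problems


# Placeholder values for illustration.
config = {
    "pipelineAccount": {"awsAccountId": 333333333333, "awsRegion": "us-east-1"},
    "devAccount": {"awsAccountId": 111111111111, "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": 222222222222, "awsRegion": "us-east-1"},
}
print(config_problems(config))
```

One thing this check surfaces: an account ID with a leading zero loses that digit if YAML parses it as a number, so quoting account IDs in the YAML file is a reasonable precaution.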

Bootstrap your AWS environments

Bootstrapping prepares your AWS accounts for AWS CDK deployments. To bootstrap your AWS environments, run the following commands, replacing placeholders with your account numbers, Regions, and AWS CLI profiles:

# Bootstrap the pipeline account
cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE>

# Bootstrap the development account, trusting the pipeline account
cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>

# Bootstrap the production account, trusting the pipeline account
cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>
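
If you prefer to generate these commands rather than edit placeholders by hand, the pattern can be sketched as follows. The account numbers and profile names are hypothetical, and bootstrap_commands is an illustrative helper; it only builds the command lines shown above and does not run them.

```python
import shlex


def bootstrap_commands(pipeline: dict, targets: list) -> list:
    """Return cdk bootstrap argument lists: pipeline account first, then each target."""

    def base(env: dict) -> list:
        return ["cdk", "bootstrap", f"aws://{env['account']}/{env['region']}",
                "--profile", env["profile"]]

    commands = [base(pipeline)]
    for target in targets:
        # Each target account trusts the pipeline account so the pipeline can deploy into it.
        commands.append(base(target) + ["--trust", pipeline["account"]])
    return commands


# Hypothetical account numbers and AWS CLI profile names.
pipeline = {"account": "333333333333", "region": "us-east-1", "profile": "pipeline"}
targets = [
    {"account": "111111111111", "region": "us-east-1", "profile": "dev"},
    {"account": "222222222222", "region": "us-east-1", "profile": "prod"},
]

commands = bootstrap_commands(pipeline, targets)
for command in commands:
    print(shlex.join(command))
```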

Deploy the CI/CD pipeline

Deploy the pipeline stack to your pipeline account:

cdk deploy --profile <PIPELINE-PROFILE>

This command creates:

The pipeline stack in the pipeline account
The AWS Glue app stack in the development account

Configure AWS developer tools connection to GitHub

To establish a connection between AWS CodePipeline and your GitHub repository, follow these steps:

Create a GitHub connection
In the AWS Management Console for your pipeline account, navigate to AWS CodePipeline
In the navigation pane, choose Connections
Choose Create connection
Select GitHub as the source provider
Authorize the connection
Provide a connection name (such as MyGitHubConnection)
Choose Connect to GitHub
Follow the prompts to authorize AWS CodePipeline to access your GitHub account
Make sure that the connection has access to your forked aws-glue-cdk-baseline repository
Note the connection Amazon Resource Name (ARN)
After the connection is established, note the connection ARN because you’ll need it when configuring the pipeline

Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

To integrate the AWS Glue Resource Sync Utility into the pipeline to automate the synchronization of AWS Glue visual jobs, follow these steps:

Download the sync.py script from the AWS Glue Samples repository:

wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/resource_sync/sync.py \
  -O aws_glue_cdk_baseline/job_scripts/sync.py

Create a new file aws_glue_cdk_baseline/job_scripts/generate_mapping.py with the following content:

import yaml
import json
 
def generate_mapping():
    with open('default-config.yaml', 'r') as config_file:
        config = yaml.safe_load(config_file)
    mapping = {
        f"s3://aws-glue-assets-{config['devAccount']['awsAccountId']}-{config['devAccount']['awsRegion']}": f"s3://aws-glue-assets-{config['prodAccount']['awsAccountId']}-{config['prodAccount']['awsRegion']}",
        f"arn:aws:iam::{config['devAccount']['awsAccountId']}:role/service-role/AWSGlueServiceRole": f"arn:aws:iam::{config['prodAccount']['awsAccountId']}:role/service-role/AWSGlueServiceRole",
        f"s3://dev-glue-data-{config['devAccount']['awsAccountId']}-{config['devAccount']['awsRegion']}": f"s3://prod-glue-data-{config['prodAccount']['awsAccountId']}-{config['prodAccount']['awsRegion']}"
    }
    with open('mapping.json', 'w') as mapping_file:
        json.dump(mapping, mapping_file, indent=2)
 
if __name__ == "__main__":
    generate_mapping()

This script generates a mapping.json file that the sync.py script will use to synchronize the jobs between the development and production environments. The mapping.json file contains the mapping of the development environment assets to the production environment assets:

The s3://aws-glue-assets-* Amazon Simple Storage Service (Amazon S3) bucket contains the AWS Glue Studio visual job definitions
The arn:aws:iam::*:role/service-role/AWSGlueServiceRole AWS Identity and Access Management (IAM) role is used by the AWS Glue Studio jobs to access AWS resources
The s3://dev-glue-data-* and s3://prod-glue-data-* S3 buckets contain scripts and data used by the AWS Glue Studio jobs
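
To preview what the generated mapping looks like without touching any files, the following sketch builds the three entries described above from an in-memory configuration. The account numbers are placeholders and build_mapping is an illustrative name; the sketch assumes each development bucket name carries the development Region.

```python
import json


def build_mapping(config: dict) -> dict:
    """Map development asset names to their production counterparts."""
    dev, prod = config["devAccount"], config["prodAccount"]
    dev_id, prod_id = dev["awsAccountId"], prod["awsAccountId"]
    return {
        # AWS Glue assets bucket
        f"s3://aws-glue-assets-{dev_id}-{dev['awsRegion']}":
            f"s3://aws-glue-assets-{prod_id}-{prod['awsRegion']}",
        # IAM role used by the jobs
        f"arn:aws:iam::{dev_id}:role/service-role/AWSGlueServiceRole":
            f"arn:aws:iam::{prod_id}:role/service-role/AWSGlueServiceRole",
        # Data bucket (assumed to be named with the development Region)
        f"s3://dev-glue-data-{dev_id}-{dev['awsRegion']}":
            f"s3://prod-glue-data-{prod_id}-{prod['awsRegion']}",
    }


# Placeholder account numbers for illustration.
sample_config = {
    "devAccount": {"awsAccountId": "111111111111", "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": "222222222222", "awsRegion": "us-west-2"},
}

sample_mapping = build_mapping(sample_config)
print(json.dumps(sample_mapping, indent=2))
```

A quick check worth making on any real mapping is that no target value still contains the development account number.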

Update the aws_glue_cdk_baseline/pipeline_stack.py file to include a step that deserializes the JSON file and deploys the AWS Glue jobs to the production environment:

from typing import Dict
import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_iam as iam
)
from constructs import Construct
from aws_cdk.pipelines import CodePipeline, CodePipelineSource, CodeBuildStep
from aws_glue_cdk_baseline.glue_app_stage import GlueAppStage
 
GITHUB_REPO = "YOUR-GITHUB-USERNAME/aws-glue-cdk-baseline"
GITHUB_BRANCH = "main"
GITHUB_CONNECTION_ARN = "YOUR-GITHUB-CONNECTION-ARN"
 
class PipelineStack(Stack):
 
    def __init__(self, scope: Construct, construct_id: str, config: Dict, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        source = CodePipelineSource.connection(
            GITHUB_REPO,
            GITHUB_BRANCH,
            connection_arn=GITHUB_CONNECTION_ARN
        )
 
        pipeline = CodePipeline(self, "GluePipeline",
            pipeline_name="GluePipeline",
            cross_account_keys=True,
            docker_enabled_for_synth=True,
            synth=CodeBuildStep("CdkSynth",
                input=source,
                install_commands=[
                    "pip install -r requirements.txt",
                    "pip install -r requirements-dev.txt",
                    "npm install -g aws-cdk",
                ],
                commands=[
                    "cdk synth",
                ]
            )
        )
 
        # Add development stage
        dev_stage = GlueAppStage(self, "DevStage", config=config, stage="dev",
            env=cdk.Environment(
                account=str(config['devAccount']['awsAccountId']),
                region=config['devAccount']['awsRegion']
            ))
        pipeline.add_stage(dev_stage)

        # Add production stage
        prod_stage = GlueAppStage(self, "ProdStage", config=config, stage="prod",
            env=cdk.Environment(
                account=str(config['prodAccount']['awsAccountId']),
                region=config['prodAccount']['awsRegion']
            ))
        pipeline.add_stage(prod_stage)
 
        # Glue Resource Sync as a separate step in the pipeline
        pipeline.add_wave("GlueJobSync").add_post(CodeBuildStep("GlueJobSync",
            input=source,
            commands=[
                "python $(pwd)/aws_glue_cdk_baseline/job_scripts/generate_mapping.py",
                "python aws_glue_cdk_baseline/job_scripts/sync.py "
                   "--dst-role-arn arn:aws:iam::{0}:role/GlueCrossAccountRole-prod "
                   "--dst-region {1} "
                   "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json "
                   "--config-path mapping.json "
                   "--targets job,catalog "
                   "--skip-prompt".format(
                       config['prodAccount']['awsAccountId'],
                       config['prodAccount']['awsRegion']
                   ),
            ],
            role_policy_statements=[
                iam.PolicyStatement(
                    actions=[
                        "sts:AssumeRole",
                    ],
                    resources=["*"]
                )
            ]
        ))

Replace the placeholders in the pipeline_stack.py file with your values:

GITHUB_REPO with the name of your GitHub repository
GITHUB_BRANCH with the name of the branch you want to use for the pipeline
GITHUB_CONNECTION_ARN with the ARN of the GitHub connection you created in Step 4

Update the aws_glue_cdk_baseline/glue_app_stack.py file to create a cross-account role with the required permissions to access the development environment resources:

    self.cross_account_role = self.create_cross_account_role(
        f"GlueCrossAccountRole-{stage}",
        str(config['pipelineAccount']['awsAccountId'])
    )
 
    def create_cross_account_role(self, role_name: str, trusted_account_id: str):
        return iam.Role(self, f"{role_name}CrossAccountRole",
            role_name=role_name,
            assumed_by=iam.AccountPrincipal(trusted_account_id),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AdministratorAccess")]
        )
 
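    # On the stack class (glue_app_stack.py): expose the role ARN
    @property
    def cross_account_role_arn(self):
        return self.cross_account_role.role_arn

    # On the stage class (presumably GlueAppStage in glue_app_stage.py): delegate to the stack
    @property
    def cross_account_role_arn(self):
        return self.glue_app_stack.cross_account_role_arn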

Check the andreimaksimov/aws-glue-cdk-baseline repository for a complete diff.

Commit your changes to the repository:

git add aws_glue_cdk_baseline/job_scripts/sync.py
git add aws_glue_cdk_baseline/job_scripts/generate_mapping.py
git add aws_glue_cdk_baseline/pipeline_stack.py
git add aws_glue_cdk_baseline/glue_app_stack.py

git commit -m "Integrate Glue Resource Sync Utility into the pipeline"

git push

Day-to-day development (repeated)

With the initial setup complete, you can now proceed with your regular development activities. This section outlines the steps you’ll repeat during your day-to-day work to develop, version control, and deploy AWS Glue visual jobs.

Create visual jobs in the development account

In this step, you’ll use AWS Glue Studio to create and configure your visual jobs within the development environment.

In your development account, open AWS Glue Studio
To create a new visual job, choose Create job
Choose Visual with a blank canvas and use the visual editor to design your ETL job
Configure the job settings:
Job name: Provide a meaningful name
IAM role: Select an IAM role with necessary permissions
Other configurations: Adjust as needed
To save the job, choose Save

Repeat these steps to create more jobs as required.

Serialize visual jobs

To serialize your visual jobs to enable version control and prepare them for deployment, follow these steps:

Run the AWS Glue Resource Sync Utility:

python sync.py \
  --src-role-arn arn:aws:iam::<DEV-ACCOUNT-NUMBER>:role/GlueCrossAccountRole-dev \
  --src-region <DEV-REGION> \
  --serialize-to-file resources.json \
  --targets job,catalog \
  --skip-prompt

Replace <DEV-ACCOUNT-NUMBER> with your development account number
Replace <DEV-REGION> with your development Region (for example, us-east-1)
Verify the serialized file:
Locate the JSON file in aws_glue_cdk_baseline/resources/
Make sure that it contains the definitions of your visual jobs
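
Before committing, you can confirm that the serialized file is at least well-formed and non-empty. The sketch below deliberately assumes nothing about the internal structure that sync.py writes; it only parses the file and reports the top-level type and size. The function name summarize_serialized is illustrative, and the demo uses a temporary stand-in file with made-up content rather than a real resources.json.

```python
import json
import tempfile
from pathlib import Path


def summarize_serialized(path):
    """Parse the file and return (top-level type name, item count)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not data:
        raise ValueError(f"{path} is valid JSON but empty")
    size = len(data) if isinstance(data, (dict, list)) else 1
    return type(data).__name__, size


# Demo on a temporary stand-in file; point the function at your own
# serialized file (for example, under aws_glue_cdk_baseline/resources/) instead.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "resources.json"
    sample.write_text(json.dumps({"jobs": [{"Name": "orders-etl"}]}), encoding="utf-8")
    summary = summarize_serialized(sample)

print(summary)
```

A truncated or hand-edited file fails in json.loads with a line and column number, which is usually enough to locate the problem.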

Commit changes to Git repository

To commit changes to the Git repository, follow these steps:

Add the serialized resources to Git:

git add aws_glue_cdk_baseline/resources/resources.json

Commit your changes:

git commit -m "Add serialized Glue Visual Jobs"

Push to GitHub:

git push

This action triggers the CI/CD pipeline.

Deploy visual jobs to production

The CI/CD pipeline automatically performs the following steps:

Synthesize the AWS CDK application
Deploy to the development environment
Deploy to the production environment
Run the AWS Glue Resource Sync Utility

The following screenshot shows the CI/CD pipeline.

CICD Pipeline

Verify visual jobs in production

After the pipeline has completed the deployment, it’s important to verify that the visual jobs are correctly reflected in the production environment. To do so, follow these steps:

In the production account, open the AWS Glue Studio console
Verify the deployed jobs:
Make sure that the visual jobs are present
Open each job to confirm that the visual DAGs are preserved
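
The console check can be complemented with a scripted one. The following sketch lists jobs whose definition still carries a visual DAG, on the assumption that the AWS Glue GetJobs response exposes the DAG in each job's CodeGenConfigurationNodes field. The function takes a Glue client as a parameter, so it can be exercised without AWS access; visual_job_names is an illustrative name.

```python
def visual_job_names(glue_client):
    """Return the sorted names of jobs that include a visual DAG."""
    names = []
    paginator = glue_client.get_paginator("get_jobs")
    for page in paginator.paginate():
        for job in page.get("Jobs", []):
            # Assumption: visual jobs carry their DAG in CodeGenConfigurationNodes;
            # script-only jobs lack this field or leave it empty.
            if job.get("CodeGenConfigurationNodes"):
                names.append(job["Name"])
    return sorted(names)


# Usage with boto3 (requires credentials for the production account;
# the profile name is a placeholder):
#   import boto3
#   session = boto3.Session(profile_name="<PROD-PROFILE>")
#   print(visual_job_names(session.client("glue")))
```

Comparing this list with the job names you serialized in the development account gives a quick signal that nothing was dropped or converted to a script-only job along the way.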

By following these steps in your day-to-day workflow, you make sure that your AWS Glue visual jobs are version-controlled and consistent across environments, and that your production environment reflects the latest tested changes.

Version control for AWS Glue visual jobs

By serializing AWS Glue Studio visual jobs to JSON files and committing them to a Git repository, you enable version control for your data integration workflows. By following this approach you can:

Track Changes – Monitor changes to your AWS Glue jobs over time
Collaborate – Work with team members on developing and refining jobs
Restore and deploy – Easily restore jobs in other accounts or environments

The serialization and deserialization steps are integral to your development and deployment process, making sure that all changes are captured and seamlessly propagated.

Conclusion

By combining the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and an AWS CDK based CI/CD pipeline, we’ve crafted a comprehensive solution for managing AWS Glue Studio visual jobs across different environments. This integrated approach offers several benefits:

Version control integration – Manage and track changes to your AWS Glue visual jobs using Git, enabling collaboration and change tracking
Streamlined development – Easily develop and test AWS Glue jobs using the Visual Editor in the development environment
Automated deployment – Use a CI/CD pipeline to automatically deploy serialized visual DAGs to the production environment
Environment consistency – Promote consistency across development and production environments by using the same job definitions
Visual representation preservation – Maintain the visual DAG representation when synchronizing jobs between environments

This solution empowers data engineers to focus on building robust data integration pipelines while automating the complexities of managing and deploying AWS Glue Studio visual jobs across multiple environments.

We encourage you to try this solution and adapt it to your needs. As always, we welcome your feedback and suggestions for further improvements.

About the Authors

Andrei Maksimov is an AWS Senior Cloud Infrastructure Architect specializing in cloud infrastructure, software development, and DevOps. He designs and implements scalable, secure, and efficient cloud solutions and helps customers optimize their cloud environments. Outside of work, Andrei enjoys participating in hackathons, contributing to open source projects, and exploring the latest advancements in AI. You can connect with him on LinkedIn.

David Zhang is an AWS Data Architect specializing in designing and implementing analytics infrastructure, data management, ETL, and extensive data systems. He helps customers modernize their AWS data platforms. David is also an active speaker at AWS conferences and a contributor to technical content and open source projects. He enjoys playing volleyball, tennis, and weightlifting in his free time. Feel free to connect with him on LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He’s responsible for designing AWS solutions, implementing software artifacts, and helping with customer architectures. In his spare time, he enjoys watching anime on Prime Video. You can connect with him on LinkedIn.