
Access private code repositories for installing Python dependencies on Amazon MWAA


Customers who use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) often need Python dependencies that are hosted in private code repositories. Many customers opt for public network access mode for its ease of use and ability to make outbound internet requests, all while maintaining secure access. However, private code repositories may not be accessible via the internet. It's also a best practice to only install Python dependencies where they are needed. You can use Amazon MWAA startup scripts to selectively install the Python dependencies required for running code on workers, while avoiding issues due to web server restrictions.

This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository that is only accessible from your virtual private cloud (VPC).

Solution overview

This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to Amazon MWAA with AWS CodeArtifact for Python dependencies.

The Amazon MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, schedulers, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment such as the one shown in the following diagram, where the environment is using public network access mode for the web servers, and the Apache Airflow workers and schedulers don't have a path to the internet from your VPC.

[Diagram: Amazon MWAA architecture with private routing and public web server access mode]

There are up to four possible networking configurations for an Amazon MWAA environment:

  • Public routing and public web server access mode
  • Private routing and public web server access mode (pictured in the preceding diagram)
  • Public routing and private web server access mode
  • Private routing and private web server access mode

We focus on one networking configuration in this post, but the general concepts are applicable to any networking configuration.
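
If you're not sure which configuration your environment uses, you can check the web server access mode and network configuration with the AWS CLI. The following is a quick sketch; [mwaa-environment-name] is a placeholder for your environment name:

# Inspect the web server access mode and network configuration
# ([mwaa-environment-name] is a placeholder)
aws mwaa get-environment --name [mwaa-environment-name] \
    --query 'Environment.{AccessMode: WebserverAccessMode, Network: NetworkConfiguration}'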

The solution we walk through relies on the fact that Amazon MWAA runs a startup script (startup.sh) during startup on each individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. This startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.

The following steps allow us to accomplish this:

  1. Create and install the startup script (startup.sh) in the Amazon MWAA environment. This script sets the environment variable for selectively installing dependencies.
  2. Create and install global Python dependencies (requirements.txt) in the Amazon MWAA environment. This file contains the global dependencies required by all Amazon MWAA components.
  3. Create and install component-specific Python dependencies in the Amazon MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the necessary dependencies.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An Amazon MWAA environment deployed with public access mode for the web server
  • Versioning enabled on your Amazon MWAA environment's Amazon Simple Storage Service (Amazon S3) bucket (see the example command after this list)
  • Amazon CloudWatch logging enabled at the INFO level for the worker and web server
  • A Git repository accessible from within your VPC
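
If versioning isn't already enabled on the bucket, the following sketch shows one way to turn it on and confirm it; [mwaa-environment-bucket] is a placeholder for your bucket name:

# Enable versioning on the environment bucket
aws s3api put-bucket-versioning --bucket [mwaa-environment-bucket] \
    --versioning-configuration Status=Enabled

# Confirm that versioning is now enabled
aws s3api get-bucket-versioning --bucket [mwaa-environment-bucket]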

Additionally, we add a sample Python package to the Git repository:

# Clone the public scrapy repository from GitHub and the empty private repository
git clone https://github.com/scrapy/scrapy
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy scrapylocal

# Strip the public repository's Git metadata and copy the source into the private repository
rm -rf ./scrapy/.git*
cp -r ./scrapy/* ./scrapylocal

# Commit and push the package to the private repository
cd scrapylocal
git add --all
git commit -m "first commit"
git push
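
The preceding example uses an AWS CodeCommit repository as the private Git repository. If you do the same, one common way to authenticate Git over HTTPS (assuming your IAM principal has CodeCommit permissions) is the AWS CLI credential helper:

# Configure Git to obtain CodeCommit credentials from the AWS CLI
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true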

Create and install the startup script in the Amazon MWAA environment

Create the startup.sh file using the following example code:

#!/bin/sh

echo "Printing Apache Airflow component"
echo $MWAA_AIRFLOW_COMPONENT

if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
then
    sudo yum -y install libaio
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "webserver" ]]
then
    echo "Setting extended Python requirements for web servers"
    export EXTENDED_REQUIREMENTS="webserver_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "worker" ]]
then
    echo "Setting extended Python requirements for workers"
    export EXTENDED_REQUIREMENTS="worker_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "scheduler" ]]
then
    echo "Setting extended Python requirements for schedulers"
    export EXTENDED_REQUIREMENTS="scheduler_reqs.txt"
fi
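
You can sanity check the branching logic locally before uploading the script. The following is a minimal sketch; note that the yum install line runs for the non-web-server values, so use a disposable Amazon Linux host or comment that line out first:

# Invoke the script once per component type, simulating the
# MWAA_AIRFLOW_COMPONENT variable that Amazon MWAA sets at startup
for component in webserver worker scheduler; do
    MWAA_AIRFLOW_COMPONENT="$component" sh ./startup.sh
done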

Upload startup.sh to the S3 bucket for your Amazon MWAA environment, and update the environment to use it:

aws s3 cp startup.sh s3://[mwaa-environment-bucket]
aws mwaa update-environment --name [mwaa-environment-name] --startup-script-s3-path startup.sh

Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.

[Screenshot: worker_console log showing the startup script running and setting the environment variable]
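
You can also list the same log streams with the AWS CLI. The following sketch assumes the default log group naming pattern airflow-[mwaa-environment-name]-Worker; verify the log group name in your account before running it:

# List the most recent worker log streams (the log group name is an
# assumption based on the default naming pattern)
aws logs describe-log-streams \
    --log-group-name airflow-[mwaa-environment-name]-Worker \
    --order-by LastEventTime --descending --max-items 5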

Create and install global Python dependencies in the Amazon MWAA environment

Your requirements file must include a --constraint statement to make sure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type.

The following code is an example of the requirements.txt file:

--constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt
-r /usr/local/airflow/dags/${EXTENDED_REQUIREMENTS}

Upload the requirements.txt file to the Amazon MWAA environment S3 bucket:

aws s3 cp requirements.txt s3://[mwaa-environment-bucket]

Create and install component-specific Python dependencies in the Amazon MWAA environment

For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install pprintpp on the web server from the public Python package index. To accomplish that, we need to create the following files (we provide example code).

The following code is for webserver_reqs.txt:

pprintpp

The following code is for scheduler_reqs.txt:

git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy

The following code is for worker_reqs.txt:

git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy

Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the Amazon MWAA environment:

aws s3 cp webserver_reqs.txt s3://[mwaa-environment-bucket]/dags/
aws s3 cp scheduler_reqs.txt s3://[mwaa-environment-bucket]/dags/
aws s3 cp worker_reqs.txt s3://[mwaa-environment-bucket]/dags/

Update the environment for the new requirements file and observe the results

Get the latest object version for the requirements file:

aws s3api list-object-versions --bucket [mwaa-environment-bucket]
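
The full output can be lengthy. A --prefix filter and a JMESPath query narrow it to the latest version of requirements.txt (a sketch using the same bucket placeholder):

# Return only the version ID of the latest requirements.txt object
aws s3api list-object-versions --bucket [mwaa-environment-bucket] \
    --prefix requirements.txt \
    --query 'Versions[?IsLatest].VersionId'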

Update the Amazon MWAA environment to use the new requirements.txt file:

aws mwaa update-environment --name [mwaa-environment-name] --requirements-s3-object-version [s3-object-version]
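
The update takes several minutes to complete. You can poll the environment status until it returns to AVAILABLE (again, the environment name is a placeholder):

# Check the environment status; it shows UPDATING during the update
# and returns to AVAILABLE when the update is complete
aws mwaa get-environment --name [mwaa-environment-name] \
    --query 'Environment.Status' --output text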

Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice the component-specific requirements file is now referenced, and the scrapy package is installed from the private Git repository.

[Screenshot: requirements_install log showing the component-specific requirements file being installed]

[Screenshot: requirements_install log showing scrapy being installed from the Git repository]

Conclusion

In this post, we demonstrated a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.

We hope this post gave you a better understanding of how startup scripts and Python dependency management work in an Amazon MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.


About the Author

Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time on the lake and rooting on the Oklahoma State Cowboys. Go Pokes!
