Customers who use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) often need Python dependencies that are hosted in private code repositories. Many customers choose public network access mode for its ease of use and ability to make outbound internet requests, all while maintaining secure access. However, private code repositories may not be accessible via the internet. It's also a best practice to only install Python dependencies where they are needed. You can use Amazon MWAA startup scripts to selectively install Python dependencies required for running code on workers, while avoiding issues due to web server restrictions.
This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your virtual private cloud (VPC).
Solution overview
This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to Amazon MWAA with AWS CodeArtifact for Python dependencies.
The Amazon MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, schedulers, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment such as the one shown in the following diagram, where the environment is using public network access mode for the web servers, and the Apache Airflow workers and schedulers don't have a path to the internet from your VPC.
There are up to four possible networking configurations for an Amazon MWAA environment:
- Public routing and public web server access mode
- Private routing and public web server access mode (pictured in the preceding diagram)
- Public routing and private web server access mode
- Private routing and private web server access mode
We focus on one networking configuration for this post, but the general concepts are applicable to any networking configuration.
The solution we walk through relies on the fact that Amazon MWAA runs a startup script (startup.sh) during startup on each individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. This startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.
The following steps allow us to accomplish this:
- Create and install the startup script (startup.sh) in the Amazon MWAA environment. This script sets the environment variable for selectively installing dependencies.
- Create and install global Python dependencies (requirements.txt) in the Amazon MWAA environment. This file contains the global dependencies required by all Amazon MWAA components.
- Create and install component-specific Python dependencies in the Amazon MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the necessary dependencies.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
- An Amazon MWAA environment deployed with public access mode for the web server
- Versioning enabled on your Amazon MWAA environment's Amazon Simple Storage Service (Amazon S3) bucket
- Amazon CloudWatch logging enabled at the INFO level for the worker and web server
- A Git repository accessible from within your VPC
Additionally, we add a sample Python package to the Git repository:
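The original package and repository URL aren't shown here. As a hedged sketch, the following creates a minimal pip-installable package and pushes it to a Git remote; a local bare repository stands in for the private Git server, so substitute the URL of your private repository (reachable from your VPC). Mirroring an existing package such as scrapy follows the same push pattern.

```shell
# Sketch only: publish a minimal pip-installable package to a Git remote.
# /tmp/private-repo.git stands in for your private Git server.
git init --bare /tmp/private-repo.git

# Create a minimal package (names here are illustrative).
mkdir -p /tmp/demo-pkg/demo_pkg
cd /tmp/demo-pkg
cat > pyproject.toml <<'EOF'
[project]
name = "demo-pkg"
version = "0.1.0"
EOF
echo '__version__ = "0.1.0"' > demo_pkg/__init__.py

# Commit and push to the stand-in private repository.
git init -b main
git add .
git -c user.name=demo -c user.email=demo@example.com commit -m "Add demo package"
git push /tmp/private-repo.git main
```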
Create and install the startup script in the Amazon MWAA environment
Create the startup.sh file using the following example code:
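Amazon MWAA sets the MWAA_AIRFLOW_COMPONENT environment variable to worker, scheduler, or webserver on each component before running the startup script. The following is a minimal sketch; the AIRFLOW_COMPONENT_REQS variable name and the DAGs folder path are assumptions for this example:

```shell
#!/bin/sh

# MWAA_AIRFLOW_COMPONENT is set by Amazon MWAA to "worker", "scheduler",
# or "webserver" before this script runs on each component.
echo "Startup script running on component: ${MWAA_AIRFLOW_COMPONENT}"

# Export a variable (name is an example) pointing at the component-specific
# requirements file in the DAGs folder; requirements.txt references it via -r.
export AIRFLOW_COMPONENT_REQS="/usr/local/airflow/dags/${MWAA_AIRFLOW_COMPONENT}_reqs.txt"
```

Variables exported in startup.sh are available when Amazon MWAA later installs requirements, which is what makes the selective -r reference possible.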
Upload startup.sh to the S3 bucket for your Amazon MWAA environment:
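For example, with the AWS CLI (the bucket and environment names are placeholders; this requires credentials for your account):

```shell
aws s3 cp startup.sh s3://your-mwaa-bucket/startup.sh

# Point the environment at the uploaded script version.
VERSION_ID=$(aws s3api head-object \
  --bucket your-mwaa-bucket --key startup.sh \
  --query VersionId --output text)

aws mwaa update-environment \
  --name your-mwaa-environment \
  --startup-script-s3-path startup.sh \
  --startup-script-s3-object-version "$VERSION_ID"
```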
Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.
Create and install global Python dependencies in the Amazon MWAA environment
Your requirements file must include a constraint statement to make sure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type.
The following code is an example of the requirements.txt file:
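A sketch of the file, assuming the startup script exported a variable named AIRFLOW_COMPONENT_REQS (pip expands ${VAR}-style environment variables in requirements files). The constraints URL is an example and should match your Apache Airflow and Python versions:

```
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt"

# Pull in the component-specific file selected by the startup script.
-r ${AIRFLOW_COMPONENT_REQS}
```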
Upload the requirements.txt file to the Amazon MWAA environment's S3 bucket:
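For example (the bucket name is a placeholder):

```shell
aws s3 cp requirements.txt s3://your-mwaa-bucket/requirements.txt
```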
Create and install component-specific Python dependencies in the Amazon MWAA environment
For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install pprintpp on the web server from the public Python package index. To accomplish that, we need to create the following files (we provide example code):
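The following are hedged examples of the three files; the Git URL is a placeholder for your private repository, and the branch or tag should match what you pushed:

```
# webserver_reqs.txt: installed only on the web server, from public PyPI
pprintpp

# scheduler_reqs.txt and worker_reqs.txt: install scrapy from the private
# Git repository reachable only from the VPC (placeholder URL)
git+https://git.internal.example.com/airflow/scrapy.git@master
```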
Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the Amazon MWAA environment:
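For example (placeholder bucket name, assuming dags/ is your configured DAGs folder):

```shell
aws s3 cp webserver_reqs.txt s3://your-mwaa-bucket/dags/webserver_reqs.txt
aws s3 cp scheduler_reqs.txt s3://your-mwaa-bucket/dags/scheduler_reqs.txt
aws s3 cp worker_reqs.txt s3://your-mwaa-bucket/dags/worker_reqs.txt
```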
Update the environment for the new requirements file and observe the results
Get the latest object version for the requirements file:
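For example (placeholder bucket name; head-object on a versioned bucket returns the latest version's ID):

```shell
aws s3api head-object \
  --bucket your-mwaa-bucket \
  --key requirements.txt \
  --query VersionId --output text
```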
Update the Amazon MWAA environment to use the new requirements.txt file:
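For example, using the object version returned in the previous step (environment name and version ID are placeholders):

```shell
aws mwaa update-environment \
  --name your-mwaa-environment \
  --requirements-s3-path requirements.txt \
  --requirements-s3-object-version "<object-version-id>"
```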
Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice the component-specific requirements are now installing based on the environment variable set by the startup script.
Conclusion
In this post, we demonstrated a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.
We hope this post provided you with a better understanding of how startup scripts and Python dependency management work in an Amazon MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.
About the Author
Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time at the lake and rooting for the Oklahoma State Cowboys. Go Pokes!