AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. Today, we're launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster.
This post describes what's new in AWS Glue 5.0, its performance improvements, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.0.
What’s new in AWS Glue 5.0
AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, and Java 17 with new performance and security improvements from open source. AWS Glue 5.0 also updates support for open table format libraries to Apache Hudi 0.15.0, Apache Iceberg 1.6.1, and Delta Lake 3.2.1 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes. AWS Glue 5.0 adds support for Spark-native fine-grained access control with AWS Lake Formation so you can apply table- and column-level permissions on an Amazon Simple Storage Service (Amazon S3) data lake for write operations (such as INSERT INTO and INSERT OVERWRITE) with Spark jobs.
Key features include:
- Amazon SageMaker Unified Studio support
- Amazon SageMaker Lakehouse support
- Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17
- Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1
- Spark-native fine-grained access control using Lake Formation
- Amazon S3 Access Grants support
- `requirements.txt` support to install additional Python libraries
- Data lineage support in Amazon DataZone
Amazon SageMaker Unified Studio support
Amazon SageMaker Unified Studio supports AWS Glue 5.0 as the compute runtime for unified notebooks and the visual ETL flow editor.
Amazon SageMaker Lakehouse support
AWS Glue 5.0 supports native integration with Amazon SageMaker Lakehouse to enable unified access across Amazon Redshift data warehouses and S3 data lakes.
Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17
AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17. AWS Glue 5.0 uses the AWS performance-optimized Spark runtime, which is 3.9 times faster than open source Spark. AWS Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%.
For more details about updated library dependencies, see the Dependent library upgrades section.
Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1
AWS Glue 5.0 upgrades the open table format libraries to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1.
Spark-native fine-grained access control using Lake Formation
AWS Glue 5.0 supports AWS Lake Formation Fine-Grained Access Control (FGAC) through native Spark DataFrames and Spark SQL.
S3 Access Grants support
S3 Access Grants provides a simplified model for defining access permissions to data in Amazon S3 by prefix, bucket, or object. AWS Glue 5.0 supports S3 Access Grants through the EMR File System (EMRFS) using additional Spark configurations:
- Key: `--conf`
- Value: `spark.hadoop.fs.s3.s3AccessGrants.enabled=true --conf spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM=false`

To learn more, refer to the documentation.
requirements.txt support to install additional Python libraries
In AWS Glue 5.0, you can provide the standard `requirements.txt` file to manage Python library dependencies. To do this, provide the following job parameters:
- Parameter 1:
  - Key: `--python-modules-installer-option`
  - Value: `-r`
- Parameter 2:
  - Key: `--additional-python-modules`
  - Value: `s3://path_to_requirements.txt`
AWS Glue 5.0 nodes initially load the Python libraries specified in `requirements.txt`. The following code illustrates a sample `requirements.txt`:
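The sample file itself was not reproduced here; a `requirements.txt` along the following lines would work. The specific libraries and version pins below are illustrative assumptions, not a recommendation:

```
# Illustrative example only; pin the libraries your job actually needs.
pandas==2.2.1
pyarrow==16.1.0
requests==2.32.3
```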
Data lineage support in Amazon DataZone (preview)
AWS Glue 5.0 supports data lineage in Amazon DataZone in preview. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone.
To configure this on the AWS Glue console, enable Generate lineage events, and enter your Amazon DataZone domain ID on the Job details tab.
Alternatively, you can provide the following job parameter (provide your DataZone domain ID):
- Key: `--conf`
- Value: `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=<Your-Domain-ID>`

Learn more in Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview.
Improved performance
AWS Glue 5.0 improves the price-performance of your AWS Glue jobs. AWS Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%. The following chart shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between AWS Glue 4.0 and AWS Glue 5.0. The TPC-DS dataset is located in an S3 bucket in Parquet format, and we used 30 G.2X workers in AWS Glue. We observed that our AWS Glue 5.0 TPC-DS tests on Amazon S3 were 58% faster than those on AWS Glue 4.0 while reducing cost by 36%.
| Metric | AWS Glue 4.0 | AWS Glue 5.0 |
| --- | --- | --- |
| Total Query Time (seconds) | 1896.1904 | 1197.78755 |
| Geometric Mean (seconds) | 10.09472 | 6.82208 |
| Estimated Cost ($) | 45.85533 | 29.20133 |
The following graphs illustrate the comparison of performance and cost.
Dependent library upgrades
The following table lists dependency upgrades.
| Dependency | Version in AWS Glue 4.0 | Version in AWS Glue 5.0 |
| --- | --- | --- |
| Spark | 3.3.0 | 3.5.2 |
| Hadoop | 3.3.3 | 3.4.0 |
| Scala | 2.12 | 2.12.18 |
| Hive | 2.3.9 | 2.3.9 |
| EMRFS | 2.54.0 | 2.66.0 |
| Arrow | 7.0.0 | 12.0.1 |
| Iceberg | 1.0.0 | 1.6.1 |
| Hudi | 0.12.1 | 0.15.0 |
| Delta Lake | 2.1.0 | 3.2.1 |
| Java | 8 | 17 |
| Python | 3.10 | 3.11 |
| boto3 | 1.26 | 1.34.131 |
| AWS SDK for Java | 1.12 | 2.28.8 |
| AWS Glue Data Catalog Client | 3.7.0 | 4.2.0 |
| EMR DynamoDB Connector | 4.16.0 | 5.6.0 |
The following table lists database connector (JDBC driver) upgrades.

| Driver | Connector Version in AWS Glue 4.0 | Connector Version in AWS Glue 5.0 |
| --- | --- | --- |
| MySQL | 8.0.23 | 8.0.33 |
| Microsoft SQL Server | 9.4.0 | 10.2.0 |
| Oracle Databases | 21.7 | 23.3.0.23.09 |
| PostgreSQL | 42.3.6 | 42.7.3 |
| Amazon Redshift | redshift-jdbc42-2.1.0.16 | redshift-jdbc42-2.1.0.29 |
The following table lists Spark connector upgrades:

| Driver | Connector Version in AWS Glue 4.0 | Connector Version in AWS Glue 5.0 |
| --- | --- | --- |
| Amazon Redshift | 6.1.3 | 6.3.0 |
| OpenSearch | 1.0.1 | 1.2.0 |
| MongoDB | 10.0.4 | 10.3.0 |
| Snowflake | 2.12.0 | 3.0.0 |
| BigQuery | 0.32.2 | 0.32.2 |
Apache Spark highlights
Spark 3.5.2 in AWS Glue 5.0 brings a number of valuable features, which we highlight in this section. To learn more about the highlights and improvements of Spark 3.4 and 3.5, refer to Spark Release 3.4.0 and Spark Release 3.5.0.
Apache Arrow-optimized Python UDFs
Python user-defined functions (UDFs) enable users to build custom code for data processing needs, providing flexibility and accessibility. However, performance suffers because UDFs require serialization between the Python and JVM processes. Spark 3.5's Apache Arrow-optimized UDFs solve this by keeping data in shared memory using Arrow's high-performance columnar format, eliminating serialization overhead and making UDFs efficient for large-scale processing.
To use Arrow-optimized Python UDFs, set `spark.sql.execution.pythonUDF.arrow.enabled` to `True`.
Python user-defined table functions
A user-defined table function (UDTF) is a function that returns an entire output table instead of a single value. PySpark users can now write custom UDTFs with Python logic and use them in PySpark and SQL queries. Called in the FROM clause, UDTFs can accept zero or more arguments, either as scalar expressions or table arguments. The UDTF's return type, defined as either a `StructType` (for example, `StructType().add("c1", StringType())`) or a DDL string (for example, `c1: string`), determines the output table's schema.
RocksDB state store enhancements
In Spark 3.2, the RocksDB state store provider was added as a built-in state store implementation.
Changelog checkpointing
A new checkpoint mechanism for the RocksDB state store provider called changelog checkpointing persists the changelog (updates) of the state. This reduces the commit latency, thereby reducing end-to-end latency significantly.
You can enable this by setting `spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled` to `True`.
You can also enable this feature with existing checkpoints.
Memory management improvements
Although the RocksDB state store provider is well known to be helpful in managing memory issues on the state, there was no fine-grained memory management. Spark 3.5 introduces more fine-grained memory management, which enables users to cap the total memory usage across RocksDB instances in the same executor process, so users can configure the memory usage per executor process.
Enhanced Structured Streaming
Spark 3.4 and 3.5 have many improvements related to Spark Structured Streaming.
The new `dropDuplicatesWithinWatermark` API deduplicates rows based on certain events. Watermark-based processing allows for more precise control over late data handling:
- Deduplicate the same rows: `dropDuplicatesWithinWatermark()`
- Deduplicate values on 'value' columns: `dropDuplicatesWithinWatermark(['value'])`
- Deduplicate using the `guid` column with a watermark based on the `eventTime` column: `withWatermark("eventTime", "10 hours").dropDuplicatesWithinWatermark(["guid"])`
Get started with AWS Glue 5.0
You can start using AWS Glue 5.0 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).
To start using AWS Glue 5.0 jobs in AWS Glue Studio, open the AWS Glue job, and on the Job Details tab, choose the version Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
To start using AWS Glue 5.0 on an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.0 in the `%glue_version` magic:
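For example, the first notebook cell would contain (assuming the standard AWS Glue interactive sessions magics):

```
%glue_version 5.0
```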
The following output shows that the session is set to use AWS Glue 5.0:
Conclusion
In this post, we discussed the key features and benefits of AWS Glue 5.0. You can create new AWS Glue jobs on AWS Glue 5.0 to benefit from the improvements, or migrate your existing AWS Glue jobs.
We would like to thank the numerous engineers and leaders who helped build AWS Glue 5.0, which provides customers with a performance-optimized Spark runtime and several new capabilities.
About the Authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.
Martin Ma is a Software Development Engineer on the AWS Glue team. He is passionate about improving the customer experience by applying problem-solving skills to invent new software solutions, as well as constantly searching for ways to simplify existing ones. In his spare time, he enjoys singing and playing the guitar.
Anshul Sharma is a Software Development Engineer on the AWS Glue team.
Rajendra Gujja is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about data.
Maheedhar Reddy Chappidi is a Sr. Software Development Engineer on the AWS Glue team. He is passionate about building fault-tolerant and reliable distributed systems at scale. Outside of his work, Maheedhar is passionate about listening to podcasts and playing with his two-year-old kid.
Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.
Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on generative AI applications for the Data Integration domain and distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.
Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features for Data Integration and distributed systems for data integration.
Mohit Saxena is a Senior Software Development Manager on the AWS Glue and Amazon EMR team. His team focuses on building distributed systems to enable customers with simple-to-use interfaces and AI-driven capabilities to efficiently transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses on the cloud.