AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. Today, we're launching AWS Glue 5.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue 5.0 upgrades the Spark engines to Apache Spark 3.5.2 and Python 3.11, giving you newer Spark and Python releases so you can develop, run, and scale your data integration workloads and get insights faster.
This post describes what's new in AWS Glue 5.0, its performance improvements, key highlights on Spark and related libraries, and how to get started on AWS Glue 5.0.
What’s new in AWS Glue 5.0
AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, and Java 17 with new performance and security improvements from open source. AWS Glue 5.0 also updates support for open table format libraries to Apache Hudi 0.15.0, Apache Iceberg 1.6.1, and Delta Lake 3.2.1 so you can solve advanced use cases around performance, cost, governance, and privacy in your data lakes. AWS Glue 5.0 adds support for Spark-native fine-grained access control with AWS Lake Formation so you can apply table- and column-level permissions on an Amazon Simple Storage Service (Amazon S3) data lake for write operations (such as INSERT INTO and INSERT OVERWRITE) with Spark jobs.
Key features include:
- Amazon SageMaker Unified Studio support
- Amazon SageMaker Lakehouse support
- Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17
- Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1
- Spark-native fine-grained access control using Lake Formation
- Amazon S3 Access Grants support
- `requirements.txt` support to install additional Python libraries
- Data lineage support in Amazon DataZone
Amazon SageMaker Unified Studio support
Amazon SageMaker Unified Studio supports AWS Glue 5.0 as the compute runtime for unified notebooks and the visual ETL flow editor.
Amazon SageMaker Lakehouse support
AWS Glue 5.0 supports native integration with Amazon SageMaker Lakehouse to enable unified access across Amazon Redshift data warehouses and S3 data lakes.
Frameworks updated to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17
AWS Glue 5.0 upgrades the runtimes to Spark 3.5.2, Python 3.11, Scala 2.12.18, and Java 17. AWS Glue 5.0 uses the AWS performance-optimized Spark runtime, which is 3.9 times faster than open source Spark. AWS Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%.
For more details about updated library dependencies, see the Dependent library upgrades section.
Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1
AWS Glue 5.0 upgrades the open table format libraries to Hudi 0.15.0, Iceberg 1.6.1, and Delta Lake 3.2.1.
Spark-native fine-grained access control using Lake Formation
AWS Glue 5.0 supports AWS Lake Formation Fine-Grained Access Control (FGAC) through native Spark DataFrames and Spark SQL.
S3 Access Grants support
S3 Access Grants provides a simplified model for defining access permissions to data in Amazon S3 by prefix, bucket, or object. AWS Glue 5.0 supports S3 Access Grants through the EMR File System (EMRFS) using additional Spark configurations:
- Key: `--conf`
- Value: `spark.hadoop.fs.s3.s3AccessGrants.enabled=true --conf spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM=false`

To learn more, refer to the documentation.
requirements.txt support to install additional Python libraries
In AWS Glue 5.0, you can provide the standard `requirements.txt` file to manage Python library dependencies. To do this, provide the following job parameters:
- Parameter 1:
  - Key: `--python-modules-installer-option`
  - Value: `-r`
- Parameter 2:
  - Key: `--additional-python-modules`
  - Value: `s3://path_to_requirements.txt`
AWS Glue 5.0 nodes initially load the Python libraries specified in `requirements.txt`. The following code illustrates a sample `requirements.txt`:
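The sample file itself was not reproduced here; a `requirements.txt` along the following lines would work. The specific libraries and version pins below are illustrative assumptions, not a recommendation:

```
# Illustrative example only; pin the libraries your job actually needs.
pandas==2.2.1
pyarrow==16.1.0
requests==2.32.3
```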
Data lineage support in Amazon DataZone (preview)
AWS Glue 5.0 supports data lineage in Amazon DataZone in preview. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone.
To configure this on the AWS Glue console, enable Generate lineage events, and enter your Amazon DataZone domain ID on the Job details tab.
Alternatively, you can provide the following job parameter (provide your DataZone domain ID):
- Key: `--conf`
- Value: `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId=<Your-Domain-ID>`

Learn more in Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview.
Improved performance
AWS Glue 5.0 improves the price-performance of your AWS Glue jobs. AWS Glue 5.0 is 32% faster than AWS Glue 4.0 and reduces costs by 22%. The following chart shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between AWS Glue 4.0 and AWS Glue 5.0. The TPC-DS dataset is located in an S3 bucket in Parquet format, and we used 30 G.2X workers in AWS Glue. We observed that our AWS Glue 5.0 TPC-DS tests on Amazon S3 were 58% faster than those on AWS Glue 4.0 while reducing cost by 36%.
| Metric | AWS Glue 4.0 | AWS Glue 5.0 |
| --- | --- | --- |
| Total Query Time (seconds) | 1896.1904 | 1197.78755 |
| Geometric Mean (seconds) | 10.09472 | 6.82208 |
| Estimated Cost ($) | 45.85533 | 29.20133 |
The following graphs illustrate the comparison of performance and cost.
Dependent library upgrades
The following table lists dependency upgrades.
| Dependency | Version in AWS Glue 4.0 | Version in AWS Glue 5.0 |
| --- | --- | --- |
| Spark | 3.3.0 | 3.5.2 |
| Hadoop | 3.3.3 | 3.4.0 |
| Scala | 2.12 | 2.12.18 |
| Hive | 2.3.9 | 2.3.9 |
| EMRFS | 2.54.0 | 2.66.0 |
| Arrow | 7.0.0 | 12.0.1 |
| Iceberg | 1.0.0 | 1.6.1 |
| Hudi | 0.12.1 | 0.15.0 |
| Delta Lake | 2.1.0 | 3.2.1 |
| Java | 8 | 17 |
| Python | 3.10 | 3.11 |
| boto3 | 1.26 | 1.34.131 |
| AWS SDK for Java | 1.12 | 2.28.8 |
| AWS Glue Data Catalog Client | 3.7.0 | 4.2.0 |
| EMR DynamoDB Connector | 4.16.0 | 5.6.0 |
The following table lists database connector (JDBC driver) upgrades.

| Driver | Connector Version in AWS Glue 4.0 | Connector Version in AWS Glue 5.0 |
| --- | --- | --- |
| MySQL | 8.0.23 | 8.0.33 |
| Microsoft SQL Server | 9.4.0 | 10.2.0 |
| Oracle Databases | 21.7 | 23.3.0.23.09 |
| PostgreSQL | 42.3.6 | 42.7.3 |
| Amazon Redshift | redshift-jdbc42-2.1.0.16 | redshift-jdbc42-2.1.0.29 |
The following table lists Spark connector upgrades:

| Driver | Connector Version in AWS Glue 4.0 | Connector Version in AWS Glue 5.0 |
| --- | --- | --- |
| Amazon Redshift | 6.1.3 | 6.3.0 |
| OpenSearch | 1.0.1 | 1.2.0 |
| MongoDB | 10.0.4 | 10.3.0 |
| Snowflake | 2.12.0 | 3.0.0 |
| BigQuery | 0.32.2 | 0.32.2 |
Apache Spark highlights
Spark 3.5.2 in AWS Glue 5.0 brings a number of valuable features, which we highlight in this section. To learn more about the highlights and improvements of Spark 3.4 and 3.5, refer to Spark Release 3.4.0 and Spark Release 3.5.0.
Apache Arrow-optimized Python UDFs
Python user-defined functions (UDFs) enable users to build custom code for data processing needs, providing flexibility and accessibility. However, performance suffers because UDFs require serialization between the Python and JVM processes. Spark 3.5's Apache Arrow-optimized UDFs solve this by keeping data in shared memory using Arrow's high-performance columnar format, eliminating serialization overhead and making UDFs efficient for large-scale processing.
To use Arrow-optimized Python UDFs, set `spark.sql.execution.pythonUDF.arrow.enabled` to `True`.
Python user-defined table functions
A user-defined table function (UDTF) is a function that returns an entire output table instead of a single value. PySpark users can now write custom UDTFs with Python logic and use them in PySpark and SQL queries. Called in the FROM clause, UDTFs can accept zero or more arguments, either as scalar expressions or table arguments. The UDTF's return type, defined as either a `StructType` (for example, `StructType().add("c1", StringType())`) or a DDL string (for example, `c1: string`), determines the output table's schema.
RocksDB state store enhancements
In Spark 3.2, the RocksDB state store provider was added as a built-in state store implementation.
Changelog checkpointing
A new checkpoint mechanism for the RocksDB state store provider called changelog checkpointing persists the changelog (updates) of the state. This reduces the commit latency, thereby reducing end-to-end latency significantly.
You can enable this by setting `spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled` to `True`.
You can also enable this feature with existing checkpoints.
Memory management improvements
Although the RocksDB state store provider is well known to be helpful in managing memory issues on the state, there was no fine-grained memory management. Spark 3.5 introduces more fine-grained memory management, which enables users to cap the total memory usage across RocksDB instances in the same executor process, so users can configure the memory usage per executor process.
Enhanced Structured Streaming
Spark 3.4 and 3.5 have many improvements related to Spark Structured Streaming.
The new `dropDuplicatesWithinWatermark` API deduplicates rows based on certain events. Watermark-based processing allows for more precise control over late data handling:
- Deduplicate the same rows: `dropDuplicatesWithinWatermark()`
- Deduplicate values on 'value' columns: `dropDuplicatesWithinWatermark(['value'])`
- Deduplicate using the `guid` column with a watermark based on the `eventTime` column: `withWatermark("eventTime", "10 hours").dropDuplicatesWithinWatermark(["guid"])`
Get started with AWS Glue 5.0
You can start using AWS Glue 5.0 through AWS Glue Studio, the AWS Glue console, the latest AWS SDK, and the AWS Command Line Interface (AWS CLI).
To start using AWS Glue 5.0 jobs in AWS Glue Studio, open the AWS Glue job, and on the Job Details tab, choose the version Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
To start using AWS Glue 5.0 on an AWS Glue Studio notebook or an interactive session through a Jupyter notebook, set 5.0 in the `%glue_version` magic:
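For example, the first notebook cell would contain (assuming the standard AWS Glue interactive sessions magics):

```
%glue_version 5.0
```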
The following output shows that the session is set to use AWS Glue 5.0:
Conclusion
In this post, we discussed the key features and benefits of AWS Glue 5.0. You can create new AWS Glue jobs on AWS Glue 5.0 to benefit from the improvements, or migrate your existing AWS Glue jobs.
We would like to thank the numerous engineers and leaders who helped build AWS Glue 5.0, which provides customers with a performance-optimized Spark runtime and several new capabilities.
About the Authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
Stuti Deshpande is a Big Data Specialist Solutions Architect at AWS. She works with customers around the globe, providing them strategic and architectural guidance on implementing analytics solutions using AWS. She has extensive experience in big data, ETL, and analytics. In her free time, Stuti likes to travel, learn new dance forms, and enjoy quality time with family and friends.
Martin Ma is a Software Development Engineer on the AWS Glue team. He is passionate about improving the customer experience by applying problem-solving skills to invent new software solutions, as well as constantly searching for ways to simplify existing ones. In his spare time, he enjoys singing and playing the guitar.
Anshul Sharma is a Software Development Engineer on the AWS Glue team.
Rajendra Gujja is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about data.
Maheedhar Reddy Chappidi is a Sr. Software Development Engineer on the AWS Glue team. He is passionate about building fault-tolerant and reliable distributed systems at scale. Outside of his work, Maheedhar is passionate about listening to podcasts and playing with his two-year-old kid.
Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers uncover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.
Savio Dsouza is a Software Development Manager on the AWS Glue team. His team works on generative AI applications for the Data Integration domain and distributed systems for efficiently managing data lakes on AWS and optimizing Apache Spark for performance and reliability.
Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI features for Data Integration and distributed systems for data integration.
Mohit Saxena is a Senior Software Development Manager on the AWS Glue and Amazon EMR team. His team focuses on building distributed systems to enable customers with simple-to-use interfaces and AI-driven capabilities to efficiently transform petabytes of data across data lakes on Amazon S3, and databases and data warehouses on the cloud.