
Expand data access through Apache Iceberg using Delta Lake UniForm on AWS


The landscape of big data management has been transformed by the rising popularity of open table formats such as Apache Iceberg, Apache Hudi, and Linux Foundation Delta Lake. These formats, designed to address the limitations of traditional data storage systems, have become essential in modern data architectures. As organizations adopt various open table formats to suit their specific needs, the demand for interoperability between these formats has grown significantly. This interoperability is crucial for enabling seamless data access, reducing data silos, and fostering a more flexible and efficient data ecosystem.

Delta Lake UniForm is an open table format extension designed to provide a universal data representation that can be efficiently read by different processing engines. It aims to bridge the gap between various data formats and processing systems, offering a standardized approach to data storage and retrieval. With UniForm, you can read Delta Lake tables as Apache Iceberg tables. This expands data access to a broader choice of analytics engines.

This post explores how to get started with Delta Lake UniForm on Amazon Web Services (AWS). You'll learn how to query Delta Lake native tables through UniForm from different data warehouses or engines such as Amazon Redshift as an example of expanding data access to more engines.

How Delta Lake UniForm works

UniForm allows other table format clients such as Apache Iceberg to access Delta Lake tables. Under the hood, UniForm generates Iceberg metadata files (including metadata and manifest files) that are required for Iceberg clients to access the underlying data files in Delta Lake tables. Both Delta Lake and Iceberg metadata files reference the same data files. UniForm generates the metadata for multiple table formats without duplicating the actual data files. When an Iceberg client reads a UniForm table, it first accesses the Iceberg metadata files for the UniForm table, which then allows the Iceberg client to read the underlying data files.

There are two options to use UniForm:

  • Create a new Delta Lake table with UniForm
  • Enable UniForm on your existing Delta Lake table

First, to create a new Delta Lake table with UniForm enabled, you configure table properties for UniForm in a CREATE TABLE DDL query. The table properties are 'delta.universalFormat.enabledFormats'='iceberg' and 'delta.enableIcebergCompatV2'='true'. When these options are set in the CREATE TABLE query, Iceberg metadata files are generated along with Delta Lake metadata files. In addition to these options, the Delta Lake table protocol versions that define the features supported by the table, delta.minReaderVersion and delta.minWriterVersion, must be set to 2 and 7 or higher respectively. For more information about the table protocol versions, refer to What is a table protocol specification? in the Delta Lake public documentation. Appendix 1: Create a new Delta Lake table with UniForm shows an example query that creates a new Delta Lake UniForm table.

You can also enable UniForm on an existing Delta Lake table. This option is suitable if you already have Delta Lake tables in your environment. Enabling UniForm doesn't affect your current operations on the Delta Lake tables. To enable UniForm on a Delta Lake table, run REORG TABLE db.existing_delta_lake_table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)). After running this query, Delta Lake automatically generates Iceberg metadata files for the Iceberg client. In the example in this post, you use this option and enable UniForm after you create a Delta Lake table.

For information about enabling UniForm, refer to Enable Delta Lake UniForm in the Delta Lake public documentation. Note that the extra package (delta-iceberg) is required to create a UniForm table in the AWS Glue Data Catalog. The extra package is also required to generate Iceberg metadata along with Delta Lake metadata for the UniForm table.

Example use case

A fictional company built a data lake with Delta Lake on Amazon Simple Storage Service (Amazon S3) that's primarily used through Amazon Athena. As its usage expands, this company wants to extend data access to cloud-based data warehouses such as Amazon Redshift for flexible analytics use cases.

There are a few challenges to achieving this requirement. Delta Lake isn't natively supported in Amazon Redshift. For these data warehouses, Delta Lake tables need to be converted to manifest tables, which requires additional operational overhead. You need to run the GENERATE command on Spark or use a crawler in AWS Glue to generate manifest tables, and you need to sync these manifest tables every time the Delta tables are updated.

Delta Lake UniForm can be a solution to meet this requirement. With Delta Lake UniForm, you can make the Delta table compatible with other open table formats such as Apache Iceberg, which is natively supported in Amazon Redshift. Users can query these Delta Lake tables as Iceberg tables through UniForm.

The following diagram describes the architectural overview to achieve this requirement.

(Diagram: solution overview)

In this tutorial, you create a Delta Lake table with a synthetic review dataset that includes different products and customer reviews, and you enable UniForm on that Delta Lake table to make it accessible from Amazon Redshift. Each component works as follows in this scenario:

  • Amazon EMR (Amazon EMR on EC2 cluster with Apache Spark): An Apache Spark application on an Amazon EMR cluster creates a Delta Lake table and enables UniForm on it. Only a Delta Lake client can write to the Delta Lake UniForm table, making Amazon EMR act as the writer.
  • Amazon Redshift: Amazon Redshift uses Iceberg clients to read records from the Delta Lake UniForm table. It's restricted to reading records from the table and can't write to it.
  • Amazon S3 and AWS Glue Data Catalog: These are used to manage the underlying files and the catalog of the Delta Lake UniForm table. The data and metadata files for the table are stored in an S3 bucket. The table is registered in the AWS Glue Data Catalog.

Set up resources

In this section, you complete the following resource setup:

  1. Launch an AWS CloudFormation template to configure resources such as S3 buckets, an Amazon Virtual Private Cloud (Amazon VPC) and a subnet, a database for Delta Lake in the Data Catalog, an AWS Identity and Access Management (IAM) policy and role with the required permissions for Amazon EMR Studio, and an EC2 instance profile for the Amazon EMR on EC2 cluster
  2. Launch an Amazon EMR on EC2 cluster
  3. Create an Amazon EMR Studio Workspace
  4. Upload a Jupyter notebook to the Amazon EMR Studio Workspace
  5. Launch a CloudFormation template to configure Amazon Redshift Serverless and associated subnets

Launch a CloudFormation template to configure basic resources

You use a provided CloudFormation template to set up the resources for the Delta Lake UniForm environment. The template creates the following resources:

  • An S3 bucket to store the Delta Lake table data
  • An S3 bucket to store Amazon EMR Studio Workspace metadata and configuration files
  • An IAM role for Amazon EMR Studio
  • An EC2 instance profile for the Amazon EMR on EC2 cluster
  • A VPC and subnet for the Amazon EMR on EC2 cluster
  • A database for a Delta Lake table in the Data Catalog

Complete the following steps to deploy the resources:

  1. Choose Launch stack:
  2. For Stack name, enter delta-lake-uniform-on-aws. For the Parameters, DeltaDatabaseName, PublicSubnetForEMRonEC2, and VpcCIDRForEMRonEC2 are set by default. You can also change the default values. Then, choose Next.
  3. Choose Next.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Submit.
  6. After the stack creation is complete, check the Outputs tab. The resource values are used in the following sections and in the appendices.

Launch an Amazon EMR on EC2 cluster

Complete the following steps to create an Amazon EMR on EC2 cluster:

  1. Open the Amazon EMR on EC2 console.
  2. Choose Create cluster.
  3. Enter delta-lake-uniform-blog-post for Name and confirm that emr-7.3.0 is selected as the release label.
  4. For Application bundle, select Spark 3.5.1, Hadoop 3.3.6, and JupyterEnterpriseGateway 2.6.0.
  5. For AWS Glue Data Catalog settings, enable Use for Spark table metadata.
  6. For Networking, enter the values from the CloudFormation Outputs tab for VpcForEMR and PublicSubnetForEMR into Virtual private cloud (VPC) and Subnet respectively. For EC2 security groups, keep Create ElasticMapReduce-Primary for the primary node and Create ElasticMapReduce-Core for the core and task nodes. The security groups for the Amazon EMR primary and core nodes are created automatically.
  7. For Cluster logs, enter s3://<DeltaLakeS3Bucket>/emr-cluster-logs as the Amazon S3 location. Replace <DeltaLakeS3Bucket> with the S3 bucket from the CloudFormation stack Outputs tab.
  8. For Software settings, select Load JSON from Amazon S3 and enter s3://aws-blogs-artifacts-public/artifacts/BDB-4538/config.json as the Amazon S3 location (see the sample configuration after these steps).
  9. For Amazon EMR service role in the Identity and Access Management (IAM) roles section, choose Create a service role. Then, set Security group to default. If there are existing security groups for the Amazon EMR primary and core or task nodes, set those security groups instead.
  10. For EC2 instance profile for Amazon EMR, choose Choose an existing instance profile, and set Instance profile to EMRInstanceProfileRole.
  11. After reviewing the configuration, choose Create cluster.
  12. When the cluster status is Waiting on the Amazon EMR console, the cluster setup is complete (it takes roughly 10 minutes).
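For reference, the Software settings JSON in step 8 enables Delta Lake on the cluster. The following is a minimal sketch of what such a configuration file might contain, using the standard EMR delta-defaults classification; the actual config.json provided for this post may include additional settings.

[
  {
    "Classification": "delta-defaults",
    "Properties": {
      "delta.enabled": "true"
    }
  }
]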

Create an Amazon EMR Studio Workspace

Complete the following steps to create an Amazon EMR Studio Workspace to use Delta Lake UniForm on Amazon EMR on EC2:

  1. Open the Amazon EMR Studio console.
  2. Choose Create Studio.
  3. For Setup options, choose Custom.
  4. For Studio settings, enter delta-lake-uniform-studio as the Studio name.
  5. For S3 location for Workspace storage, choose Select existing location as the Workspace storage. The S3 location (s3://aws-emr-studio-<ACCOUNT_ID>-<REGION>-delta-lake-uniform-on-aws) can be obtained from EMRStudioS3Bucket on the CloudFormation Outputs tab. Then, choose EMRStudioRole as the IAM role (you can find the IAM role name on the CloudFormation Outputs tab).
  6. For Workspace settings, enter delta-lake-workspace as the Workspace name.
  7. In Networking and security, choose the VPC ID and Subnet ID that you created in Launch a CloudFormation template to configure basic resources. You can obtain the VPC ID and Subnet ID from the VpcForEMR and PublicSubnetForEMR keys on the CloudFormation Outputs tab respectively.
  8. After reviewing the settings, choose Create Studio and launch Workspace.
  9. After the Studio Workspace creation is complete, you're redirected to a Jupyter notebook.

Upload a Jupyter notebook

Complete the following steps to configure a Jupyter notebook to use Delta Lake UniForm with Amazon EMR:

  1. Download delta-lake-uniform-on-aws.ipynb.
  2. Choose the arrow icon at the top of the page and upload the notebook you just downloaded.
  3. Choose and open the notebook (delta-lake-uniform-on-aws.ipynb) you uploaded in the left pane.
  4. After the notebook is opened, choose EMR Compute in the navigation pane.
  5. Attach the Amazon EMR on EC2 cluster you created in the previous section. Choose EMR on EC2 cluster, set the cluster you created previously as the EMR on EC2 cluster, then choose Attach.
  6. After the cluster is attached successfully, Cluster is attached to the Workspace is displayed on the console.

Create a workgroup and a namespace for Amazon Redshift Serverless

In this step, you configure a workgroup and a namespace for Amazon Redshift Serverless to run queries on a Delta Lake UniForm table. You also configure two subnets in the same VPC created by the CloudFormation stack delta-lake-uniform-on-aws. To deploy the resources, complete the following steps:

  1. Choose Launch stack:
  2. For Stack name, enter redshift-serverless-for-delta-lake-uniform.
  3. For Parameters, enter the Availability Zone and an IP range for each subnet. The VPC ID is automatically retrieved from the CloudFormation stack you created in Launch a CloudFormation template to configure basic resources. If you change the default subnet, note that at least one subnet needs to be the same subnet you created for the Amazon EMR on EC2 cluster (by default, the subnet for the Amazon EMR on EC2 cluster is automatically retrieved during this CloudFormation stack creation). You can check the subnet for the cluster on the CloudFormation Outputs tab. Then, choose Next.
  4. Choose Next again, and then choose Submit.
  5. After the stack creation is complete, check the CloudFormation Outputs tab. Make a note of the two subnet IDs to use later in Run queries against the UniForm table from Amazon Redshift.

Now you're ready to use Delta Lake UniForm on Amazon EMR.

Enable Delta Lake UniForm

Start by creating a Delta Lake table that contains the customer review dataset. After creating the table, run a REORG query to enable UniForm on the Delta Lake table.

Create a Delta Lake table

Complete the following steps to create a Delta Lake table based on a customer review dataset and review the table metadata:

  1. Return to the Jupyter notebook connected to the Amazon EMR on EC2 cluster and run the following cell to add delta-iceberg.jar, which is required for UniForm, and to configure the Spark extension.
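The original cell is shown as a notebook screenshot. As a minimal sketch, a %%configure cell along the following lines loads the delta-iceberg package and sets the Delta Lake Spark extension; the exact package coordinates and version are assumptions and should match your EMR release.

%%configure -f
{
  "conf": {
    "spark.jars.packages": "io.delta:delta-iceberg_2.12:3.2.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
  }
}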

  2. Initialize the SparkSession. The following configuration is necessary to use Iceberg through UniForm. Before running the code, replace <DeltaLakeS3Bucket> with the name of the S3 bucket for Delta Lake, which you can find on the CloudFormation stack Outputs tab.

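The initialization cell is shown as a screenshot in the original post; a minimal sketch under the same assumptions (the warehouse directory placeholder comes from the surrounding text):

from pyspark.sql import SparkSession

# Replace <DeltaLakeS3Bucket> with the bucket name from the CloudFormation Outputs tab.
spark = (
    SparkSession.builder
    .appName("delta-lake-uniform-on-aws")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.warehouse.dir", "s3://<DeltaLakeS3Bucket>/warehouse/")
    .getOrCreate()
)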

  3. Create a Spark DataFrame from the customer reviews.

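The dataset location isn't reproduced here, so the path below is a hypothetical placeholder; the cell presumably reads the synthetic review dataset into a DataFrame along these lines:

# Hypothetical source path for the synthetic customer review dataset.
df = spark.read.parquet("s3://<ReviewDatasetLocation>/customer-reviews/")
df.printSchema()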

  4. Create a Delta Lake table with the customer reviews dataset. This step takes roughly 5 minutes.

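A sketch of the table-creation cell, assuming the database and table names used elsewhere in this post:

# Database-qualified table name used throughout the notebook.
DB_TBL = "delta_uniform_db.customer_reviews"

# Write the reviews DataFrame as a Delta Lake table registered in the Data Catalog.
df.write.format("delta").mode("overwrite").saveAsTable(DB_TBL)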

  5. Run DESCRIBE EXTENDED {DB_TBL} in the next cell to review the table. The output includes the table schema, location, table properties, and so on.

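The cell amounts to a single metadata query:

spark.sql(f"DESCRIBE EXTENDED {DB_TBL}").show(100, truncate=False)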

The Delta Lake table creation is complete. Next, enable UniForm on this Delta Lake table.

Run a REORG query to enable UniForm

To allow an Iceberg client to access the Delta Lake table you created, enable UniForm on the table. (You can also create a new Delta Lake table with UniForm already enabled; for more information, see Appendix 1 at the end of this post.) To enable UniForm and review the table metadata, complete the following steps:

  1. Run the following query to enable UniForm on the Delta Lake table. To enable UniForm on an existing Delta Lake table, you run a REORG query against the table.

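Based on the REORG syntax introduced earlier, the cell looks like the following sketch:

# Enable UniForm (Iceberg compatibility) on the existing Delta Lake table.
spark.sql(f"REORG TABLE {DB_TBL} APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2))").show()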

  2. Run DESCRIBE EXTENDED {DB_TBL} in the next cell to review the table metadata and compare it from before and after enabling UniForm. The new properties, such as delta.enableIcebergCompatV2=true and delta.universalFormat.enabledFormats=iceberg, are added to the table properties.


  3. Run aws s3 ls s3://<DeltaLakeS3Bucket>/warehouse/ --recursive to confirm that the Iceberg table metadata has been created. Replace <DeltaLakeS3Bucket> with the S3 bucket from the CloudFormation Outputs tab. The command output lists the table metadata and data files, confirming that Delta Lake UniForm generates Iceberg metadata files alongside the Delta Lake metadata files.

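As an illustrative sketch (the file names are assumptions), the listing contains both the Delta Lake transaction log and the UniForm-generated Iceberg metadata under the same table prefix, alongside the shared Parquet data files:

warehouse/customer_reviews/_delta_log/00000000000000000000.json
warehouse/customer_reviews/metadata/v1.metadata.json
warehouse/customer_reviews/metadata/<snapshot-id>-m0.avro
warehouse/customer_reviews/part-00000-<uuid>.snappy.parquet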

  4. Before querying the Delta Lake UniForm table from an Iceberg client, run the following analytic query against the Delta Lake UniForm table from the Amazon EMR on EC2 side, and review the review count for each product category.

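This is the same aggregation you run later from Amazon Redshift; a sketch in Spark SQL:

spark.sql(f"""
    SELECT product_category, count(*) AS count_by_product_category
    FROM {DB_TBL}
    GROUP BY product_category
    ORDER BY count_by_product_category DESC
""").show()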

  5. The query result shows the review count by product_category.

Enabling UniForm on the Delta Lake table is complete, and you can now query the Delta Lake table from an Iceberg client. Next, you query the Delta Lake table as an Iceberg table from Amazon Redshift.

Run queries against the UniForm table from Amazon Redshift

In the previous section, you enabled UniForm on your existing Delta Lake table. This allows you to run queries against the Delta Lake table from Amazon Redshift as if it were an Iceberg table. In this section, you run an analytic query on the UniForm table using Amazon Redshift Serverless, and then add records with a new product category to the UniForm table through the Jupyter notebook connected to the Amazon EMR on EC2 cluster. Finally, you verify the added records with another analytic query from Amazon Redshift. Through this section, you can confirm that Delta Lake UniForm enables Amazon Redshift to query the Delta Lake table.

Query the UniForm table from Amazon Redshift Serverless

  1. Open the Amazon Redshift Serverless console.
  2. In Namespaces/Workgroups, select the delta-lake-uniform-namespace that you created using the CloudFormation stack.
  3. Choose Query data at the top right corner to open the Amazon Redshift query editor.
  4. After opening the editor, select the delta-lake-uniform-workgroup workgroup in the left pane.
  5. Choose Create connection.
  6. After you successfully create a connection, you can see the delta_uniform_db database and the customer_reviews table you created in the left pane of the editor.
  7. Copy and paste the following analytic query into the editor and choose Run.
SELECT product_category, count(*) as count_by_product_category
FROM "awsdatacatalog"."delta_uniform_db"."customer_reviews"
GROUP BY product_category ORDER BY count_by_product_category DESC

  8. The editor shows the same review count by product_category as you obtained from the Jupyter notebook in Run a REORG query to enable UniForm.


Add new product category records to the Delta Lake UniForm table from Amazon EMR

Return to the Jupyter notebook on the Amazon EMR Workspace to add new records with a new product category (Books) to the Delta Lake UniForm table. After adding the records, query the UniForm table again from Amazon Redshift Serverless.

  1. In the Jupyter notebook, go to Add new product category records into the UniForm table and run the following cell to load the new records.

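Assuming the Books records come from the same synthetic dataset, the loading cell might look like this (the source path is a hypothetical placeholder):

# Hypothetical path for the additional Books reviews.
df_books = spark.read.parquet("s3://<ReviewDatasetLocation>/customer-reviews-books/")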

  2. Run the following cell and review the five records with Books as the product category.

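A sketch of the review cell, reusing the df_books name assumed above:

df_books.select("product_category", "product_title", "star_rating").show(5, truncate=False)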

  3. Add the new reviews with the Books product category. This takes around 2 minutes.

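Because only the Delta Lake writer side (Amazon EMR) writes to the table, the append goes through the Delta Lake format and UniForm regenerates the Iceberg metadata automatically; a sketch:

# Append the Books reviews to the UniForm-enabled table.
df_books.write.format("delta").mode("append").saveAsTable(DB_TBL)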

In the next section, you run a query on the UniForm table from Amazon Redshift Serverless to check whether the new records with the Books product category have been added.

Review the added records in the Delta Lake UniForm table from Amazon Redshift Serverless

To check that the result output includes the records of the Books product category, complete the following steps:

  1. In the Amazon Redshift query editor, run the following query.
SELECT product_category, count(*) as count_by_product_category
FROM "awsdatacatalog"."delta_uniform_db"."customer_reviews"
GROUP BY product_category ORDER BY count_by_product_category DESC

  2. The query output shows that the new product category Books has been added to the table, as confirmed from the Amazon Redshift side.

Now you can query the Delta Lake table from Amazon Redshift by enabling Delta Lake UniForm.

Clean up resources

To clean up your resources, complete the following steps:

  1. In the Amazon EMR Workspaces console, choose Actions and then Delete to delete the Workspace.
  2. Choose Delete to delete the Studio.
  3. In the Amazon EMR on EC2 console, choose Terminate to delete the Amazon EMR on EC2 cluster.
  4. In the Amazon S3 console, choose Empty to delete all objects in the following S3 buckets:
    1. The S3 bucket for Amazon EMR Studio, such as aws-emr-studio-<ACCOUNT_ID>-<REGION>-delta-lake-uniform-on-aws. Replace <ACCOUNT_ID> and <REGION> with your account ID and the bucket's Region.
    2. The S3 bucket for Delta Lake tables, such as delta-lake-uniform-on-aws-deltalakes3bucket-abcdefghijk.
  5. After you confirm that the two buckets are empty, delete the CloudFormation stack redshift-serverless-for-delta-lake-uniform.
  6. After that CloudFormation stack has been deleted, delete the CloudFormation stack delta-lake-uniform-on-aws.

Conclusion

Delta Lake UniForm on AWS represents an advancement in addressing the challenges of data interoperability and accessibility in modern big data architectures. By enabling Delta Lake tables to be read as Apache Iceberg tables, UniForm expands data access capabilities, allowing organizations to use a broader range of analytics engines and data warehouses such as Amazon Redshift.

The practical implications of this technology are substantial, offering new possibilities for data analysis and insights across diverse platforms. As organizations continue to navigate the complexities of big data, solutions like Delta Lake UniForm that promote interoperability and reduce data silos will become increasingly valuable.

By adopting these advanced open table formats and using cloud platforms such as AWS, organizations can build more robust and efficient data ecosystems. This approach not only enhances the value of existing data assets but also fosters a more agile and adaptable data strategy, ultimately driving innovation and improving decision-making processes in our data-driven world.

Appendix 1: Create a new Delta Lake table with UniForm

You can create a Delta Lake table with UniForm enabled using the following DDL.

CREATE TABLE IF NOT EXISTS delta_uniform_db.customer_reviews_create (
   marketplace string,
   customer_id string,
   review_id string,
   product_id string,
   product_title string,
   star_rating bigint,
   helpful_votes bigint,
   total_votes bigint,
   insight string,
   review_headline string,
   review_body string,
   review_date timestamp,
   review_year bigint,
   product_category string)
USING delta
TBLPROPERTIES (
   'delta.universalFormat.enabledFormats'='iceberg',
   'delta.enableIcebergCompatV2'='true',
   'delta.minReaderVersion'='2',
   'delta.minWriterVersion'='7')

Appendix 2: Run queries from Snowflake against the UniForm table

Delta Lake UniForm also lets you run queries on a Delta Lake table from Snowflake. In this section, you run the same analytic query on the UniForm table using Snowflake as you previously did using Amazon Redshift Serverless in Run queries against the UniForm table from Amazon Redshift. You then confirm that the query results from Snowflake match the results obtained from the Amazon Redshift Serverless query.

Configure IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3

To query the Delta Lake UniForm table in the Data Catalog from Snowflake, the following configurations are required:

  1. IAM roles: Create IAM roles for Snowflake to access the Data Catalog and Amazon S3.
  2. Data Catalog integration with Snowflake: Snowflake provides two catalog options for Iceberg tables: using Snowflake as the Iceberg catalog, or using an external catalog such as the Data Catalog. In this post, you choose AWS Glue Data Catalog as the external catalog. For information about the catalog options, refer to Iceberg catalog options in the Snowflake public documentation.
  3. An external volume for Amazon S3: To access the UniForm table from Snowflake, an external volume for Amazon S3 needs to be configured. With this configuration, Snowflake can connect to the S3 bucket that you created for Iceberg tables. For information about the external volume, refer to Configure an external volume for Iceberg tables.

Create IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3

Create the following two IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3:

  • SnowflakeIcebergGlueCatalogRole: This IAM role is used by Snowflake to access the Delta Lake UniForm table in the AWS Glue Data Catalog.
  • SnowflakeIcebergS3Role: This IAM role is used by Snowflake to access the table's underlying data in the S3 bucket.

To configure the IAM roles, complete the following steps:

  1. Choose Launch stack:
  2. Enter snowflake-iceberg as the stack name and choose Next.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Submit.
  5. After the stack creation is complete, check the CloudFormation Outputs tab. Make a note of the names and ARNs of the two IAM roles, which are used in the following sections.

Create an AWS Glue Data Catalog integration

Create a catalog integration for AWS Glue Data Catalog. For more information, refer to Configure a catalog integration for AWS Glue in the Snowflake public documentation. To configure the catalog integration, complete the following steps:

  1. Access your Snowflake account and open an empty worksheet (query editor).
  2. Run the following query to create a catalog integration with AWS Glue Data Catalog. Replace <YOUR_ACCOUNT_ID> with your AWS account ID, which appears in the IAM role ARN from the snowflake-iceberg CloudFormation Outputs tab, and replace <REGION> with the Region of the AWS Glue Data Catalog.
CREATE CATALOG INTEGRATION glue_catalog_integration
  CATALOG_SOURCE=GLUE
  CATALOG_NAMESPACE='delta_uniform_db'
  TABLE_FORMAT=ICEBERG
  GLUE_AWS_ROLE_ARN='arn:aws:iam::<YOUR_ACCOUNT_ID>:role/SnowflakeIcebergGlueCatalogRole'
  GLUE_CATALOG_ID='<YOUR_ACCOUNT_ID>'
  GLUE_REGION='<REGION>'
  ENABLED=TRUE;

  3. Retrieve GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID by running DESCRIBE CATALOG INTEGRATION glue_catalog_integration in the editor. The output is similar to the following:
+--------------------------+---------------+----------------------------------------------------------------+------------------+
| property                 | property_type | property_value                                                 | property_default |
|--------------------------+---------------+----------------------------------------------------------------+------------------|
| ENABLED                  | Boolean       | true                                                           | false            |
| CATALOG_SOURCE           | String        | GLUE                                                           |                  |
| CATALOG_NAMESPACE        | String        | delta_uniform_db                                               |                  |
| TABLE_FORMAT             | String        | ICEBERG                                                        |                  |
| REFRESH_INTERVAL_SECONDS | Integer       | 30                                                             | 30               |
| GLUE_AWS_ROLE_ARN        | String        | arn:aws:iam::123456789012:role/SnowflakeIcebergGlueCatalogRole |                  |
| GLUE_CATALOG_ID          | String        | 123456789012                                                   |                  |
| GLUE_REGION              | String        | us-east-1                                                      |                  |
| GLUE_AWS_IAM_USER_ARN    | String        | arn:aws:iam::123456789012:user/<ID>                            |                  |
| GLUE_AWS_EXTERNAL_ID     | String        | An external ID specified in the IAM role trust relationships   |                  |
| COMMENT                  | String        |                                                                |                  |
+--------------------------+---------------+----------------------------------------------------------------+------------------+

  4. Update the IAM role you created using the CloudFormation stack to enable Snowflake to access AWS Glue Data Catalog using that IAM role. Open Trust relationships for SnowflakeIcebergGlueCatalogRole on the IAM console, choose Edit, and update the trust relationship with the following policy. Replace <GLUE_AWS_IAM_USER_ARN> and <GLUE_AWS_EXTERNAL_ID> with the values you obtained in the previous step.
{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Sid": "",
         "Effect": "Allow",
         "Principal": {
            "AWS": "<GLUE_AWS_IAM_USER_ARN>"
         },
         "Action": "sts:AssumeRole",
         "Condition": {
            "StringEquals": {
               "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>"
            }
         }
      }
   ]
}

You have completed setting up the IAM role for Snowflake to access your Data Catalog resources. Next, configure the IAM role for Amazon S3 access.

Register Amazon S3 as an external volume

In this section, you configure an external volume for Amazon S3. Snowflake accesses the UniForm table data files in S3 through the external volume. For the configuration of an external volume for S3, refer to Configure an external volume for Amazon S3 in the Snowflake public documentation. To configure the external volume, complete the following steps:

  1. In the query editor, run the following query to create an external volume for the Delta Lake S3 bucket. Replace <DeltaLakeS3Bucket> with the name of the S3 bucket that you created in Launch a CloudFormation template to configure basic resources (from the CloudFormation Outputs tab). Replace <ACCOUNT_ID> with your AWS account ID.
CREATE OR REPLACE EXTERNAL VOLUME delta_lake_uniform_s3
   STORAGE_LOCATIONS =
      (
         (
            NAME = 'delta-lake-uniform-on-aws'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://<DeltaLakeS3Bucket>'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<ACCOUNT_ID>:role/SnowflakeIcebergS3Role'
         )
      );

  2. Retrieve STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID by running DESCRIBE EXTERNAL VOLUME delta_lake_uniform_s3 in the editor. The output is similar to the following:
+-------------------+--------------------+---------------+---------------------+------------------+
| parent_property   | property           | property_type | property_value      | property_default |
|-------------------+--------------------+---------------+---------------------+------------------|
|                   | ALLOW_WRITES       | Boolean       | true                | true             |
| STORAGE_LOCATIONS | STORAGE_LOCATION_1 | String        |                     |                  |
| STORAGE_LOCATIONS | ACTIVE             | String        | uniform_s3_location |                  |
+-------------------+--------------------+---------------+---------------------+------------------+

  3. Update the IAM role you created using the CloudFormation template (in Create IAM roles for Snowflake to access AWS Glue Data Catalog and Amazon S3) to enable Snowflake to use this IAM role. Open Trust relationships for SnowflakeIcebergS3Role on the IAM console, choose Edit, and update the trust relationship with the following policy. Replace <STORAGE_AWS_IAM_USER_ARN> and <STORAGE_AWS_EXTERNAL_ID> with the values from the previous step.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}

The next step is to create an Iceberg table to run queries from Snowflake.

Create an Iceberg table in Snowflake

In this section, you create an Iceberg table in Snowflake. The table is an entry point for Snowflake to access the Delta Lake UniForm table in the AWS Glue Data Catalog. To create the table, complete the following steps:

  1. (Optional) If you don't have a database in Snowflake, run CREATE DATABASE <DATABASE_NAME>, replacing <DATABASE_NAME> with a unique database name for the Iceberg table.
  2. Run the following query in the Snowflake query editor. In this case, the database delta_uniform_snow_db is selected for the table. Configure the following parameters:
    1. EXTERNAL_VOLUME: the external volume created by the CREATE OR REPLACE EXTERNAL VOLUME query in the previous section, such as delta_lake_uniform_s3.
    2. CATALOG: the catalog integration created by the CREATE CATALOG INTEGRATION query in the previous section, such as glue_catalog_integration.
    3. CATALOG_TABLE_NAME: the name of the Delta Lake UniForm table you created in the Data Catalog, such as customer_reviews.

The complete query is below:

CREATE OR REPLACE ICEBERG TABLE customer_reviews_snow
  EXTERNAL_VOLUME='delta_lake_uniform_s3'
  CATALOG='glue_catalog_integration'
  CATALOG_TABLE_NAME='customer_reviews';

After the table creation is complete, you're ready to query the UniForm table in the AWS Glue Data Catalog from Snowflake.

Query the UniForm table from Snowflake

In this step, you query the UniForm table from Snowflake. Paste and run the following analytic query in the Snowflake query editor.

SELECT product_category, count(*) as count_by_product_category
FROM customer_reviews_snow
GROUP BY product_category ORDER BY count_by_product_category DESC

The query result shows the same output as you saw in the Review the added records in the Delta Lake UniForm table from Amazon Redshift Serverless section.

+--------------------------------------------------+
| PRODUCT_CATEGORY     | COUNT_BY_PRODUCT_CATEGORY |
|----------------------+---------------------------|
| Office_Products      | 9673711                   |
| Books                | 9672664                   |
| Apparel              | 6448747                   |
| Computers            | 3224215                   |
| Beauty_Personal_Care | 3223599                   |
+--------------------------------------------------+

Now you can query the Delta Lake table from Snowflake by enabling Delta Lake UniForm.


About the Authors

Tomohiro Tanaka is a Senior Cloud Support Engineer at Amazon Web Services. He's passionate about helping customers use Apache Iceberg for their data lakes on AWS. In his free time, he enjoys coffee breaks with his colleagues and making coffee at home.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team, based in Tokyo, Japan. He's responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
