Databricks pioneered the open data lakehouse architecture and has been at the forefront of format interoperability. We’re excited to see more platforms adopt the lakehouse architecture and begin to embrace interoperable formats and standards. Interoperability lets customers reduce expensive data duplication by using a single copy of data with their choice of analytics and AI tools for their workloads. In particular, a common pattern for our customers is to use Databricks’ best-in-class ETL price/performance for upstream data and then access it from BI and analytics tools such as Snowflake.
Unity Catalog is a unified and open governance solution for data and AI assets. A key feature of Unity Catalog is its implementation of the Iceberg REST Catalog APIs. This makes it simple to use an Iceberg-compliant reader without having to manually refresh your metadata location.
In this blog post, we will cover why the Iceberg REST catalog is useful and walk through an example of how to read Unity Catalog tables in Snowflake.
Note: This functionality is available across cloud providers. The following instructions are specific to AWS S3, but it’s possible to use other object storage platforms such as Azure Data Lake Storage (ADLS) or Google Cloud Storage (GCS).
Iceberg REST API Catalog Integration
Apache Iceberg™ maintains atomicity and consistency by creating new metadata files for each table change. This ensures that incomplete writes do not corrupt an existing metadata file. The Iceberg catalog tracks the new metadata per write. However, not all engines can connect to every Iceberg catalog, forcing customers to manually keep track of the new metadata file location.
Iceberg solves interoperability across engines and catalogs with the Iceberg REST Catalog API. The Iceberg REST catalog is a standardized, open API specification that provides a unified interface for Iceberg catalogs, decoupling catalog implementations from clients.
Unity Catalog has implemented the Iceberg REST Catalog APIs since the launch of Universal Format (UniForm) in 2023. Unity Catalog exposes the latest table metadata, guaranteeing interoperability with any Iceberg client compatible with the Iceberg REST Catalog, such as Apache Spark™, Trino, and Snowflake. Unity Catalog’s Iceberg REST Catalog endpoints extend governance and Delta Lake table features like Change Data Feed.
Snowflake’s REST API catalog integration lets you connect to Unity Catalog’s Iceberg REST APIs to retrieve the latest metadata file location. This means that with Unity Catalog, you can read tables directly in Snowflake as if they were Iceberg.
Note: As of writing, Snowflake’s support of the Iceberg REST Catalog is in Public Preview. However, Unity Catalog’s Iceberg REST APIs are Generally Available.
There are four steps to creating a REST catalog integration in Snowflake:
- Enable UniForm on a Delta Lake table in Databricks to generate Iceberg metadata
- Register Unity Catalog in Snowflake as your catalog
- Register an S3 bucket in Snowflake so it recognizes the source data
- Create an Iceberg table in Snowflake so you can query your data
Getting Started
We’ll start in Databricks, with our Unity Catalog-managed table, and make sure it can be read as Iceberg. Then, we’ll move to Snowflake to complete the remaining steps.
Before we start, there are a few components needed:
- A Databricks account with Unity Catalog (this is enabled by default for new workspaces)
- An AWS S3 bucket and IAM privileges
- A Snowflake account that can access your Databricks instance and S3
Unity Catalog namespaces follow a catalog_name.schema_name.table_name format. In the example below, we’ll use uc_catalog_name.uc_schema_name.uc_table_name for our Databricks table.
Step 1: Enable UniForm on a Delta table in Databricks
In Databricks, you can enable UniForm on a Delta Lake table. By default, new tables are managed by Unity Catalog. Full instructions are available in the UniForm documentation but are also included below.
For a new table, you can enable UniForm during table creation in your workspace:
CREATE TABLE uc_table_name(c1 INT) TBLPROPERTIES(
'delta.columnMapping.mode' = 'name',
'delta.enableIcebergCompatV2' = 'true',
'delta.universalFormat.enabledFormats' = 'iceberg'
);
If you have an existing table, you can do this via an ALTER TABLE command:
ALTER TABLE uc_table_name SET TBLPROPERTIES(
'delta.columnMapping.mode' = 'name',
'delta.enableIcebergCompatV2' = 'true',
'delta.universalFormat.enabledFormats' = 'iceberg'
);
You can confirm that a Delta table has UniForm enabled in Catalog Explorer under the Details tab, which also shows the Iceberg metadata location.
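The same check can also be done from a SQL editor in Databricks. The following is a minimal sketch that assumes the example table name used throughout this post; the exact layout of the output depends on your runtime version.

```sql
-- Databricks SQL: list all table properties and look for the UniForm settings.
SHOW TBLPROPERTIES uc_catalog_name.uc_schema_name.uc_table_name;

-- Look up a single property; 'iceberg' is the value set in the commands above.
SHOW TBLPROPERTIES uc_catalog_name.uc_schema_name.uc_table_name
  ('delta.universalFormat.enabledFormats');

-- Print detailed table metadata for the same table.
DESCRIBE EXTENDED uc_catalog_name.uc_schema_name.uc_table_name;
```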
Step 2: Register Unity Catalog in Snowflake
While still in Databricks, create a service principal from the workspace admin settings and generate the accompanying secret and client ID. Instead of a service principal, you can also authenticate with personal access tokens for debugging and testing purposes, but we recommend using a service principal for development and production workloads. From this step, you will need your <deployment-name> and the values for your OAuth <client-id> and <secret> so you can authenticate the integration in Snowflake.
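The service principal also needs permission to read the table in Unity Catalog. The grants below are a sketch, not a complete checklist: they assume the principal is referenced by its application ID (shown with the same <client-id> placeholder) and that read-only access to the example table is sufficient. Your workspace may require additional privileges for access from external engines, so check the Unity Catalog documentation for your setup.

```sql
-- Databricks SQL, run as a user who can manage grants on these objects.
-- `<client-id>` is a placeholder for the service principal's application ID.
GRANT USE CATALOG ON CATALOG uc_catalog_name TO `<client-id>`;
GRANT USE SCHEMA ON SCHEMA uc_catalog_name.uc_schema_name TO `<client-id>`;
GRANT SELECT ON TABLE uc_catalog_name.uc_schema_name.uc_table_name TO `<client-id>`;
```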
Now switch over to your Snowflake account.
Note: There are a few naming differences between Databricks and Snowflake that may be confusing:
- A “catalog” in Databricks is a “warehouse” in the Snowflake Iceberg catalog integration configuration.
- A “schema” in Databricks is a “catalog_namespace” in the Snowflake Iceberg catalog integration.
You’ll see in the example below that the CATALOG_NAMESPACE value is uc_schema_name from our Unity Catalog table.
In Snowflake, create a catalog integration for Iceberg REST catalogs. Following that process, you’ll create a catalog integration as below:
CREATE OR REPLACE CATALOG INTEGRATION unity_catalog_int_oauth
CATALOG_SOURCE = ICEBERG_REST
TABLE_FORMAT = ICEBERG
CATALOG_NAMESPACE = 'uc_schema_name'
REST_CONFIG = (
CATALOG_URI = 'https://<deployment-name>.cloud.databricks.com/api/2.1/unity-catalog/iceberg'
WAREHOUSE = 'uc_catalog_name'
)
REST_AUTHENTICATION = (
TYPE = OAUTH
OAUTH_TOKEN_URI = 'https://<deployment-name>.cloud.databricks.com/oidc/v1/token'
OAUTH_CLIENT_ID = '<client-id>'
OAUTH_CLIENT_SECRET = '<secret>'
OAUTH_ALLOWED_SCOPES = ('all-apis', 'sql')
)
ENABLED = TRUE
REFRESH_INTERVAL_SECONDS = <interval>;
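Before moving on, it can help to confirm that the integration exists and holds the values you expect. This is a minimal sketch using the integration name from the command above.

```sql
-- Snowflake: list the catalog integrations visible to the current role.
SHOW CATALOG INTEGRATIONS;

-- Inspect the properties of the integration (URI, warehouse, namespace, refresh interval).
DESCRIBE CATALOG INTEGRATION unity_catalog_int_oauth;
```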
The REST API catalog integration also unlocks time-based automatic refresh. With automatic refresh, Snowflake polls Unity Catalog for the latest metadata location on a time interval defined for the catalog integration. However, automatic refresh is incompatible with manual refresh, so users may have to wait up to the full time interval after a table update. The REFRESH_INTERVAL_SECONDS parameter configured on the catalog integration applies to all Snowflake Iceberg tables created with this integration. It is not customizable per table.
Step 3: Register your S3 Bucket in Snowflake
In Snowflake, configure an external volume for Amazon S3. This involves creating an IAM role in AWS, configuring the role’s trust policy, and then creating an external volume in Snowflake using the role’s ARN.
For this step, you’ll use the same S3 bucket that Unity Catalog points to.
CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
STORAGE_LOCATIONS =
(
(
NAME = 'my-s3-us-west-2'
STORAGE_PROVIDER = 'S3'
STORAGE_BASE_URL = 's3://<bucket-name>/'
STORAGE_AWS_ROLE_ARN = '<aws-role-arn>'
STORAGE_AWS_EXTERNAL_ID = '<external-id>'
)
);
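To finish the trust policy on the IAM role, you typically need the IAM user ARN and external ID that Snowflake associates with the volume. The sketch below shows one way to look these up and then test access, using the volume name from the command above. The verification function attempts to write a test file as well as read, so it may report a failure if the role is intentionally read-only.

```sql
-- Snowflake: the storage location properties in the output include
-- STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID for the trust policy.
DESC EXTERNAL VOLUME iceberg_external_volume;

-- Check that Snowflake can reach the bucket with the configured role.
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_external_volume');
```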
Step 4: Create an Apache Iceberg™ table in Snowflake
In Snowflake, create an Iceberg table with the previously created catalog integration and external volume to connect to the Delta Lake table. You can choose the name for your Iceberg table in Snowflake; it does not need to match the Delta Lake table in Databricks.
Note: The correct mapping for the CATALOG_TABLE_NAME in Snowflake is the Databricks table name. In our example, this is uc_table_name. You do not need to specify the catalog or schema at this step, because they were already specified in the catalog integration.
CREATE OR REPLACE ICEBERG TABLE <snowflake_table_name>
EXTERNAL_VOLUME = 'iceberg_external_volume'
CATALOG = 'unity_catalog_int_oauth'
CATALOG_TABLE_NAME = 'uc_table_name'
AUTO_REFRESH = TRUE;
The AUTO_REFRESH = TRUE clause in the command above is optional; it enables auto-refresh at the catalog integration’s time interval. Note that if auto-refresh is enabled, manual refresh is disabled.
You have now successfully read the Delta Lake table in Snowflake.
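A quick way to check the result is to list the Iceberg tables and query the new one. This sketch keeps the <snowflake_table_name> placeholder from the command above; replace it with the name you chose.

```sql
-- Snowflake: confirm the Iceberg table was created in the current schema.
SHOW ICEBERG TABLES;

-- Read a few rows that were written by Databricks.
SELECT * FROM <snowflake_table_name> LIMIT 10;
```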
Finishing Up: Test the Connection
In Databricks, update the Delta table data by inserting a new row.
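For the example table created in Step 1, which has a single INT column c1, the insert could look like the sketch below. The value 42 is arbitrary and only serves as a marker you can look for later.

```sql
-- Databricks SQL: add one row to the UniForm-enabled Delta table.
INSERT INTO uc_catalog_name.uc_schema_name.uc_table_name VALUES (42);

-- Note the row count so you can compare it with what Snowflake reports.
SELECT COUNT(*) AS row_count
FROM uc_catalog_name.uc_schema_name.uc_table_name;
```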
If you previously enabled auto-refresh, the table will update automatically at the specified time interval. If you did not, you can manually refresh by running ALTER ICEBERG TABLE <snowflake_table_name> REFRESH.
Note: If you previously enabled auto-refresh, you cannot run the manual refresh command and will need to wait for the auto-refresh interval to elapse before the table refreshes.
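On the Snowflake side, the check could be sketched as follows, again with the <snowflake_table_name> placeholder. The first statement applies only to tables created without AUTO_REFRESH = TRUE; with auto-refresh enabled, skip it and run the query after the interval has passed.

```sql
-- Snowflake: manual refresh, only for tables without auto-refresh.
ALTER ICEBERG TABLE <snowflake_table_name> REFRESH;

-- The count should now match the count seen in Databricks after the insert.
SELECT COUNT(*) AS row_count FROM <snowflake_table_name>;
```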
Video Demo
If you would like a video tutorial, this video demonstrates how to bring these steps together to read Delta tables with UniForm in Snowflake.
We’re thrilled by continued support for the lakehouse architecture. Customers no longer have to duplicate data, reducing cost and complexity. This architecture also allows customers to choose the right tool for the right workload.
The key to an open lakehouse is storing your data in an open format such as Delta Lake or Iceberg. Proprietary formats lock customers into an engine, but open formats give you flexibility and portability. Whatever the platform, we encourage customers to always own their own data as the first step toward interoperability. In the coming months, we will continue to build features that help you manage an open data lakehouse with Unity Catalog.