Saturday, November 23, 2024

Unleash deeper insights with Amazon Redshift data sharing for data lake tables


Amazon Redshift has established itself as a highly scalable, fully managed cloud data warehouse trusted by tens of thousands of customers for its superior price-performance and advanced data analytics capabilities. Driven primarily by customer feedback, the product roadmap for Amazon Redshift is designed to make sure the service continuously evolves to meet the ever-changing needs of its users.

Over the years, this customer-centric approach has led to the introduction of groundbreaking features such as zero-ETL, data sharing, streaming ingestion, data lake integration, Amazon Redshift ML, Amazon Q generative SQL, and transactional data lake capabilities. The latest innovation in Amazon Redshift data sharing capabilities further enhances the service's flexibility and collaboration potential.

Amazon Redshift now allows the secure sharing of data lake tables (also known as external tables or Amazon Redshift Spectrum tables) that are managed in the AWS Glue Data Catalog, as well as Redshift views referencing those data lake tables. This brings the full breadth of shareable data into scope for analytics, allowing you to seamlessly share local tables and data lake tables across warehouses, accounts, and AWS Regions, without the overhead of physical data movement or recreating security policies for data lake tables and Redshift views on each warehouse.

By using granular access controls, data sharing in Amazon Redshift helps data owners maintain tight governance over who can access the shared data. In this post, we explore powerful use cases that demonstrate how you can enhance cross-team and cross-organizational collaboration, reduce overhead, and unlock new insights by using this data sharing functionality.

Overview of Amazon Redshift data sharing

Amazon Redshift data sharing lets you securely share your data with other Redshift warehouses, without having to copy or move the data.

Data shared between warehouses doesn't need to be physically copied or moved. Instead, the data stays in the original Redshift warehouse, and access is granted to other authorized users as part of a one-time setup. Data sharing provides granular access control, allowing you to control which specific tables or views are shared, and which users or services can access the shared data.

Because consumers access the shared data in place, they always see the latest state of the shared data. Data sharing even allows for the automatic sharing of new tables created after the datashare was established.
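For instance, automatic inclusion of future tables can be enabled per schema when adding it to a datashare. The following sketch assumes a datashare named mydatashare and a schema named public; both names are illustrative:

```sql
-- Add a schema to the datashare and automatically include tables created later
ALTER DATASHARE mydatashare ADD SCHEMA public;
ALTER DATASHARE mydatashare SET INCLUDENEW = TRUE FOR SCHEMA public;
```

With INCLUDENEW set, tables subsequently created in that schema become part of the datashare without further producer-side action.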

You can share data across different Redshift warehouses within or across AWS accounts, and you can also share data across AWS Regions. This lets you share data with partners, subsidiaries, or other parts of your organization, and enables the powerful workload isolation use case, as shown in the following diagram. With the seamless integration of Amazon Redshift with AWS Data Exchange, data can also be monetized and shared publicly, and public datasets such as census data can be added to a Redshift warehouse in just a few steps.
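As a sketch of the cross-account flow (all account IDs and names here are hypothetical), the producer grants usage on the datashare to the consumer account, and the consumer then creates a database from it:

```sql
-- On the producer warehouse
CREATE DATASHARE mydatashare;
GRANT USAGE ON DATASHARE mydatashare TO ACCOUNT '999988887777';

-- On the consumer warehouse (after the producer account administrator
-- authorizes the cross-account association)
CREATE DATABASE mydatashare_db FROM DATASHARE mydatashare
OF ACCOUNT '123456789012' NAMESPACE 'producer-namespace-guid';
```

Cross-account shares also require a one-time authorization step by the producer account administrator, performed through the console or API.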

Figure 1: Amazon Redshift data sharing between producer and consumer warehouses

The data sharing capabilities in Amazon Redshift also enable the implementation of a data mesh architecture, as shown in the following diagram. This helps democratize data across the organization by reducing barriers to accessing and using data across different business units and teams. For datasets with multiple authors, Amazon Redshift data sharing supports both read and write use cases (write in preview at the time of writing). This enables the creation of 360-degree datasets, such as a customer dataset that receives contributions from multiple Redshift warehouses across different business units in the organization.

Figure 2: Data mesh architecture using Amazon Redshift data sharing

Overview of Redshift Spectrum and data lake tables

In the modern data organization, the data lake has emerged as a centralized repository: a single source of truth where all data across the organization eventually resides at some point in its lifecycle. Redshift Spectrum enables seamless integration between the Redshift data warehouse and customers' data lakes, as shown in the following diagram. With Redshift Spectrum, you can run SQL queries directly against data stored in Amazon Simple Storage Service (Amazon S3), without the need to first load that data into a Redshift warehouse. This lets you maintain a comprehensive view of your data while optimizing for cost-efficiency.

Figure 3: Amazon Redshift bridges the data warehouse and data lake by enabling querying of data lake tables in-place

Redshift Spectrum supports a variety of open file formats, including Parquet, ORC, JSON, and CSV, as well as open table formats such as Apache Iceberg, all stored in Amazon S3. It runs these queries using a dedicated fleet of high-performance servers with low-latency connections to the S3 data lake. Data lake tables can be added to a Redshift warehouse either automatically through the Data Catalog, in the Amazon Redshift Query Editor, or manually using SQL commands.
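Manually mounting a Data Catalog database as an external schema can be sketched as follows (the database, role, and schema names are placeholders):

```sql
-- Mount an AWS Glue Data Catalog database as an external schema in Redshift
CREATE EXTERNAL SCHEMA myspectrum_schema
FROM DATA CATALOG
DATABASE 'myspectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```

The IAM role must grant Redshift access to both the Data Catalog and the underlying S3 locations.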

From a user experience standpoint, there is little difference between querying a local Redshift table and a data lake table. SQL queries can be reused verbatim to perform the same aggregations and transformations on data residing in the data lake, as shown in the following examples. Additionally, by using columnar file formats like Parquet and pushing down query predicates, you can achieve further performance improvements.

The following SQL is a sample query against local Redshift tables:

SELECT TOP 10 mylocal_schema.sales.eventid, SUM(mylocal_schema.sales.pricepaid) FROM mylocal_schema.sales, event
WHERE mylocal_schema.sales.eventid = event.eventid
AND mylocal_schema.sales.pricepaid > 30
GROUP BY mylocal_schema.sales.eventid
ORDER BY 2 DESC;

The following SQL is the same query, but against data lake tables:

SELECT TOP 10 myspectrum_schema.sales.eventid, SUM(myspectrum_schema.sales.pricepaid) FROM myspectrum_schema.sales, event
WHERE myspectrum_schema.sales.eventid = event.eventid
AND myspectrum_schema.sales.pricepaid > 30
GROUP BY myspectrum_schema.sales.eventid
ORDER BY 2 DESC;

To maintain strong data governance, Redshift Spectrum integrates with AWS Lake Formation, enabling the consistent application of security policies and access controls across both the Redshift data warehouse and the S3 data lake. When Lake Formation is used, Redshift producer warehouses first share their data with Lake Formation rather than directly with other Redshift consumer warehouses, and the data lake administrator grants fine-grained permissions for Redshift consumer warehouses to access the shared data. For more information, see Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation.

In the past, however, sharing data lake tables across Redshift warehouses presented challenges. It wasn't possible to do so without mounting the data lake tables on each individual Redshift warehouse and then recreating the associated security policies.

This barrier has now been addressed with the introduction of data sharing support for data lake tables. You can now share data lake tables just like any other table, using the built-in data sharing capabilities of Amazon Redshift. By combining the power of Redshift Spectrum data lake integration with the flexibility of Amazon Redshift data sharing, organizations can unlock new levels of cross-team collaboration and insight, while maintaining strong data governance and security controls.

For more information about Redshift Spectrum, see Getting started with Amazon Redshift Spectrum.

Solution overview

In this post, we describe how to add data lake tables or views to a Redshift datashare, covering two key use cases:

  • Adding a late-binding view or materialized view that references a data lake table to a producer datashare
  • Adding a data lake table directly to a producer datashare

The first use case provides greater flexibility and convenience. Consumers can query the shared view without having to configure fine-grained permissions. The configuration, such as defining permissions on data stored in Amazon S3 with Lake Formation, is already handled on the producer side. You only need to add the view to the producer datashare one time, making it a convenient option for both the producer and the consumer.

An additional benefit of this approach is that you can add views to a datashare that join data lake tables with local Redshift tables. When these views are shared, you can keep the trusted business logic entirely on the producer side.
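For example, a late-binding view on the producer might join a data lake table with a local dimension table. The table and column names below are illustrative:

```sql
-- Join a data lake (Spectrum) table with a local Redshift table behind one view
CREATE VIEW mylocal_schema.sales_with_events AS
SELECT s.eventid, e.eventname, SUM(s.pricepaid) AS total_paid
FROM myspectrum_schema.sales s
JOIN mylocal_schema.event e ON s.eventid = e.eventid
GROUP BY s.eventid, e.eventname
WITH NO SCHEMA BINDING;
```

Consumers of the datashare see only the view; the join logic and the underlying permissions stay on the producer.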

Alternatively, you can add data lake tables directly to a datashare. In this case, consumers can query the data lake tables directly or join them with their own local tables, allowing them to add their own conditional logic as needed.

Add a view that references a data lake table to a Redshift datashare

When you create data lake tables that you intend to add to a datashare, the recommended and most common way to do this is to add a view that references one or more data lake tables to the datashare. There are three high-level steps involved:

  1. Add the Redshift view's schema (the local schema) to the Redshift datashare.
  2. Add the Redshift view (the local view) to the Redshift datashare.
  3. Add the Redshift external schemas (for the tables referenced by the Redshift view) to the Redshift datashare.

The following diagram illustrates the full workflow.

Figure 4: Sharing data lake tables via Amazon Redshift views

The workflow consists of the following steps:

  1. Create a data lake table on the datashare producer. For more information on creating Redshift Spectrum objects, see External schemas for Amazon Redshift Spectrum. Data lake tables to be shared can include Lake Formation registered tables and Data Catalog tables, and if you're using the Redshift Query Editor, these tables are automatically mounted.
  2. Create a view on the producer that references the data lake table that you created.
  3. Create a datashare, if one doesn't already exist, and add objects to your datashare, including the view you created that references the data lake table. For more information, see Creating datashares and adding objects (preview).
  4. Add the external schema of the base Redshift table to the datashare (this is true of both local base tables and data lake tables). You don't have to add the data lake table itself to the datashare.
  5. On the consumer, the administrator makes the view available to consumer database users.
  6. Consumer database users can write queries to retrieve data from the shared view and join it with other tables and views on the consumer.

After these steps are complete, database consumer users with access to the datashare views can reference them in their SQL queries. The following SQL queries are examples for achieving the preceding steps.

Create a data lake table on the producer warehouse:

CREATE EXTERNAL TABLE myspectrum_db.myspectrum_schema.test (c1 INT)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/myfolder/';

Create a view on the producer warehouse:

CREATE VIEW mylocal_db.mylocal_schema.myspectrumview AS SELECT c1 FROM myspectrum_db.myspectrum_schema.test
WITH NO SCHEMA BINDING;

Add the view to the datashare on the producer warehouse:

ALTER DATASHARE mydatashare ADD SCHEMA mylocal_db.mylocal_schema;
ALTER DATASHARE mydatashare ADD VIEW myspectrumview;
ALTER DATASHARE mydatashare ADD SCHEMA myspectrum_db.myspectrum_schema;

Create a database from the datashare and grant permissions for the view on the consumer warehouse:

CREATE DATABASE myspectrum_db FROM DATASHARE myspectrumproducer OF ACCOUNT '123456789012' NAMESPACE 'p1234567-8765-4321-p10987654321';
GRANT USAGE ON DATABASE myspectrum_db TO usernames;
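Once the grant is in place, a consumer database user can query the shared view directly; for example (assuming the shared local schema surfaces on the consumer under the datashare database):

```sql
-- Query the shared view from the consumer warehouse
SELECT c1
FROM myspectrum_db.mylocal_schema.myspectrumview
LIMIT 10;
```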

Add a data lake table directly to a Redshift datashare

Adding a data lake table to a datashare is similar to adding a view. This process works well when consumers want the raw data from the data lake table and want to write queries and join it with tables in their own data warehouse. There are two high-level steps involved:

  1. Add the Redshift external schemas (of the data lake tables to be shared) to the Redshift datashare.
  2. Add the data lake table (the Redshift external table) to the Redshift datashare.

The following diagram illustrates the full workflow.

Figure 5: Sharing data lake tables directly in an Amazon Redshift datashare

The workflow consists of the following steps:

  1. Create a data lake table on the datashare producer.
  2. Add objects to your datashare, including the data lake table you created. In this case, you don't have any abstraction over the table.
  3. On the consumer, the administrator makes the table available.
  4. Consumer database users can write queries to retrieve data from the shared table and join it with other tables and views on the consumer.

The following SQL queries are examples for achieving the preceding producer steps.

Create a data lake table on the producer warehouse:

CREATE EXTERNAL TABLE myspectrum_db.myspectrum_schema.test (c1 INT)
STORED AS PARQUET
LOCATION 's3://amzn-s3-demo-bucket/myfolder/';

Add the data lake schema and table directly to the datashare on the producer warehouse:

ALTER DATASHARE mydatashare ADD SCHEMA myspectrum_db.myspectrum_schema;
ALTER DATASHARE mydatashare ADD TABLE myspectrum_db.myspectrum_schema.test;

Create a database from the datashare and grant permissions for the table on the consumer warehouse:

CREATE DATABASE myspectrum_db FROM DATASHARE myspectrumproducer OF ACCOUNT '123456789012' NAMESPACE 'p1234567-8765-4321-p10987654321';
GRANT USAGE ON DATABASE myspectrum_db TO usernames;
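A consumer database user can then query the shared data lake table and join it with local tables; for instance (the local lookup table here is hypothetical):

```sql
-- Join the shared data lake table with a local consumer table
SELECT t.c1, l.description
FROM myspectrum_db.myspectrum_schema.test t
JOIN mylocal_schema.lookup l ON t.c1 = l.id;
```

Because the shared table is queried in place, the consumer always sees the current contents of the underlying S3 data.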

Security considerations for sharing data lake tables and views

Data lake tables are stored outside of Amazon Redshift, in the data lake, and may not be owned by the Redshift warehouse, but they are still referenced within Amazon Redshift. This setup requires special security considerations. Data lake tables operate under the security and governance of both Amazon Redshift and the data lake. For Lake Formation registered tables specifically, the Amazon S3 resources are secured by Lake Formation and made available to consumers using the provided credentials.

The owner of the data in the data lake tables may want to impose restrictions on which external objects can be added to a datashare. To give data owners more control over whether warehouse users can share data lake tables, you can use session tags in AWS Identity and Access Management (IAM). These tags provide additional context about the user running the queries. For more details on tagging resources, refer to Tags for AWS Identity and Access Management resources.

Audit considerations for sharing data lake tables and views

When sharing data lake objects through a datashare, there are specific logging considerations to keep in mind:

  • Access controls – You can also use CloudTrail log data along with IAM policies to control access to shared tables, covering both Redshift datashare producers and consumers. The CloudTrail logs record details about who accesses shared tables. The identifiers in the log data are available in the ExternalId field under the AssumeRole CloudTrail logs. The data owner can configure additional limitations on data access in an IAM policy through actions. For more information about defining data access through policies, see Access to AWS accounts owned by third parties.
  • Centralized access – Amazon S3 resources such as data lake tables can be registered and centrally managed with Lake Formation. Once they're registered with Lake Formation, Amazon S3 resources are secured and governed by the associated Lake Formation policies and made available using the credentials provided by Lake Formation.

Billing considerations for sharing data lake tables and views

The billing model for Redshift Spectrum differs between Amazon Redshift provisioned and serverless warehouses. For provisioned warehouses, Redshift Spectrum queries (queries involving data lake tables) are billed based on the amount of data scanned during query execution. For serverless warehouses, data lake queries are billed the same as non-data-lake queries. Storage for data lake tables is always billed to the AWS account associated with the Amazon S3 data.

In the case of datashares involving data lake tables, costs for storing and scanning data lake objects in a datashare are attributed as follows:

  • When a consumer queries shared objects from a data lake, the cost of scanning is billed to the consumer:
    • When the consumer is a provisioned warehouse, Amazon Redshift uses Redshift Spectrum to scan the Amazon S3 data. Therefore, the Redshift Spectrum cost is billed to the consumer account.
    • When the consumer is an Amazon Redshift Serverless workgroup, there is no separate charge for data lake queries.
  • Amazon S3 costs for storage and operations, such as listing buckets, are billed to the account that owns each S3 bucket.

For detailed information on Redshift Spectrum billing, refer to Amazon Redshift pricing and Billing for storage.

Conclusion

In this post, we explored how the enhanced data sharing capabilities of Amazon Redshift, including support for sharing data lake tables and Redshift views that reference those data lake tables, empower organizations to unlock the full potential of their data by bringing the full breadth of data assets in scope for advanced analytics. Organizations are now able to seamlessly share local tables and data lake tables across warehouses, accounts, and Regions.

We outlined the steps to securely share data lake tables and views that reference those data lake tables across Redshift warehouses, even those in separate AWS accounts or Regions. Additionally, we covered some considerations and best practices to keep in mind when using this feature.

Sharing data lake tables and views through Amazon Redshift data sharing advances the modern, data-driven organization's goal of democratizing data access in a secure, scalable, and efficient manner. By eliminating the need for physical data movement or duplication, this capability reduces overhead and enables seamless cross-team and cross-organizational collaboration. Unleashing the full potential of your data analytics to span the full breadth of your local tables and data lake tables is just a few steps away.

For more information on Amazon Redshift data sharing and how it can benefit your organization, refer to the following resources:

Please also reach out to your AWS technical account manager or AWS account Solutions Architect. They will be happy to provide additional guidance and support.


About the Authors

Mohammed Alkateb is an Engineering Manager at Amazon Redshift. Prior to joining Amazon, Mohammed had 12 years of industry experience in query optimization and database internals as an individual contributor and engineering manager. Mohammed has 18 US patents, and he has publications in the research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD, and VLDB. Mohammed holds a PhD in Computer Science from the University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.

Ramchandra Anil Kulkarni is a software development engineer who has been with Amazon Redshift for over 4 years. He is driven to develop database innovations that serve AWS customers globally. Kulkarni's long-standing tenure and dedication to the Amazon Redshift service demonstrate his deep expertise and commitment to delivering cutting-edge database solutions that empower AWS customers worldwide.

Mark Lyons is a Principal Product Manager on the Amazon Redshift team. He works at the intersection of data lakes and data warehouses. Prior to joining AWS, Mark held product leadership roles with Dremio and Vertica. He is passionate about data analytics and empowering customers to change the world with their data.

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.
