AWS Lake Formation makes it straightforward to centrally govern, secure, and globally share data for analytics and machine learning (ML).
With Lake Formation, you can centralize data security and governance using the AWS Glue Data Catalog, letting you manage metadata and data permissions in one place with familiar database-style features. It also delivers fine-grained data access control, so you can make sure users have access to the right data down to the row and column level.
Lake Formation also makes it straightforward to share data internally across your organization and externally, which lets you create a data mesh or meet other data sharing needs with no data movement.
Additionally, because Lake Formation tracks data interactions by role and user, it provides comprehensive data access auditing to verify the right data was accessed by the right users at the right time.
In this two-part series, we show how to integrate custom applications or data processing engines with Lake Formation using the third-party services integration feature.
In this post, we dive deep into the required Lake Formation and AWS Glue APIs. We walk through the steps to enforce Lake Formation policies within custom data applications. As an example, we present a sample Lake Formation integrated application implemented using AWS Lambda.
The second part of the series introduces a sample web application built with AWS Amplify. This web application showcases how to use the custom data processing engine implemented in the first post.
By the end of this series, you will have a comprehensive understanding of how to extend the capabilities of Lake Formation by building and integrating your own custom data processing components.
Integrate an external application
The process of integrating a third-party application with Lake Formation is described in detail in How Lake Formation application integration works.
In this section, we dive deeper into the steps required to establish trust between Lake Formation and an external application, the API operations that are involved, and the AWS Identity and Access Management (IAM) permissions that must be set up to enable the integration.
Lake Formation application integration external data filtering
In Lake Formation, it's possible to control which third-party engines or applications are allowed to read and filter data in Amazon Simple Storage Service (Amazon S3) locations registered with Lake Formation.
To do so, navigate to the Application integration settings page on the Lake Formation console and enable Allow external engines to filter data in Amazon S3 locations registered with Lake Formation, specifying the AWS account IDs from which third-party engines are allowed to access locations registered with Lake Formation. In addition, you have to specify the allowed session tag values that identify trusted requests. We discuss in later sections how these tags are used.
Lake Formation application integration involved AWS APIs
The following is a list of the main AWS APIs needed to integrate an application with Lake Formation:
- sts:AssumeRole – Returns a set of temporary security credentials that you can use to access AWS resources.
- glue:GetUnfilteredTableMetadata – Allows a third-party analytical engine to retrieve unfiltered table metadata from the Data Catalog.
- glue:GetUnfilteredPartitionsMetadata – Retrieves partition metadata from the Data Catalog that contains unfiltered metadata.
- lakeformation:GetTemporaryGlueTableCredentials – Allows a caller in a secure environment to assume a role with permission to access Amazon S3. To vend such credentials, Lake Formation assumes the role associated with a registered location, for example an S3 bucket, with a scope down policy that restricts the access to a single prefix.
- lakeformation:GetTemporaryGluePartitionCredentials – This API is identical to GetTemporaryGlueTableCredentials except that it's used when the target Data Catalog resource is of type Partition. Lake Formation restricts the permission of the vended credentials with the same scope down policy that restricts access to a single Amazon S3 prefix.
Later in this post, we present a sample architecture illustrating how you can use these APIs.
External application and IAM roles to access data
For an external application to access resources in a Lake Formation environment, it needs to run under an IAM principal (user or role) with the appropriate credentials. Let's consider a scenario where the external application runs under the IAM role MyApplicationRole that is part of the AWS account 123456789012.
In Lake Formation, you have granted access to various tables and databases to two specific IAM roles:
- AccessRole1
- AccessRole2
To allow MyApplicationRole to access the resources that have been granted to AccessRole1 and AccessRole2, you need to configure the trust relationships for these access roles. Specifically, you need to configure the following:
- Allow MyApplicationRole to assume each of the access roles (AccessRole1 and AccessRole2) using sts:AssumeRole.
- Allow MyApplicationRole to tag the assumed session with a specific tag, which is required by Lake Formation. The tag key should be LakeFormationAuthorizedCaller, and the value should match one of the session tag values specified on the Application integration settings page on the Lake Formation console (for example, "application1").
The following code is an example of the trust relationships configuration for an access role (AccessRole1 or AccessRole2):
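As an illustration, a trust policy with the two statements described above can be generated with a short Python helper. This is a sketch under stated assumptions rather than the exact policy from the original post: the helper name build_trust_policy is hypothetical, and the use of a StringEquals condition on aws:RequestTag/LakeFormationAuthorizedCaller is one reasonable way to restrict the session tag value.

```python
import json


def build_trust_policy(account_id: str, application_role: str, session_tag_value: str) -> dict:
    """Return a trust policy letting the application role assume and tag a session."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:role/{application_role}"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Allow the application role to assume this access role.
            {"Effect": "Allow", "Principal": principal, "Action": "sts:AssumeRole"},
            # Allow tagging the session, but only with the authorized tag value
            # (the choice of StringEquals is an assumption of this sketch).
            {
                "Effect": "Allow",
                "Principal": principal,
                "Action": "sts:TagSession",
                "Condition": {
                    "StringEquals": {
                        "aws:RequestTag/LakeFormationAuthorizedCaller": session_tag_value
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy("123456789012", "MyApplicationRole", "application1")
    print(json.dumps(policy, indent=2))
```

The printed JSON document can be pasted into the trust relationships editor of AccessRole1 or AccessRole2 after adjusting the account ID and tag value.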
Additionally, the data access IAM roles (AccessRole1 and AccessRole2) must have the following IAM permissions assigned in order to read Lake Formation protected tables:
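The exact permission list from the original post is not reproduced here. As a hedged sketch, a read-only identity policy for the access roles might look like the one generated below. The action list is an assumption: lakeformation:GetDataAccess is what Lake Formation credential vending relies on, and the AWS Glue read actions are a commonly used set for Data Catalog access. The helper name build_access_role_policy is hypothetical.

```python
import json

# Assumed read-only action set; adjust it to your own requirements.
READ_ACTIONS = [
    "lakeformation:GetDataAccess",  # used by Lake Formation credential vending
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
]


def build_access_role_policy(actions=READ_ACTIONS) -> dict:
    """Return an identity policy granting the given read actions on all resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_access_role_policy(), indent=2))
```

Fine-grained access is still decided by the Lake Formation grants; this IAM policy only lets the role call the APIs.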
Solution overview
For our solution, Lambda serves as our external trusted engine and application integrated with Lake Formation. This example is provided so you can understand and see in action the access flow and the Lake Formation API responses. Because it's based on a single Lambda function, it's not meant to be used in production settings or with high volumes of data.
Moreover, the Lambda based engine has been configured to support a limited set of data files (CSV, Parquet, and JSON), a limited set of table configurations (no nested data), and a limited set of table operations (SELECT only). Due to these limitations, the application should not be used for arbitrary tests.
In this post, we provide instructions on how to deploy a sample API application integrated with Lake Formation that implements the solution architecture. The core of the API is implemented with a Python Lambda function. We also show how to test the function with Lambda tests. In the second post in this series, we provide instructions on how to deploy a web frontend application that integrates with this Lambda function.
Access flow for unpartitioned tables
The following diagram summarizes the access flow when accessing unpartitioned tables.
The workflow consists of the following steps:
1. User A (authenticated with Amazon Cognito or other equivalent systems) sends a request to the application API endpoint, requesting access to a specific table inside a specific database.
2. The API endpoint, created with AWS AppSync, handles the request, invoking a Lambda function.
3. The function checks which IAM data access role the user is mapped to. For simplicity, the example uses a static hardcoded mapping (mappings={ "user1": "lf-app-access-role-1", "user2": "lf-app-access-role-2"}).
4. The function invokes the sts:AssumeRole API to assume the user-related IAM data access role (for example, lf-app-access-role-1). The AssumeRole operation is performed with the tag LakeFormationAuthorizedCaller, having as its value one of the session tag values specified when configuring the application integration settings in Lake Formation (for example, {'Key': 'LakeFormationAuthorizedCaller','Value': 'application1'}). The API returns a set of temporary credentials, which we refer to as StsCredentials1.
5. Using StsCredentials1, the function invokes the glue:GetUnfilteredTableMetadata API, passing the requested database and table name. The API returns information like table location, a list of authorized columns, and data filters, if defined.
6. Using StsCredentials1, the function invokes the lakeformation:GetTemporaryGlueTableCredentials API, passing the requested database and table name, the type of requested access (SELECT), and CELL_FILTER_PERMISSION as the supported permission types (because the Lambda function implements logic to apply row-level filters). The API returns a set of temporary Amazon S3 credentials, which we refer to as S3Credentials1.
7. Using S3Credentials1, the function lists the S3 files stored in the table location S3 prefix and downloads them.
8. The retrieved Amazon S3 data is filtered to remove those columns and rows that the user is not allowed access to (authorized columns and row filters were retrieved in Step 5) and authorized data is returned to the user.
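Steps 3 to 6 and the column part of step 8 can be sketched in Python as follows. This is an illustrative outline, not the sample application's actual code. The clients are passed in as parameters and are assumed to be boto3 clients (the STS client created with the Lambda role, the AWS Glue and Lake Formation clients created with StsCredentials1). The function names, the session name, and the decision to leave row-filter evaluation out are assumptions of this sketch.

```python
from typing import Any, Dict, List, Tuple

# Static user-to-role mapping, as described in step 3.
MAPPINGS = {"user1": "lf-app-access-role-1", "user2": "lf-app-access-role-2"}
SUPPORTED_PERMISSION_TYPES = ["CELL_FILTER_PERMISSION"]


def assume_access_role(sts_client: Any, account_id: str, user: str,
                       tag_value: str = "application1") -> Dict[str, Any]:
    """Step 4: assume the user's data access role with the Lake Formation session tag."""
    response = sts_client.assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{MAPPINGS[user]}",
        RoleSessionName=f"lf-app-{user}",  # hypothetical session name
        Tags=[{"Key": "LakeFormationAuthorizedCaller", "Value": tag_value}],
    )
    return response["Credentials"]  # referred to as StsCredentials1 in the text


def get_table_access(glue_client: Any, lf_client: Any, region: str, account_id: str,
                     database: str, table: str) -> Tuple[dict, dict]:
    """Steps 5 and 6: fetch unfiltered table metadata, then vend S3 credentials."""
    metadata = glue_client.get_unfiltered_table_metadata(
        CatalogId=account_id,
        DatabaseName=database,
        Name=table,
        SupportedPermissionTypes=SUPPORTED_PERMISSION_TYPES,
    )
    s3_credentials = lf_client.get_temporary_glue_table_credentials(
        TableArn=f"arn:aws:glue:{region}:{account_id}:table/{database}/{table}",
        Permissions=["SELECT"],
        SupportedPermissionTypes=SUPPORTED_PERMISSION_TYPES,
    )
    return metadata, s3_credentials  # s3_credentials is S3Credentials1


def project_columns(rows: List[dict], authorized_columns: List[str]) -> List[dict]:
    """Part of step 8: keep only the authorized columns of each record."""
    allowed = set(authorized_columns)
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]
```

The list passed to project_columns would come from the AuthorizedColumns field of the metadata response. Applying the row filter expressions returned with the metadata is deliberately left out, because evaluating them requires a predicate parser.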
Access flow for partitioned tables
The following diagram summarizes the access flow when accessing partitioned tables.
The steps involved are almost identical to the ones presented for unpartitioned tables, with the following changes:
- After invoking the glue:GetUnfilteredTableMetadata API (Step 5) and identifying the table as partitioned, the Lambda function invokes the glue:GetUnfilteredPartitionsMetadata API using StsCredentials1 (Step 6). The API returns, in addition to other information, the list of partition values and locations.
- For each partition, the function performs the following actions:
  - Invokes the lakeformation:GetTemporaryGluePartitionCredentials API (Step 7), passing the requested database and table name, the partition value, the type of requested access (SELECT), and CELL_FILTER_PERMISSION as the supported permissions type (because the Lambda function implements logic to apply row-level filters). The API returns a set of temporary Amazon S3 credentials, which we refer to as S3CredentialsPartitionX.
  - Uses S3CredentialsPartitionX to list the partition location S3 files and download them (Step 8).
- The function appends the retrieved data.
- Before the Lambda function returns the results to the user (Step 9), the retrieved Amazon S3 data is filtered to remove those columns and rows that the user is not allowed access to (authorized columns and row filters were retrieved in Step 5).
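The per-partition loop can be sketched as follows. As before, this is a hedged outline and not the sample application's code: glue_client and lf_client are assumed to be boto3 clients created with StsCredentials1, and download is a hypothetical callback that reads the files under a partition location with the vended credentials and returns a list of records.

```python
from typing import Any, Callable, List


def read_partitioned_table(glue_client: Any, lf_client: Any,
                           download: Callable[[str, dict], List[dict]],
                           region: str, account_id: str,
                           database: str, table: str) -> List[dict]:
    """Vend credentials per partition, download each partition, and append the data."""
    table_arn = f"arn:aws:glue:{region}:{account_id}:table/{database}/{table}"
    rows: List[dict] = []
    kwargs = {
        "CatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "SupportedPermissionTypes": ["CELL_FILTER_PERMISSION"],
    }
    while True:
        # Step 6: list partition values and locations (paginated).
        page = glue_client.get_unfiltered_partitions_metadata(**kwargs)
        for item in page.get("UnfilteredPartitions", []):
            partition = item["Partition"]
            # Step 7: temporary S3 credentials scoped to this partition.
            credentials = lf_client.get_temporary_glue_partition_credentials(
                TableArn=table_arn,
                Partition={"Values": partition["Values"]},
                Permissions=["SELECT"],
                SupportedPermissionTypes=["CELL_FILTER_PERMISSION"],
            )
            # Step 8: download the partition files and append the records.
            rows.extend(download(partition["StorageDescriptor"]["Location"], credentials))
        if not page.get("NextToken"):
            return rows
        kwargs["NextToken"] = page["NextToken"]
```

Column and row filtering is applied afterward on the combined records, as in the unpartitioned flow.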
Prerequisites
The following prerequisites are needed to deploy and test the solution:
- Lake Formation should be enabled in the AWS Region where the sample application will be deployed
- The steps must be run with an IAM principal with sufficient permissions to create the needed resources, including Lake Formation databases and tables
Deploy solution resources with AWS CloudFormation
We create the solution resources using AWS CloudFormation. The provided CloudFormation template creates the following resources:
- One S3 bucket to store table data (lf-app-data-<account-id>)
- Two IAM roles, which will be mapped to client users and their associated Lake Formation permission policies (lf-app-access-role-1 and lf-app-access-role-2)
- Two IAM roles used for the two created Lambda functions (lf-app-lambda-datalake-population-role and lf-app-lambda-role)
- One AWS Glue database (lf-app-entities) with two AWS Glue tables, one unpartitioned (users_tbl) and one partitioned (users_partitioned_tbl)
- One Lambda function used to populate the data lake data (lf-app-lambda-datalake-population)
- One Lambda function used for the Lake Formation integrated application (lf-app-lambda-engine)
- One IAM role used by Lake Formation to access the table data and perform credentials vending (lf-app-datalake-location-role)
- One Lake Formation data lake location (s3://lf-app-data-<account-id>/datasets) associated with the IAM role created for credentials vending (lf-app-datalake-location-role)
- One Lake Formation data filter (lf-app-filter-1)
- One Lake Formation tag (key: sensitive, values: true or false)
- Tag associations to tag the created unpartitioned AWS Glue table (users_tbl) columns with the created tag
To launch the stack and provision your resources, complete the following steps:
- Download the code zip bundle for the Lambda function used for the Lake Formation integrated application (lf-integrated-app.zip).
- Download the code zip bundle for the Lambda function used to populate the data lake data (datalake-population-function.zip).
- Upload the zip bundles to an existing S3 bucket location (for example, s3://mybucket/myfolder1/myfolder2/lf-integrated-app.zip and s3://mybucket/myfolder1/myfolder2/datalake-population-function.zip).
- Choose Launch Stack.
This automatically launches AWS CloudFormation in your AWS account with a template. Make sure that you create the stack in your intended Region.
- Choose Next to move to the Specify stack details section.
- For Parameters, provide the following parameters:
  - For powertoolsLogLevel, specify how verbose the Lambda function logger should be, from the most verbose to the least verbose (no logs). For this post, we choose DEBUG.
  - For s3DeploymentBucketName, enter the name of the S3 bucket containing the Lambda functions' code zip bundles. For this post, we use mybucket.
  - For s3KeyLambdaDataPopulationCode, enter the Amazon S3 location containing the code zip bundle for the Lambda function used to populate the data lake data (datalake-population-function.zip). For example, myfolder1/myfolder2/datalake-population-function.zip.
  - For s3KeyLambdaEngineCode, enter the Amazon S3 location containing the code zip bundle for the Lambda function used for the Lake Formation integrated application (lf-integrated-app.zip). For example, myfolder1/myfolder2/lf-integrated-app.zip.
- Choose Next.
- Add additional AWS tags if required.
- Choose Next.
- Acknowledge the final requirements.
- Choose Create stack.
Enable the Lake Formation application integration
Complete the following steps to enable the Lake Formation application integration:
- On the Lake Formation console, choose Application integration settings in the navigation pane.
- Enable Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
- For Session tag values, choose application1.
- For AWS account IDs, enter the current AWS account ID.
- Choose Save.
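The same settings can also be applied programmatically. The following sketch assumes lf_client is a boto3 Lake Formation client and uses its get_data_lake_settings and put_data_lake_settings operations. The function name enable_external_filtering is hypothetical, and writing back the settings exactly as they were returned is an assumption you should verify in your account.

```python
from typing import Any, Iterable


def enable_external_filtering(lf_client: Any, account_ids: Iterable[str],
                              session_tag_values: Iterable[str]) -> dict:
    """Turn on external data filtering for the given accounts and session tag values."""
    # Start from the current settings so unrelated fields (for example, admins) are kept.
    settings = dict(lf_client.get_data_lake_settings()["DataLakeSettings"])
    settings["AllowExternalDataFiltering"] = True
    settings["ExternalDataFilteringAllowList"] = [
        {"DataLakePrincipalIdentifier": account_id} for account_id in account_ids
    ]
    settings["AuthorizedSessionTagValueList"] = list(session_tag_values)
    lf_client.put_data_lake_settings(DataLakeSettings=settings)
    return settings
```

For this post, the call would be enable_external_filtering(lf_client, ["<current-account-id>"], ["application1"]).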
Enforce Lake Formation permissions
The CloudFormation stack created one database named lf-app-entities with two tables named users_tbl and users_partitioned_tbl.
To be sure you're using Lake Formation permissions, you should confirm that you don't have any grants set up on those tables for the principal IAMAllowedPrincipals. The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies, and it's used to maintain backward compatibility with AWS Glue.
To confirm Lake Formation permissions are enforced, navigate to the Lake Formation console and choose Data lake permissions in the navigation pane. Filter permissions by Database=lf-app-entities and remove all the permissions given to the principal IAMAllowedPrincipals.
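If you prefer to script this cleanup, the following sketch lists the grants of the IAMAllowedPrincipals group and revokes those that reference the database. It assumes lf_client is a boto3 Lake Formation client and that the group is addressed with the identifier IAM_ALLOWED_PRINCIPALS. The helper names are hypothetical, and passing each returned resource back to revoke_permissions unchanged is an assumption: some entries (for example, table wildcards) may need adjusting, so test it on a non-production account first.

```python
from typing import Any

IAM_ALLOWED_PRINCIPALS = {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"}


def references_database(resource: dict, database: str) -> bool:
    """Return True if a Lake Formation resource dict points at the given database."""
    if resource.get("Database", {}).get("Name") == database:
        return True
    return any(
        resource.get(key, {}).get("DatabaseName") == database
        for key in ("Table", "TableWithColumns")
    )


def revoke_iam_allowed_principals(lf_client: Any, database: str) -> int:
    """Revoke every IAMAllowedPrincipals grant found on the database and its tables."""
    revoked = 0
    kwargs = {"Principal": IAM_ALLOWED_PRINCIPALS}
    while True:
        page = lf_client.list_permissions(**kwargs)
        for entry in page.get("PrincipalResourcePermissions", []):
            if references_database(entry["Resource"], database):
                lf_client.revoke_permissions(
                    Principal=IAM_ALLOWED_PRINCIPALS,
                    Resource=entry["Resource"],
                    Permissions=entry["Permissions"],
                )
                revoked += 1
        if not page.get("NextToken"):
            return revoked
        kwargs["NextToken"] = page["NextToken"]
```

The function returns the number of grants it revoked, which makes it easy to confirm that a second run finds nothing left to remove.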
For more details on IAMAllowedPrincipals and backward compatibility with AWS Glue, refer to Changing the default security settings for your data lake.
Check the created Lake Formation resources and permissions
The CloudFormation stack created two IAM roles (lf-app-access-role-1 and lf-app-access-role-2) and assigned them different permissions on the users_tbl (unpartitioned) and users_partitioned_tbl (partitioned) tables. The specific Lake Formation grants are summarized in the following table. Both tables belong to the lf-app-entities database.
IAM Role | users_tbl (Table) | users_partitioned_tbl (Table) |
lf-app-access-role-1 | No access | Read access on columns uid, state, and city for all the records. Read access to all columns except for address only on rows with value state=uk. |
lf-app-access-role-2 | Read access on columns with the tag sensitive = false | Read access to all columns and rows. |
To better understand the full permissions setup, you should review the Lake Formation resources and permissions created by CloudFormation. On the Lake Formation console, complete the following steps:
- Review the data filters:
  - Choose Data filters in the navigation pane.
  - Inspect the lf-app-filter-1 filter.
- Review the tags:
  - Choose LF-Tags and permissions in the navigation pane.
  - Inspect the sensitive tag.
- Review the tag associations:
  - Choose Tables in the navigation pane.
  - Choose the users_tbl table.
  - Inspect the LF-Tags associated with the different columns in the Schema section.
- Review the Lake Formation permissions:
  - Choose Data lake permissions in the navigation pane.
  - Filter by Principal = lf-app-access-role-1 and inspect the assigned permissions.
  - Filter by Principal = lf-app-access-role-2 and inspect the assigned permissions.
Test the Lambda function
The Lambda function created by the CloudFormation template accepts JSON objects as input events. The JSON events have the following structure:
Although the id field is always needed in order to identify the calling identity, depending on the requested operation (fieldName), different arguments should be provided. The following table lists these arguments.
Operation | Description | Needed Arguments | Output |
getDbs | List databases | No arguments needed | List of databases the user has access to |
getTablesByDb | List tables | db: <db_name> | List of tables inside a database the user has access to |
getUnfilteredTableMetadata | Return the table metadata | | Returns the output of the glue:GetUnfilteredTableMetadata API |
getUnfilteredPartitionsMetadata | Return the table partitions metadata | | Returns the output of the glue:GetUnfilteredPartitionsMetadata API |
getTableData | Get table data | | |
To test the Lambda function, you can create some sample Lambda test events. Complete the following steps:
- On the Lambda console, choose Functions in the navigation pane.
- Choose the lf-app-lambda-engine function.
- On the Test tab, select Create new event.
- For Event JSON, enter a valid JSON (we provide some sample JSON events).
- Choose Test.
- Check the test results (JSON response).
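The same test can be run outside the console. The following sketch assumes lambda_client is a boto3 Lambda client and treats the event as an opaque dictionary that you build according to the event structure described earlier. The helper name invoke_engine is hypothetical.

```python
import json
from typing import Any


def invoke_engine(lambda_client: Any, event: dict,
                  function_name: str = "lf-app-lambda-engine") -> Any:
    """Invoke the engine function synchronously and return its decoded JSON response."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result
        Payload=json.dumps(event).encode("utf-8"),
    )
    # The response payload is a stream; read it fully and parse the JSON body.
    return json.loads(response["Payload"].read())
```

If the function raises an error, the decoded payload contains the error details instead of the expected result, so check it before using the data.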
The following are some sample test events you can try to see how different identities can access different sets of information.
user1 | user2 |
For example, in the following test, we request users_partitioned_tbl table data in the context of user1:
The following is the related API response:
To troubleshoot the Lambda function, you can navigate to the Monitoring tab, choose View CloudWatch logs, and inspect the latest log stream.
Clean up
If you plan to explore Part 2 of this series, you can skip this part, because you will need the resources created here. You can refer to this section at the end of your testing.
Complete the following steps to remove the resources you created following this post and avoid incurring additional costs:
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Choose the stack you created and choose Delete.
Additional considerations
In the proposed architecture, Lake Formation permissions were granted to specific IAM data access roles that requesting users (for example, the id field) were mapped to. Another possibility is to assign permissions in Lake Formation to SAML users and groups and then work with the AssumeDecoratedRoleWithSAML API.
Conclusion
In the first part of this series, we explored how to integrate custom applications and data processing engines with Lake Formation. We delved into the required configuration, APIs, and steps to enforce Lake Formation policies within custom data applications. As an example, we presented a sample Lake Formation integrated application built on Lambda.
The information provided in this post can serve as a foundation for developing your own custom applications or data processing engines that need to operate on a Lake Formation protected data lake.
Refer to the second part of this series to see how to build a sample web application that uses the Lambda based Lake Formation application.
About the Authors
Stefano Sandonà is a Senior Big Data Specialist Solution Architect at AWS. Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data platforms.
Francesco Marelli is a Principal Solutions Architect at AWS. He specializes in the design, implementation, and optimization of large-scale data platforms. Francesco leads the AWS Solution Architect (SA) analytics team in Italy. He loves sharing his professional knowledge and is a frequent speaker at AWS events. Francesco is also passionate about music.