AWS Big Data Blog
Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation
Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Many organizations have distributed tools and infrastructure across various business units. As a result, data ends up spread across many data warehouses and data lakes, built on a modern data architecture, in separate AWS accounts.
Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another. Customers want to be able to manage their permissions in a central place across all of their assets. Previously, the management of Redshift datashares was limited to only within Amazon Redshift, which made it difficult to manage your data lake permissions and Amazon Redshift permissions in a single place. For example, you had to navigate to an individual account to view and manage access information for Amazon Redshift and the data lake on Amazon Simple Storage Service (Amazon S3). As an organization grows, administrators want a mechanism to effectively and centrally manage data sharing across data lakes and data warehouses for governance and auditing, and to enforce fine-grained access control.
We recently announced the integration of Amazon Redshift data sharing with AWS Lake Formation. With this feature, Amazon Redshift customers can now manage sharing, apply access policies centrally, and scale permissions effectively using LF-Tags.
Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, Lake Formation support for Amazon Redshift data sharing opens up new design patterns and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles. Lake Formation also provides tag-based access control (TBAC), which can be used to simplify and scale governance of data catalog objects such as databases and tables.
In this post, we discuss this new feature and how to implement TBAC for your data lake and Amazon Redshift data sharing on Lake Formation.
Solution overview
Lake Formation tag-based access control (LF-TBAC) allows you to group similar AWS Glue Data Catalog resources together and define grant or revoke permission policies by using an LF-Tag expression. LF-Tags are hierarchical: when a database is tagged with an LF-Tag, all tables in that database inherit the tag, and when an LF-Tag is applied to a table, all the columns within that table inherit the tag. Inherited tags can then be overridden if needed. You can then create access policies within Lake Formation that use LF-Tag expressions to grant principals access to tagged resources. See Managing LF-Tags for metadata access control for more details.
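The inheritance and override behavior can be illustrated with a small Python model. This is only a sketch of the concept, not how Lake Formation resolves tags internally; the `effective_tags` helper and the sample tag values are assumptions made for this example.

```python
def effective_tags(database_tags, table_tags=None, column_tags=None):
    """Resolve the LF-Tags that a column ends up with.

    Tags flow from database to table to column. A tag set at a lower
    level overrides the inherited value for the same tag key.
    """
    resolved = dict(database_tags)       # start with database-level tags
    resolved.update(table_tags or {})    # table-level values override
    resolved.update(column_tags or {})   # column-level values override
    return resolved


# Hypothetical tags on the shared customer database.
db_tags = {"department": "customer", "classification": "private"}

# A regular column inherits everything from the database.
plain_column = effective_tags(db_tags)

# A PII column overrides only the classification tag.
pii_column = effective_tags(db_tags, column_tags={"classification": "pii-sensitive"})

print(plain_column)
print(pii_column)
```

The PII column keeps the inherited `department` value and changes only `classification`, which is the pattern this post relies on for column-level control.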
To demonstrate LF-TBAC with central data access governance capability, we use the scenario where two separate business units own particular datasets and need to share data across teams.
We have a customer care team that manages and owns the customer information database, including customer demographics data. We also have a marketing team that owns a customer leads dataset, which includes information on prospective customers and contact leads.
To be able to run effective campaigns, the marketing team needs access to the customer data. In this post, we demonstrate the process of sharing this data that is stored in the data warehouse and giving the marketing team access. Furthermore, there are personally identifiable information (PII) columns within the customer dataset that should only be accessed by a subset of power users on a need-to-know basis. This way, data analysts within marketing can only see non-PII columns to be able to run anonymous customer segment analysis, but a group of power users can access PII columns (for example, customer email address) to be able to run campaigns or surveys for specific groups of customers.
The following diagram shows the structure of the datasets that we work with in this post and a tagging strategy to provide fine-grained column-level access.
Beyond our tagging strategy on the data resources, the following table gives an overview of how we should grant permissions to our two personas via tags.
IAM Role | Persona | Resource Type | Permission | LF-Tag expression |
marketing-analyst | A data analyst in the marketing team | DB | describe | (department:marketing OR department:customer) AND classification:private |
. | . | Table | select | (department:marketing OR department:customer) AND classification:private |
marketing-poweruser | A privileged user in the marketing team | DB | describe | (department:marketing OR department:customer) AND classification:private |
. | . | Table (Column) | select | (department:marketing OR department:customer) AND (classification:private OR classification:pii-sensitive) |
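To make the expressions in the table concrete, the following Python sketch evaluates them against tagged columns. It assumes the usual reading of an LF-Tag expression (values for one key are ORed, and different keys are ANDed); the `matches` and `visible_columns` helpers and the column names are hypothetical and exist only for illustration.

```python
def matches(expression, resource_tags):
    """Return True when the resource satisfies every tag condition.

    `expression` maps each tag key to the set of accepted values.
    """
    return all(resource_tags.get(key) in values for key, values in expression.items())


# Table-level expressions for the two personas, taken from the table above.
ANALYST_TABLE_EXPR = {
    "department": {"marketing", "customer"},
    "classification": {"private"},
}
POWERUSER_TABLE_EXPR = {
    "department": {"marketing", "customer"},
    "classification": {"private", "pii-sensitive"},
}

# Hypothetical columns of the shared customer table with their effective tags.
columns = {
    "c_customer_id": {"department": "customer", "classification": "private"},
    "c_email_address": {"department": "customer", "classification": "pii-sensitive"},
}


def visible_columns(expression):
    """List the columns whose tags satisfy the expression."""
    return sorted(name for name, tags in columns.items() if matches(expression, tags))


print(visible_columns(ANALYST_TABLE_EXPR))
print(visible_columns(POWERUSER_TABLE_EXPR))
```

In this model the analyst expression selects only the non-PII column, while the power user expression selects both.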
The following diagram gives a high-level overview of the setup that we deploy in this post.
The following is a high-level overview of how to use Lake Formation to control datashare permissions:
Producer Setup:
- In the producer's AWS account, the Amazon Redshift administrator that owns the customer database creates a Redshift datashare on the producer cluster and grants usage to the AWS Glue Data Catalog in the same account.
- The producer cluster administrator authorizes the Lake Formation account to access the datashare.
- In Lake Formation, the Lake Formation administrator discovers and registers the datashares. They must discover the AWS Glue ARNs they have access to and associate the datashares with an AWS Glue Data Catalog ARN. If you’re using the AWS Command Line Interface (AWS CLI), you can discover and accept datashares with the Redshift CLI operations describe-data-shares and associate-data-share-consumer. To register a datashare, use the Lake Formation CLI operation register-resource.
- The Lake Formation administrator creates a federated database in the AWS Glue Data Catalog; assigns tags to the databases, tables, and columns; and configures Lake Formation permissions to control user access to objects within the datashare. For more information about federated databases in AWS Glue, see Managing permissions for data in an Amazon Redshift datashare.
Consumer Setup:
- On the consumer side (marketing), the Amazon Redshift administrator discovers the AWS Glue database ARNs they have access to, creates an external database in the Redshift consumer cluster using an AWS Glue database ARN, and grants usage to database users authenticated with IAM credentials to start querying the Redshift database.
- Database users can use the views SVV_EXTERNAL_TABLES and SVV_EXTERNAL_COLUMNS to find all the tables or columns within the AWS Glue database that they have access to; then they can query the AWS Glue database’s tables.
When the producer cluster administrator decides to no longer share the data with the consumer cluster, the producer cluster administrator can revoke usage, deauthorize, or delete the datashare from Amazon Redshift. The associated permissions and objects in Lake Formation are not automatically deleted.
Prerequisites:
To follow the steps in this post, you must satisfy the following prerequisites:
- You need an AWS account. If you don’t have an account, you can create one.
- To run AWS CLI commands, you need to set up AWS CloudShell in your account or the AWS CLI on your workstation. For instructions, refer to Getting started with AWS CloudShell or Set up the AWS CLI, respectively.
- You have completed the initial setup of Lake Formation, including changing the default permission model and creating a data lake administrator role. Take note of this role’s ARN to use later in the steps. For simplicity’s sake, you can assign the AdministratorAccess IAM policy to this role, but make sure that in your environment you follow the principle of least privilege.
Deploy environment including producer and consumer Redshift clusters
To follow along with the steps outlined in this post, deploy the following AWS CloudFormation stack, which includes the resources needed for the demonstration:
- Choose Launch stack to deploy a CloudFormation template.
- Provide an IAM role that you have already configured as a Lake Formation administrator.
- Complete the steps to deploy the template and leave all settings as default.
- Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
This CloudFormation stack creates the following resources:
- Producer Redshift cluster – Owned by the customer care team and has customer and demographic data on it.
- Consumer Redshift cluster – Owned by the marketing team and is used to analyze data across data warehouses and data lakes.
- S3 data lake – Contains the web activity and leads datasets.
- Other necessary resources to demonstrate the process of sharing data – For example, IAM roles, Lake Formation configuration, and more. For a full list of resources created by the stack, examine the CloudFormation template.
After you deploy this CloudFormation template, resources created will incur cost to your AWS account. At the end of the process, make sure that you clean up resources to avoid unnecessary charges.
After the CloudFormation stack is deployed successfully (status shows as CREATE_COMPLETE), take note of the following items on the Outputs tab:
- Marketing analyst role ARN
- Marketing power user role ARN
- URL for Amazon Redshift admin password stored in AWS Secrets Manager
Create a Redshift datashare and add relevant tables
On the AWS Management Console, switch to the role that you nominated as Lake Formation admin when deploying the CloudFormation template. Then go to Query Editor v2. If this is the first time you are using Query Editor v2 in your account, follow these steps to configure your AWS account.
The first step in Query Editor is to log in to the customer Redshift cluster using the database admin credentials to make your IAM admin role a DB admin on the database.
- Choose the options menu (three dots) next to the lfunified-customer-dwh cluster and choose Create connection.
- Select Database user name and password.
- Leave Database as dev.
- For User name, enter admin.
- For Password, complete the following steps:
  - Go to the console URL, which is the value of the RedShiftClusterPassword CloudFormation output in the previous step. The URL is the Secrets Manager console for this password.
  - Scroll down to the Secret value section and choose Retrieve secret value.
  - Take note of the password to use later when connecting to the marketing Redshift cluster.
  - Enter this value for Password.
- Choose Create connection.
Create a datashare using a SQL command
Complete the following steps to create a datashare in the data producer cluster (customer care) and share it with Lake Formation:
- On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
- Choose (right-click) the cluster name and choose Edit connection or Create connection.
- For Authentication, select Temporary credentials using your IAM identity.
Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.
- For Database, enter a database name (for this post, dev).
- Choose Create connection to connect to the database.
- Run the following SQL commands to create the datashare and add the data objects to be shared:
- Run the following SQL command to share the customer datashare to the current account via the AWS Glue Data Catalog:
- Verify the datashare was created and objects shared by running the following SQL command:
Take note of the datashare producer cluster namespace and account ID, which will be used in the following step. You can complete the following actions on the console, but for simplicity, we use AWS CLI commands.
- Go to CloudShell or your AWS CLI and run the following AWS CLI command to authorize the datashare to the Data Catalog so that Lake Formation can manage it:
The following is an example output:
Take note of your datashare ARN that you used in this command to use in the next steps.
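As a rough sketch of the producer-side steps in this section, the following Python assembles the Redshift SQL statements, the datashare ARN, and the arguments for the `aws redshift authorize-data-share` command. The datashare name `customer_ds`, the schema and table names, and the account ID, Region, and namespace values are placeholders assumed for illustration; substitute the values from your own environment.

```python
def datashare_sql(share, schema, tables, account_id):
    """Build the SQL that creates a datashare and shares it via the Data Catalog."""
    statements = [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
    ]
    statements += [f"ALTER DATASHARE {share} ADD TABLE {schema}.{table};" for table in tables]
    statements.append(
        f"GRANT USAGE ON DATASHARE {share} TO ACCOUNT '{account_id}' VIA DATA CATALOG;"
    )
    return statements


def datashare_arn(region, account_id, namespace, share):
    """Compose the datashare ARN from the producer namespace and share name."""
    return f"arn:aws:redshift:{region}:{account_id}:datashare:{namespace}/{share}"


def authorize_command(arn, account_id):
    """Arguments for authorizing the datashare to the Data Catalog."""
    return [
        "aws", "redshift", "authorize-data-share",
        "--data-share-arn", arn,
        "--consumer-identifier", f"DataCatalog/{account_id}",
    ]


# Placeholder values (assumptions, not taken from a real account).
ACCOUNT_ID = "123456789012"
arn = datashare_arn("us-east-1", ACCOUNT_ID, "<producer-namespace>", "customer_ds")

for statement in datashare_sql("customer_ds", "public", ["customer"], ACCOUNT_ID):
    print(statement)
print("SHOW DATASHARES;")  # lists datashares so you can verify the result
print(" ".join(authorize_command(arn, ACCOUNT_ID)))
```

The script only prints the statements and the command; run the SQL in Query Editor v2 and the CLI command in CloudShell after reviewing them.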
Accept the datashare in the Lake Formation catalog
To accept the datashare, complete the following steps:
- Run the following AWS CLI command to accept and associate the Amazon Redshift datashare to the AWS Glue Data Catalog:
The following is an example output:
- Register the datashare in Lake Formation:
- Create the AWS Glue database that points to the accepted Redshift datashare:
- To verify, go to the Lake Formation console and check that the database customer_db_shared is created.
Now the data lake administrator can view and grant access on both the database and tables to the data consumer team (marketing) personas using Lake Formation TBAC.
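The accept, register, and create-database steps in this section can be sketched as request payloads in Python. The parameter names follow the boto3 operations `associate_data_share_consumer` (Redshift), `register_resource` (Lake Formation), and `create_database` (AWS Glue) as we understand them; the `accept_datashare_requests` helper and all account values are assumptions for illustration, and the actual calls are shown only in comments so that the block runs without AWS credentials.

```python
def accept_datashare_requests(region, account_id, datashare_arn, database_name):
    """Build the three requests that bring a datashare into the Data Catalog."""
    catalog_arn = f"arn:aws:glue:{region}:{account_id}:catalog"
    return {
        # redshift.associate_data_share_consumer(**requests["associate"])
        "associate": {"DataShareArn": datashare_arn, "ConsumerArn": catalog_arn},
        # lakeformation.register_resource(**requests["register"])
        "register": {"ResourceArn": datashare_arn},
        # glue.create_database(**requests["create_database"])
        "create_database": {
            "CatalogId": account_id,
            "DatabaseInput": {
                "Name": database_name,
                # A federated database points at the accepted datashare.
                "FederatedDatabase": {
                    "Identifier": datashare_arn,
                    "ConnectionName": "aws:redshift",
                },
            },
        },
    }


# Placeholder values (assumptions, not taken from a real account).
share_arn = "arn:aws:redshift:us-east-1:123456789012:datashare:<producer-namespace>/customer_ds"
requests_by_step = accept_datashare_requests(
    "us-east-1", "123456789012", share_arn, "customer_db_shared"
)

for step, payload in requests_by_step.items():
    print(step, payload)
```

Keeping the datashare ARN in one variable helps, because the same value is used by all three requests.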
Assign Lake Formation tags to resources
Before we grant appropriate access to the IAM principals of the data analyst and power user within the marketing team, we have to assign LF-Tags to tables and columns of the customer_db_shared database. We then grant these principals permission to the appropriate LF-Tags.
To assign LF-tags, follow these steps:
- Assign the department and classification LF-Tags to customer_db_shared (Redshift datashare) based on the tagging strategy table in the solution overview. You can run the following actions on the console, but for this post, we use the following AWS CLI command:
If the command is successful, you should get a response like the following:
- Assign the appropriate department and classification LF-Tags to marketing_db (on the S3 data lake):
Note that although you only assign the department and classification tags at the database level, they are inherited by the tables and columns within that database.
- Assign the classification pii-sensitive LF-Tag to PII columns of the customer table to override the inherited value from the database level:
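The three tagging steps above can be sketched as payloads for the Lake Formation `add_lf_tags_to_resource` operation. The helper functions are hypothetical, and the tag values, the table name `public.customer`, and the column name `c_email_address` are assumptions chosen to match the scenario; check them against your own tagging strategy and schema.

```python
def tag_pairs(tags):
    """Convert a {key: value} mapping into a list of LF-Tag pairs."""
    return [{"TagKey": key, "TagValues": [value]} for key, value in tags.items()]


def tag_database(name, tags):
    """Payload that assigns LF-Tags at the database level."""
    return {"Resource": {"Database": {"Name": name}}, "LFTags": tag_pairs(tags)}


def tag_columns(database, table, columns, tags):
    """Payload that overrides LF-Tags on specific columns of a table."""
    return {
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "LFTags": tag_pairs(tags),
    }


tag_requests = [
    # Database-level tags are inherited by all tables and columns.
    tag_database("customer_db_shared", {"department": "customer", "classification": "private"}),
    tag_database("marketing_db", {"department": "marketing", "classification": "private"}),
    # Column-level override for PII (hypothetical table and column names).
    tag_columns("customer_db_shared", "public.customer", ["c_email_address"],
                {"classification": "pii-sensitive"}),
]

# With boto3, each payload could be sent as:
#   lakeformation.add_lf_tags_to_resource(**payload)
for payload in tag_requests:
    print(payload)
```

Only the last request names individual columns; everything else relies on inheritance from the database level.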
Grant permission based on LF-tag association
Run the following two AWS CLI commands to allow the marketing data analyst access to the customer table excluding the pii-sensitive (PII) columns. Replace the value for DataLakePrincipalIdentifier with the MarketingAnalystRoleARN that you noted from the outputs of the CloudFormation stack:
We have now granted marketing analysts access to the customer database and tables that are not pii-sensitive.
To allow marketing power users access to table columns with restricted LF-tag (PII columns), run the following AWS CLI command:
We can combine the grants into a single batch grant permissions call:
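As a sketch of what such a batch call might carry, the following Python builds the `Entries` list for the Lake Formation `batch_grant_permissions` operation from the persona table in the solution overview. The `lf_tag_policy` and `batch_entries` helpers and the role ARNs are placeholders assumed for illustration; use the ARNs from your CloudFormation stack outputs.

```python
DEPARTMENTS = ["marketing", "customer"]


def lf_tag_policy(resource_type, classifications):
    """LF-Tag policy resource: department AND classification conditions."""
    return {
        "LFTagPolicy": {
            "ResourceType": resource_type,
            "Expression": [
                {"TagKey": "department", "TagValues": list(DEPARTMENTS)},
                {"TagKey": "classification", "TagValues": list(classifications)},
            ],
        }
    }


def batch_entries(analyst_arn, poweruser_arn):
    """One entry per row of the persona table in the solution overview."""
    grants = [
        (analyst_arn, "DATABASE", ["private"], ["DESCRIBE"]),
        (analyst_arn, "TABLE", ["private"], ["SELECT"]),
        (poweruser_arn, "DATABASE", ["private"], ["DESCRIBE"]),
        (poweruser_arn, "TABLE", ["private", "pii-sensitive"], ["SELECT"]),
    ]
    return [
        {
            "Id": str(index),
            "Principal": {"DataLakePrincipalIdentifier": arn},
            "Resource": lf_tag_policy(resource_type, classifications),
            "Permissions": permissions,
        }
        for index, (arn, resource_type, classifications, permissions) in enumerate(grants, 1)
    ]


# Placeholder role ARNs (assumptions, not real roles).
entries = batch_entries(
    "arn:aws:iam::123456789012:role/<marketing-analyst-role>",
    "arn:aws:iam::123456789012:role/<marketing-poweruser-role>",
)

# With boto3, the list could be sent as:
#   lakeformation.batch_grant_permissions(Entries=entries)
for entry in entries:
    print(entry["Id"], entry["Permissions"], entry["Resource"]["LFTagPolicy"]["ResourceType"])
```

The only difference between the two personas is the extra `pii-sensitive` value in the power user's table-level expression.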
Validate the solution
In this section, we go through the steps to test the scenario.
Consume the datashare in the consumer (marketing) data warehouse
To enable the consumers (marketing team) to access the customer data shared with them via the datashare, first we have to configure Query Editor v2. This configuration is to use IAM credentials as the principal for the Lake Formation permissions. Complete the following steps:
- Sign in to the console using the admin role you nominated when deploying the CloudFormation template.
- On the Amazon Redshift console, go to Query Editor v2.
- Choose the gear icon in the navigation pane, then choose Account settings.
- Under Connection settings, select Authenticate with IAM credentials.
- Choose Save.
Now let’s connect to the marketing Redshift cluster and make the customer database available to the marketing team.
- Choose the options menu (three dots) next to the Serverless: lfunified-marketing-wg cluster and choose Create connection.
- Select Database user name and password.
- Leave Database as dev.
- For User name, enter admin.
- For Password, enter the same password you retrieved from Secrets Manager in an earlier step.
- Choose Create connection.
- Once successfully connected, choose the plus sign and choose Editor to open a new Query Editor tab.
- Make sure that you specify the Serverless: lfunified-marketing-wg workgroup and dev database.
- To create the Redshift database from the shared catalog database, run the following SQL command on the new tab:
- Run the following SQL commands to create and grant usage on the Redshift database to the IAM roles for the power users and data analyst. You can get the IAM role names from the CloudFormation stack outputs:
Create the data lake schema in AWS Glue and allow the marketing power role to query the lead and web activity data
Run the following SQL commands to make the lead data in the S3 data lake available to the marketing team:
Query the shared dataset as a marketing analyst user
To validate that the marketing team analysts (IAM role marketing-analyst-role) have access to the shared database, perform the following steps:
- Sign in to the console (for convenience, you can use a different browser) and switch your role to lf-redshift-ds-MarketingAnalystRole-XXXXXXXXXXXX.
- On the Amazon Redshift console, go to Query Editor v2.
- To connect to the consumer cluster, choose the Serverless: lfunified-marketing-wg consumer data warehouse in the navigation pane.
- When prompted, for Authentication, select Federated user.
- For Database, enter the database name (for this post, dev).
- Choose Save.
- Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
- To find the federated databases created on the consumer account, run the following SQL command:
- To validate permissions for the marketing analyst role, run the following SQL command:
As you can see in the following screenshot, the marketing analyst is able to successfully access the customer data but only the non-PII attributes, which was our intention.
- Now let’s validate that the marketing analyst doesn’t have access to the PII columns of the same table:
Query the shared datasets as a marketing power user
To validate that the marketing power users (IAM role lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY) have access to pii-sensitive columns in the shared database, perform the following steps:
- Sign in to the console (for convenience, you can use a different browser) and switch your role to lf-redshift-ds-MarketingPoweruserRole-YYYYYYYYYYYY.
- On the Amazon Redshift console, go to Query Editor v2.
- To connect to the consumer cluster, choose the Serverless: lfunified-marketing-wg consumer data warehouse in the navigation pane.
- When prompted, for Authentication, select Federated user.
- For Database, enter the database name (for this post, dev).
- Choose Save.
- Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
- Now let’s validate that the marketing power role has access to the PII columns of the customer table:
- Validate that the power users within the marketing team can now run a query to combine data across different datasets that they have access to in order to run effective campaigns:
Clean up
After you complete the steps in this post, to clean up resources, delete the CloudFormation stack:
- On the AWS CloudFormation console, select the stack you deployed at the beginning of this post.
- Choose Delete and follow the prompts to delete the stack.
Conclusion
In this post, we showed how you can use Lake Formation tags and manage permissions for your data lake and Amazon Redshift data sharing using Lake Formation. Using Lake Formation LF-TBAC for data governance helps you manage your data lake and Amazon Redshift data sharing permissions at scale. Also, it enables data sharing across business units with fine-grained access control. Managing access to your data lake and Redshift datashares in a single place enables better governance, helping with data security and compliance.
If you have questions or suggestions, submit them in the comments section.
For more information on Lake Formation managed Amazon Redshift data sharing and tag-based access control, refer to Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation and Easily manage your data lake at scale using AWS Lake Formation Tag-based access control.
About the Authors
Praveen Kumar is an Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and ML applications.
Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.
Paul Villena is an Analytics Solutions Architect in AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure as code, serverless technologies, and coding in Python.
Mostafa Safipour is a Solutions Architect at AWS based out of Sydney. He works with customers to realize business outcomes using technology and AWS. Over the past decade, he has helped many large organizations in the ANZ region build their data, digital, and enterprise workloads on AWS.