AWS Big Data Blog
Apply fine-grained access and transformation on the SUPER data type in Amazon Redshift
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
Amazon Redshift, a cloud data warehouse service, supports attaching dynamic data masking (DDM) policies to paths of SUPER data type columns, and uses the OBJECT_TRANSFORM function with the SUPER data type. SUPER data type columns in Amazon Redshift contain semi-structured data like JSON documents. Previously, data masking in Amazon Redshift only worked with regular table columns, but now you can apply masking policies specifically to elements within SUPER columns. For example, you could apply a masking policy to mask sensitive fields like credit card numbers within JSON documents stored in a SUPER column. This allows for more granular control over data masking in Amazon Redshift. Amazon Redshift gives you more flexibility in how you apply data masking to protect sensitive information stored in SUPER columns containing semi-structured data.
With DDM support in Amazon Redshift, you can do the following:
- Define masking policies that apply custom obfuscation policies, such as masking policies to handle credit card, personally identifiable information (PII) entries, HIPAA or GDPR needs, and more
- Transform the data at query time to apply masking policies
- Attach masking policies to roles or users
- Attach multiple masking policies with varying levels of obfuscation to the same column in a table and assign them to different roles with priorities to avoid conflicts
- Implement cell-level masking by using conditional columns when creating your masking policy
- Use masking policies to partially or completely redact data, or hash it by using user-defined functions (UDFs)
In this post, we demonstrate how a retail company can control the access of PII data stored in the SUPER data type to users based on their access privilege without duplicating the data.
Solution overview
For our use case, we have the following data access requirements:
- Users from the Customer Service team should be able to view the order data but not PII information
- Users from the Sales team should be able to view customer IDs and all order information
- Users from the Executive team should be able to view all the data
- Staff should not be able to view any data
The following diagram illustrates how DDM support in Amazon Redshift policies works with roles and users for our retail use case.
The solution encompasses creating masking policies with varying masking rules and attaching one or more to the same role and table with an assigned priority to remove potential conflicts. These policies may pseudonymize results or selectively nullify results to comply with retailers’ security requirements. We refer to multiple masking policies being attached to a table as a multi-modal masking policy. A multi-modal masking policy consists of three parts:
- A data masking policy that defines the data obfuscation rules
- Roles with different access levels depending on the business case
- The ability to attach multiple masking policies on a user or role and table combination with priority for conflict resolution
Prerequisites
To implement this solution, you need the following prerequisites:
- An AWS account.
- An Amazon Redshift cluster or an Amazon Redshift Serverless endpoint.
- Superuser privilege or the
sys:secadmin
role on the Amazon Redshift data warehouse
Prepare the data
To set up our use case, complete the following steps:
- On the Amazon Redshift console, choose Query editor v2 under Explorer in the navigation pane.
If you’re familiar with SQL Notebooks, you can download the SQL notebook for the demonstration and import it to quickly get started.
- Create the table and populate contents:
Implement the solution
To satisfy the security requirements, we need to make sure that each user sees the same data in different ways based on their granted privileges. To do that, we use user roles combined with masking policies as follows:
- Create users and roles, and add users to their respective roles:
- Create masking policies:
- Attach the masking policies:
- Attach the masking policy for the customer service use case:
- Attach the masking policy for the sales use case:
- Attach the masking policy for the staff use case:
Test the solution
Let’s confirm that the masking policies are created and attached.
- Check that the masking policies are created with the following code:
- Check that the masking policies are attached:
Now you can test that different users can see the same data masked differently based on their roles.
- Test that the customer support can’t see customer PHI/PII data but can see the order ID, order details, and status:
- Test that the sales team can see the customer ID (non PII data) and all order information:
- Test that the executives can see all data:
- Test that the staff can’t see any data about the order. All columns should masked for them.
Object_Transform function
In this section, we dive into the capabilities and benefits of the OBJECT_TRANSFORM function and explore how it empowers you to efficiently reshape your data for analysis. The OBJECT_TRANSFORM function in Amazon Redshift is designed to facilitate data transformations by allowing you to manipulate JSON data directly within the database. With this function, you can apply transformations to semi-structured or SUPER data types, making it less complicated to work with complex data structures in a relational database environment.
Let’s look at some usage examples.
First, create a table and populate contents:
Apply the transformations with the OBJECT_TRANSFORM
function:
As you can see in the example, by applying the transformation with OBJECT_TRANSFORM
, the person name is formatted in lowercase and the salary is increased by 10%. This demonstrates how the transformation makes is less complicated to work with semi-structured or nested data types.
Clean up
When you’re done with the solution, clean up your resources:
- Detach the masking policies from the table:
- Drop the masking policies:
- Revoke or drop the roles and users:
- Drop the table:
Considerations and best practices
Consider the following when implementing this solution:
- When attaching a masking policy to a path on a column, that column must be defined as the SUPER data type. You can only apply masking policies to scalar values on the SUPER path. You can’t apply masking policies to complex structures or arrays.
- You can apply different masking policies to multiple scalar values on a single SUPER column as long as the SUPER paths don’t conflict. For example, the SUPER paths a.b and a.b.c conflict because they’re on the same path, with a.b being the parent of a.b.c. The SUPER paths a.b.c and a.b.d don’t conflict.
Refer to Using dynamic data masking with SUPER data type paths for more details on considerations.
Conclusion
In this post, we discussed how to use DDM support for the SUPER data type in Amazon Redshift to define configuration-driven, consistent, format-preserving, and irreversible masked data values. With DDM support in Amazon Redshift, you can control your data masking approach using familiar SQL language. You can take advantage of the Amazon Redshift role-based access control capability to implement different levels of data masking. You can create a masking policy to identify which column needs to be masked, and you have the flexibility of choosing how to show the masked data. For example, you can completely hide all the information of the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or Lambda UDFs. Additionally, you can apply conditional masking based on other columns, which selectively protects the column data in a table based on the values in one or more columns.
We encourage you to create your own user-defined functions for various use cases and achieve your desired security posture using dynamic data masking support in Amazon Redshift.
About the Authors
Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.
Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.
Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.