AWS Storage Blog
Streamline data sharing and access control with Informatica Cloud Data Marketplace and Amazon S3 Access Grants
Organizations are modernizing their data lakes on Amazon Simple Storage Service (Amazon S3) to handle the ever-growing data volume and speed while meeting the demands of analytics, machine learning (ML), artificial intelligence (AI), and generative AI applications. To enable a data-driven culture and remain innovative, the data platform must allow for data-centric collaboration across business and IT personas in the organization. This means enabling prompt access and secure delivery, while adhering to the regulatory and compliance standards, and meeting the user’s privacy expectations.
However, many organizations find themselves at a crossroads between providing timely access and making sure that only authorized users have the right level of access to sensitive and personally identifiable data. To effectively address this challenge, organizations need to transition away from complex, inconsistent, and manual data sharing processes. Additionally, they need to re-evaluate policies that were initially established solely to regulate access. To overcome these obstacles and to lay the foundation for more efficient and secure data management practices, organizations need a modern, cloud-native, automated data governance and access control framework that:
- Fosters collaboration across data producers, owners, and consumers through a self-service request and approval workflow for data sharing and delivery.
- Supports fine-grained access controls to protect sensitive information in a data lake at all levels – objects, rows, columns, and even individual cells.
- Scales complex or large permission and access control configurations for data in Amazon S3 across users, roles, and applications.
This post details how organizations can use the integration of the Informatica Intelligent Data Management Cloud™ (IDMC) with Amazon S3 Access Grants to streamline sharing of and access to their data lakes on Amazon S3, while making sure the right guardrails are in place to protect sensitive information.
IDMC is an AI-powered, metadata-driven, persona-based, cloud-native platform built to empower data professionals with comprehensive and cohesive cloud data management capabilities to discover, catalog, ingest, cleanse, integrate, govern, secure, prepare, and master data.
Amazon S3 Access Grants
Amazon S3 Access Grants helps you manage Amazon S3 permissions for your data lakes at scale. With S3 Access Grants, you specify permissions as scalable, intuitive grants. Thereafter, when users or applications want to access Amazon S3, they can request temporary, least-privilege credentials from S3 Access Grants. They can then use the S3 Access Grants-vended credentials to access Amazon S3. Additionally, S3 Access Grants logs the end-user identity, as well as the application used to access Amazon S3 data, in AWS CloudTrail. This helps provide a detailed audit history for all access to the data in your S3 buckets.
S3 Access Grants lets users enforce granular, least-privilege Amazon S3 permissions at scale, and it is an easy way to complement existing resource-level controls such as S3 bucket policies.
As an AWS Data & Analytics partner, Informatica offers simplified and streamlined data sharing and access control that now integrates with Amazon S3 Access Grants. This integration, available at the launch of the S3 Access Grants feature, highlights Informatica’s commitment to enhancing cloud data management and to providing solutions that address key aspects of data sharing and governance on AWS.
Solution overview
Figure 1: Architecture overview
As shown in Figure 1, this solution involves the following IDMC services:
- The Informatica Cloud Data Marketplace (CDMP) fosters collaboration between data owners and consumers. Data owners can categorize their data for consumers to browse and request access for data that matches their needs and interests. Additionally, CDMP maintains request-approval workflow logs for audit and compliance requirements.
- Within Informatica Cloud Data Governance & Catalog (CDGC), the solution uses Informatica Cloud Data Catalog (CDC) to collect and manage the metadata associated with data assets stored in your data lake. Additionally, it employs Informatica’s data management access policy builder to create fine-grained access rules based on metadata and contextual information derived from CDC.
This solution enables data owners to set up fine-grained access rules for their data assets with Informatica’s policy builder, share metadata and other information related to those assets (e.g., data quality, data usage policy), and allow data consumers to browse and request access through CDMP. Upon approval, which can happen either through an automated approval process or by authorization from the data owner, a secure version of the dataset is automatically delivered in accordance with the configured policies. Then, the permissions for the newly provisioned data are configured with Amazon S3 Access Grants APIs, promptly fulfilling the data consumer’s request. In CDMP, data owners get a single-pane view of the consumers who have access to their data and can withdraw access, if needed, through an automated workflow.
Solution walkthrough
Figure 2: Solution workflow
In this scenario (as shown in Figure 2), we have three data community personas:
- A data steward who centrally defines and establishes protection and usage policy across the organization.
- A data owner responsible for data quality, enrichment, and curation of the billing dataset to maximize its usefulness for the company, while making sure that only authorized users have the right level of access for the necessary duration.
- A data scientist (consumer) requesting access to the sensitive billing data, available in an Amazon S3 data lake, to analyze and predict user behavior based on changes in the pricing model.
Figure 3: Informatica Data Access Policy Builder
The data steward defines the organization’s protection policy (as shown in Figure 3) for the dataset pertaining to user billing. When new billing data is added to the data lake and classified in the billing domain by Informatica’s AI engine, the defined policies are automatically associated with the dataset based on the underlying data model and data column classification. Based on the defined policy, the ‘Email’, ‘First Name’, and ‘Social Security Number’ columns are tokenized using consistent hashing to preserve referential integrity.
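To make the idea of consistent hashing concrete, here is a minimal sketch of deterministic tokenization. The post does not describe how Informatica implements tokenization, so this example assumes a keyed HMAC-SHA-256 digest; the key, the helper names `tokenize` and `protect_row`, and the sample rows are all hypothetical. The point it illustrates is that the same input always maps to the same token, so two datasets can still be joined on a tokenized column.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real deployment would load it from a secrets manager.
SECRET_KEY = b"example-only-key"

# Column names taken from the protection policy described in the post.
SENSITIVE_COLUMNS = ("Email", "First Name", "Social Security Number")


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic token: the same input always yields the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated only to keep this sketch readable


def protect_row(row: dict) -> dict:
    """Tokenize the sensitive columns and leave every other column unchanged."""
    return {
        column: tokenize(value) if column in SENSITIVE_COLUMNS else value
        for column, value in row.items()
    }


# Two hypothetical records from different datasets that share an email address.
billing = {"Email": "ana@example.com", "First Name": "Ana", "Amount": 42.5}
profile = {"Email": "ana@example.com", "Plan": "premium"}

protected_billing = protect_row(billing)
protected_profile = protect_row(profile)

# The tokens match, so the two protected datasets can still be joined on "Email".
print(protected_billing["Email"] == protected_profile["Email"])
```

Because the token depends only on the input value and the key, referential integrity survives protection, while the original values are not visible in the delivered copy.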
Figure 4: Informatica Cloud Data Marketplace
Using CDMP (shown in Figure 4), the data owner shares detailed information about the billing dataset, such as its quality and lineage. This helps data scientists and other data consumers in the organization quickly understand the dataset’s lifecycle and features, making it easier for them to decide whether to request access. The data owner also defines a set of delivery targets specifying how and where the dataset is provisioned for the consumer. In this case, given the sensitive information in the billing dataset, the data owner declares the delivery target as ‘CDAM – Amazon S3 Access Grants’, so that the billing dataset is consumed subject to both Informatica’s protection policy and the access permissions defined in S3 Access Grants.
Figure 5: Data access request workflow (CDMP)
The data scientist explores categories of data assets in CDMP. Upon finding the needed billing dataset for training a model to predict consumer behavior with pricing changes, they submit an order to access the sensitive dataset. As part of this order, the data scientist also declares their intended use, provides a business justification, and selects ‘CDAM – Amazon S3 Access Grants’ as the delivery target (as shown in Figure 5).
Figure 6: Data access approval workflow (CDMP)
The billing data owner gets a notification for a new order pending approval. Once the order is approved, an automated workflow enforces the data access protection policy outlined by the data steward. Following this, a de-identified and protected copy of the dataset is provisioned for the data scientist. Finally, the needed object-level access is granted with Amazon S3 Access Grants APIs, allowing the data consumer to access the de-identified copy of the billing dataset. After the fulfillment process concludes, the data scientist receives a notification confirming the order’s completion. The notification includes details about the provisioned dataset. The entire timeline of the access fulfillment process (as shown in Figure 6) is maintained for audit. For a dataset without any sensitive information, the data owner can also configure an automatic approval workflow within CDMP.
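The grant step in this workflow can be sketched with the `create_access_grant` operation of the boto3 `s3control` client. This is an illustration of the underlying API call, not Informatica’s implementation; the account ID, bucket prefix, role ARN, and the helper names `build_read_grant_request` and `create_read_grant` are hypothetical, and the sketch assumes the default `s3://` location is already registered in an S3 Access Grants instance.

```python
def build_read_grant_request(account_id, location_id, sub_prefix, iam_principal_arn):
    """Build the parameters for a READ grant on a prefix under a registered location."""
    return {
        "AccountId": account_id,
        "AccessGrantsLocationId": location_id,
        "AccessGrantsLocationConfiguration": {"S3SubPrefix": sub_prefix},
        "Grantee": {"GranteeType": "IAM", "GranteeIdentifier": iam_principal_arn},
        "Permission": "READ",
    }


def create_read_grant(request):
    """Send the request to S3 Access Grants and return the new grant ID."""
    import boto3  # imported lazily so the request builder works without AWS access

    s3control = boto3.client("s3control")
    return s3control.create_access_grant(**request)["AccessGrantId"]


# Hypothetical values for illustration only.
request = build_read_grant_request(
    account_id="111122223333",
    location_id="default",  # ID of the default s3:// location
    sub_prefix="example-data-lake/billing/deidentified/*",
    iam_principal_arn="arn:aws:iam::111122223333:role/DataScientistRole",
)
print(request["Permission"], request["AccessGrantsLocationConfiguration"]["S3SubPrefix"])
```

Calling `create_read_grant(request)` with valid AWS credentials would create the grant. Scoping the grant to the prefix that holds the protected copy keeps the consumer away from the original, unprotected data.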
Figure 7: Access data from Amazon SageMaker using Amazon S3 Access Grants
The data scientist can now start training the model within Amazon SageMaker using the billing dataset. The data scientist uses the Amazon S3 Access Grants SDK in SageMaker (as shown in Figure 7) to receive the necessary credentials for reading the de-identified data. The data remains subject to the protections of the policy defined by the data steward in Informatica’s data access management: the first name, email address, and Social Security number columns in the billing dataset are tokenized.
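The notebook code in Figure 7 is not reproduced in the text, so the following is only a rough sketch of how a consumer might obtain and use vended credentials with boto3. It relies on the `get_data_access` operation of the `s3control` client; the account ID, S3 target, one-hour duration, and the helper names `build_data_access_request` and `s3_client_from_grant` are assumptions made for illustration.

```python
def build_data_access_request(account_id, s3_target, duration_seconds=3600):
    """Build the parameters for requesting temporary READ credentials."""
    return {
        "AccountId": account_id,
        "Target": s3_target,
        "Permission": "READ",
        "DurationSeconds": duration_seconds,
    }


def s3_client_from_grant(request):
    """Exchange a matching grant for temporary credentials and build an S3 client."""
    import boto3  # imported lazily so the request builder works without AWS access

    credentials = boto3.client("s3control").get_data_access(**request)["Credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )


# Hypothetical values for illustration only.
request = build_data_access_request(
    account_id="111122223333",
    s3_target="s3://example-data-lake/billing/deidentified/*",
)
print(request["Target"], request["DurationSeconds"])
```

With a matching grant in place, `s3_client_from_grant(request)` would return an S3 client limited to the granted scope. Without a matching grant, the `get_data_access` call fails with an access-denied error instead of returning credentials.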
Figure 8: Data Access Withdrawn (CDMP)
After the data scientist finishes training the model in Amazon SageMaker and no longer needs access to the data, they can request a withdrawal of access through CDMP. Additionally, the data owner can revoke access at any time, if necessary.
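At the API level, withdrawing access amounts to deleting the corresponding grant. The sketch below assumes the grant is located by grantee and scope through `list_access_grants` and then removed with `delete_access_grant`, both operations of the boto3 `s3control` client. How CDMP tracks grants internally is not described in the post, so the helper names, the sample listing, and the lookup strategy are hypothetical, and pagination is ignored for brevity.

```python
def grants_to_revoke(access_grants, grant_scope):
    """Return the IDs of the grants whose scope matches the dataset being withdrawn."""
    return [
        grant["AccessGrantId"]
        for grant in access_grants
        if grant.get("GrantScope") == grant_scope
    ]


def revoke_access(account_id, grantee_arn, grant_scope):
    """Delete every grant the grantee holds on the given scope (first page of results only)."""
    import boto3  # imported lazily so the selection logic works without AWS access

    s3control = boto3.client("s3control")
    listed = s3control.list_access_grants(
        AccountId=account_id,
        GranteeType="IAM",
        GranteeIdentifier=grantee_arn,
    )
    revoked = grants_to_revoke(listed.get("AccessGrantsList", []), grant_scope)
    for grant_id in revoked:
        s3control.delete_access_grant(AccountId=account_id, AccessGrantId=grant_id)
    return revoked


# Hypothetical listing, shaped like the AccessGrantsList entries the API returns.
sample_grants = [
    {"AccessGrantId": "grant-billing", "GrantScope": "s3://example-data-lake/billing/deidentified/*"},
    {"AccessGrantId": "grant-other", "GrantScope": "s3://example-data-lake/marketing/*"},
]
selected = grants_to_revoke(sample_grants, "s3://example-data-lake/billing/deidentified/*")
print(selected)
```

Once the grant is deleted, later credential requests for that scope are denied, which is the behavior illustrated in Figure 9.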
Figure 9: Data access denied within Amazon SageMaker
Conclusion
In this post, we illustrated a streamlined, self-service data access management solution that gives data stewards the ability to enforce data protection measures. The solution helps ensure appropriate data usage and controlled access for authorized data consumers, all without sacrificing the prompt access and delivery of data. This approach plays a crucial role in fostering collaboration and data sharing across the organization and in building a data-driven culture.