AWS Partner Network (APN) Blog
How Protegrity Helps Protect PII and PHI Data at Scale on Amazon S3 with AWS Lambda
By Muneeb Hasan, Sr. Partner Solution Engineer – Protegrity
By Tamara Astakhova, Sr. Partner Solution Architect – AWS
By Venkatesh Aravamudan, Partner Solution Architect – AWS
With the ever-growing need for enterprise data to migrate to the cloud, and the necessity of keeping that data secure, organizations are searching for tools that enable migration while meeting regulatory requirements for data security and privacy.
To meet these needs for customers, Protegrity, an AWS Data and Analytics Competency Partner and global leader in data security, has introduced new solutions leveraging Amazon Simple Storage Service (Amazon S3), and these products are now available via AWS Marketplace:
- Cloud Protect for S3 Financial Markets
- Cloud Protect for S3 Healthcare
- Cloud Protect for S3 Manufacturing
Cloud Protect for S3 enables you to secure your sensitive data in Amazon S3 with Protegrity technology such as tokenization.
Protegrity provides a data tokenization REST API endpoint that is built on a serverless AWS Lambda architecture. The product is designed to scale elastically and yield reliable query performance under high concurrent loads, and it is ideal for complex extract, transform, load (ETL) use cases.
In this post, we will describe how customers can use Protegrity products to accelerate their personally identifiable information (PII) and protected health information (PHI) data protection efforts at scale in the cloud.
About Tokenization
Tokenization is a common data protection technique used by organizations to reduce risk of exposure and meet regulatory and privacy compliance requirements. In privacy use cases, for example, tokens are substituted for fields such as PII, protecting the identity of the individual while preserving the analytical value of the data.
Protected data is free to move from private networks to the public cloud, into software-as-a-service (SaaS) platforms, or to third-party processors without increasing the risk of exposure or jeopardizing privacy and regulatory compliance.
Tokenized data embeds an inherent access control layer that follows the data regardless of where it resides: databases, data warehouses, data lakes, or cloud storage. When access to the original value is required, an authorized user can swap the token for it based on their need to know.
Figure 1 shows an example of tokenized or de-identified PII data preserving potential analytic usability. The email is tokenized while the domain name is kept in the clear. The date of birth (DOB) is tokenized except for the year. Other fields in green are fully tokenized.
This example tokenization strategy allows for age-based analytics on balance, credit, and medical data.
Figure 1 – Example tokenized data.
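The partial tokenization shown in Figure 1 can be illustrated with a small sketch. This is a toy, vault-style tokenizer written only for this illustration; it is not Protegrity's algorithm, and the class name `ToyTokenVault`, the demo key, and the `YYYY-MM-DD` date format are all assumptions.

```python
import hashlib
import hmac
import string

# Hypothetical demo key; real key material would never be hard-coded.
DEMO_KEY = b"demo-only-key"


class ToyTokenVault:
    """Toy vault-style tokenizer for illustration; not Protegrity's algorithm."""

    def __init__(self, key: bytes = DEMO_KEY) -> None:
        self._key = key
        self._vault: dict[str, str] = {}  # token -> original value

    def _substitute(self, value: str, alphabet: str) -> str:
        # Derive a deterministic, same-length token from an HMAC of the value.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).digest()
        token = "".join(
            alphabet[digest[i % len(digest)] % len(alphabet)]
            for i in range(len(value))
        )
        self._vault[token] = value  # toy vault: collisions are not handled
        return token

    def tokenize_email(self, email: str) -> str:
        # Tokenize the local part; keep the domain in the clear.
        local, _, domain = email.rpartition("@")
        return f"{self._substitute(local, string.ascii_lowercase)}@{domain}"

    def detokenize_email(self, token: str) -> str:
        # Only callers with access to the vault can recover the original.
        local, _, domain = token.rpartition("@")
        return f"{self._vault[local]}@{domain}"

    def tokenize_dob(self, dob: str) -> str:
        # Expects YYYY-MM-DD; keeps the year, tokenizes month and day digits.
        year, month, day = dob.split("-")
        token = self._substitute(month + day, string.digits)
        return f"{year}-{token[:2]}-{token[2:]}"
```

Because the domain and the birth year stay in the clear, grouping by email provider or by age band still works on the tokenized values, while recovering the full original requires access to the vault.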
Solution Overview and Architecture
In this solution, Protegrity built an ETL process that uses two Amazon S3 buckets to separate input data from output data: a landing zone for incoming sensitive data and a processed zone that stores the resulting protected data.
The S3 protector is triggered as new files land in the landing zone bucket, and it reads and processes the data based on a configuration file. The protected data, meanwhile, is written to a file in the processed zone bucket. The protected data can be the basis of a secure data lake for Amazon Athena or Amazon EMR, or loaded into a data warehouse such as Amazon Redshift.
Protegrity’s Cloud Protect offers protectors for these services and enables authorized users to unprotect the data on read.
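To make the trigger flow above more concrete, here is a minimal sketch of how a Lambda handler might turn an S3 event notification into a list of files to protect. It only plans the work and does not call any Protegrity API. The `ProtectJob` type and `plan_jobs` function are hypothetical names, and the sketch assumes that the configuration file is named mapping.json, sits in the same folder as the incoming file, and that output keeps the same key in the processed zone bucket.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import unquote_plus

MAPPING_FILE = "mapping.json"  # configuration file name used in this post


@dataclass(frozen=True)
class ProtectJob:
    """Hypothetical description of one file to protect."""
    source_bucket: str
    source_key: str
    mapping_key: str
    target_bucket: str
    target_key: str


def plan_jobs(event: dict, processed_bucket: str) -> list[ProtectJob]:
    """Translate an S3 event notification into protect jobs."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        folder, filename = posixpath.split(key)
        if not filename or filename == MAPPING_FILE:
            continue  # folder markers and configuration files are not data
        jobs.append(
            ProtectJob(
                source_bucket=bucket,
                source_key=key,
                mapping_key=posixpath.join(folder, MAPPING_FILE),
                target_bucket=processed_bucket,
                target_key=key,  # assumption: same path in the processed zone
            )
        )
    return jobs
```

Keeping the planning step free of AWS calls makes it easy to unit test with a hand-written event dictionary before wiring it to real buckets.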
The S3 protector supports the following file formats:
- Text formats (comma-delimited, tab-delimited, custom)
- Parquet
- JSON
- Excel
* Files may be optionally gzipped.
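As an illustration of how the format list above could be handled in code, the following sketch infers a file's format and whether it is gzipped from its object key. The extension table and the `detect_format` function are assumptions made for this example; the product itself is driven by its configuration file and may identify formats differently.

```python
import posixpath

# Hypothetical extension table; not taken from the product documentation.
FORMATS = {
    ".csv": "text",
    ".tsv": "text",
    ".txt": "text",
    ".parquet": "parquet",
    ".json": "json",
    ".xlsx": "excel",
}


def detect_format(key: str) -> tuple[str, bool]:
    """Return (format, gzipped) for an S3 object key."""
    name = posixpath.basename(key).lower()
    gzipped = name.endswith(".gz")
    if gzipped:
        name = name[: -len(".gz")]  # look at the extension underneath
    extension = posixpath.splitext(name)[1]
    if extension not in FORMATS:
        raise ValueError(f"unsupported file type: {key}")
    return FORMATS[extension], gzipped
```

For example, `detect_format("CSV/userdata.csv.gz")` reports a gzipped text file, while an unknown extension raises an error rather than being guessed at.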
The Cloud Protect S3 solution is deployed on AWS Lambda and invokes the Protegrity Cloud API on AWS to protect the data.
Figure 2 – Protegrity S3 file protector architecture diagram.
The solution scales to process thousands of files in parallel, up to regional AWS quotas. A separate Lambda instance processes each file, so the maximum file size is bounded by the Lambda timeout period, which in practice means files up to approximately 3 GB. Larger files can be split to provide greater parallelism and to ensure processing completes within the maximum 15-minute Lambda timeout period.
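Splitting a large file before it lands can be sketched as follows. This is one possible approach for CSV data, not a utility shipped with the product; the `split_csv` name and the rows-per-chunk parameter are assumptions. Each chunk repeats the header row so that it can be processed independently by its own Lambda invocation.

```python
import csv
import io


def _render(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a header and rows back into CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def split_csv(text: str, rows_per_chunk: int) -> list[str]:
    """Split CSV text into chunks that each carry the header row."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    # csv.reader handles quoted fields that contain commas or newlines.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)
    if header is None:
        return []
    chunks, rows = [], []
    for row in reader:
        rows.append(row)
        if len(rows) == rows_per_chunk:
            chunks.append(_render(header, rows))
            rows = []
    if rows:
        chunks.append(_render(header, rows))
    return chunks
```

This version holds the whole file in memory for simplicity; for multi-gigabyte inputs a streaming variant that writes each chunk as soon as it fills would be the more practical choice.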
Below are example benchmarks for different CSV file sizes:
Figure 3 – Benchmark example.
Using the Cloud Protect for S3
Cloud Protect for S3 is available via AWS Marketplace as a self-hosted solution that’s installed on your account.
For this post, we assume you have subscribed to Cloud Protect for S3. If not, you can take advantage of a 30-day free trial; you will only be charged for the AWS resources used in your account.
Once registered, log in to the Protegrity Cloud Protect SaaS Portal and create a deployment. For additional information, visit Support – Protegrity Cloud Protect.
First, you will need to set up the Amazon S3 platform.
Log in to the Cloud Protect SaaS portal, select the deployment you have created, click on Setup platform, select the Amazon S3 icon, and click Continue.
Next, you can use landing zone and processed zone S3 buckets already in your AWS account or let the installation create them for you.
Figure 4 – Provide bucket names.
Click on Start Install to open the CloudFormation template in your AWS account.
AWS CloudFormation creates and configures the Lambda function and the trigger to run when files land in the S3 landing zone bucket. It also configures the permissions to call the Cloud API Lambda created for your deployment.
Figure 5 – Click Start Install for setup.
Click on Done, and you are ready to start protecting data files.
Protecting PII Data in a CSV File
The userdata.csv file contains fake PII data such as FIRST_NAME, LAST_NAME, EMAIL, SSN, STREET, CITY, BIRTHDAY, IBAN, and CC.
Figure 6 – Example of CSV file with sensitive field/data.
Next, you will need to create the mapping.json configuration file, which provides the processing instructions for files. It must be placed in the S3 folder where incoming files of a particular dataset will land. A folder should be designated for each distinct dataset type (files with the same schema and format).
The following example mapping file would protect the columns FIRST_NAME, LAST_NAME, EMAIL, SSN, STREET, CITY, BIRTHDAY, IBAN, and CC. Only columns with sensitive data should be declared. The S3 protector will copy any fields not defined in the mapping file.
Log in to the AWS Management Console where the solution was installed, and open the S3 landing zone (raw) bucket defined during the installation.
Create a directory named “CSV” and upload the mapping.json and the userdata.csv files into that directory.
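The same upload can be scripted instead of done through the console. The sketch below is one way to do it under a few assumptions: the `upload_dataset` helper and the `S3Uploader` protocol are hypothetical names, and the client passed in is expected to expose an `upload_file(Filename, Bucket, Key)` method, as the boto3 S3 client does.

```python
import os
import posixpath
from typing import Iterable, Protocol


class S3Uploader(Protocol):
    """Minimal interface this sketch needs from an S3 client."""

    def upload_file(self, Filename: str, Bucket: str, Key: str) -> None: ...


def upload_dataset(
    s3: S3Uploader, bucket: str, folder: str, paths: Iterable[str]
) -> list[str]:
    """Upload local files into one folder of the landing zone bucket."""
    keys = []
    for path in paths:
        # Each file lands under the dataset folder, keeping its file name.
        key = posixpath.join(folder, os.path.basename(path))
        s3.upload_file(Filename=path, Bucket=bucket, Key=key)
        keys.append(key)
    return keys
```

With boto3 installed, a call could look like `upload_dataset(boto3.client("s3"), "my-landing-zone", "CSV", ["mapping.json", "userdata.csv"])`, where the bucket name is a placeholder. Listing mapping.json first is a cautious ordering, on the assumption that processing starts as soon as the data file lands.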
Once the Lambda function finishes its job, the userdata.csv file will be removed from the landing zone, and the protected CSV file will be stored in the processed zone bucket under the “CSV” directory. Notice that all PII columns were replaced with tokenized data.
Figure 7 – CSV file with tokenized data after S3 protection.
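If you keep a copy of the original file, a quick sanity check can confirm that the declared columns were tokenized and that all other columns were copied unchanged. The following is a minimal sketch with a hypothetical `check_protection` helper; it assumes the protected file keeps the same header and row order as the input, which this post does not state explicitly.

```python
import csv
import io


def check_protection(
    original_csv: str, protected_csv: str, protected_columns: set[str]
) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    original = list(csv.DictReader(io.StringIO(original_csv)))
    protected = list(csv.DictReader(io.StringIO(protected_csv)))
    if len(original) != len(protected):
        return [f"row count changed: {len(original)} -> {len(protected)}"]
    problems = []
    for number, (before, after) in enumerate(zip(original, protected), start=1):
        for column, value in before.items():
            if column in protected_columns:
                # A non-empty sensitive value should no longer appear as-is.
                if value and after.get(column) == value:
                    problems.append(f"row {number}: {column} still in the clear")
            elif after.get(column) != value:
                # Columns outside the mapping should be copied verbatim.
                problems.append(f"row {number}: {column} changed unexpectedly")
    return problems
```

A check like this only compares values for equality, so it cannot judge the quality of the tokens themselves; it is meant as a guard against a missing or misnamed column in the mapping file.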
More sample data and mapping.json files are available in the Cloud Protect SaaS portal documentation.
Figure 8 – More sample data for different data formats.
Conclusion
By performing tokenization operations with Protegrity Cloud Protect for S3, customers can accelerate their PII and PHI data protection at scale on AWS.
Cloud Protect for S3 is a cost-effective, scalable, and performant ETL tool for protecting data in the cloud. Use it to reduce your business barriers to migrating to the cloud and unlock the massive potential of cloud technologies.
Cloud Protect for S3 is available through AWS Marketplace:
- Cloud Protect for S3 Financial Markets
- Cloud Protect for S3 Healthcare
- Cloud Protect for S3 Manufacturing
To learn more about Protegrity Cloud Protect, visit the Protegrity website.
Protegrity – AWS Partner Spotlight
Protegrity is an AWS Data and Analytics Competency Partner that provides fine-grained data protection capabilities (tokenization, encryption, masking) for sensitive data and compliance.