AWS Machine Learning Blog
Large-scale feature engineering with sensitive data protection using AWS Glue interactive sessions and Amazon SageMaker Studio
Organizations are using machine learning (ML) and AI services to enhance customer experience, reduce operational cost, and unlock new possibilities to improve business outcomes. Data underpins ML and AI use cases and is a strategic asset to an organization. As data is growing at an exponential rate, organizations are looking to set up an integrated, cost-effective, and performant data platform in order to preprocess data, perform feature engineering, and build, train, and operationalize ML models at scale. To achieve that, AWS offers a unified modern data platform that is powered by Amazon Simple Storage Service (Amazon S3) as the data lake with purpose-built tools and processing engines to support analytics and ML workloads. For a unified ML experience, you can use Amazon SageMaker Studio, which offers native integration with AWS Glue interactive sessions to perform feature engineering at scale with sensitive data protection. In this post, we demonstrate how to implement this solution.
Amazon SageMaker is a fully managed ML service that enables you to build, train, and deploy models at scale for a wide range of use cases. For model training, you can use any of the built-in algorithms within SageMaker to get started on training and deploying ML models quickly.
A key component of the model building and development process is feature engineering. AWS Glue is one of the recommended options to achieve feature engineering at scale. AWS Glue enables you to run data integration and transformation in a distributed fashion on a serverless Apache Spark infrastructure, and makes it easy to use the popular Spark ML library for feature engineering and model development. In addition, you can use AWS Glue for incremental data processing through job bookmarks, ingest data from over 100 sources using connectors, and run spiky or unpredictable workloads using auto scaling.
Another important requirement for ML-based applications is data security and access control. It’s a common requirement to have tighter control over who can access the most sensitive data as part of the feature engineering and model building process, following the principle of least privilege. To achieve this, you can use the AWS Glue integration with AWS Lake Formation for increased governance and management of data lake assets. With Lake Formation, you can configure fine-grained data access control and security policies on top of your Amazon S3 data lake. The policies are defined in a central location, allowing multiple analytics and ML services, such as AWS Glue, Amazon Athena, and SageMaker, to interact with data stored in Amazon S3.
AWS Glue includes a personally identifiable information (PII) detection transform that provides the ability to detect, mask, or remove entities as required, for increased compliance and governance. With the PII transform, you can detect PII data in datasets and automatically apply fine-grained access control using Lake Formation to restrict sensitive data for different user groups.
Use case
We focus on a propensity model use case that includes a customer marketing dataset and involves two user personas: a data engineer and data scientist. The dataset contains per-customer information, including lead source, contact notes, job role, some flags, page views per visit, and more. The dataset also includes sensitive information like personal phone numbers.
The data engineer is responsible for building the end-to-end data processing pipeline, including data preparation, preprocessing, and access control. The data scientist is responsible for feature engineering, and training and deploying the ML model. Note that the data scientist is not allowed to access any PII sensitive data for feature engineering or training the ML model.
As part of this use case, the data engineer builds a data pipeline to preprocess the dataset, scans the dataset for any PII information, and restricts the access of the PII column to the data scientist user. As a result, when a data scientist uses the dataset to perform feature engineering and build ML models, they don’t have access to the PII sensitive column (phone numbers, in this case). The feature engineering process involves converting columns of type string to a format that is optimal for ML models. As an advanced use case, you can extend this access pattern to implement row-level and cell-level security using Lake Formation.
Solution overview
The solution contains the following high-level steps:
- Set up resources with AWS CloudFormation.
- Preprocess the dataset, including PII detection and fine-grained access control, on an AWS Glue interactive session.
- Perform feature engineering on an AWS Glue interactive session.
- Train and deploy an ML model using the SageMaker built-in XGBoost algorithm.
- Evaluate the ML model.
The following diagram illustrates the solution architecture.
Prerequisites
To complete this tutorial, you must have the following prerequisites:
- Have an AWS account. If you don’t have an account, you can create one.
- Complete the initial setup of Lake Formation by creating a data lake administrator and changing the default Data Catalog settings to enable fine-grained access control with Lake Formation permissions. For more information, see Setting up AWS Lake Formation. An example AWS Command Line Interface (AWS CLI) command appears after this list.
- Create a SageMaker domain. You can use the Quick setup.
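An example of such a command, assuming an IAM user acting as the data lake administrator (substitute your own account ID and principal; clearing the default Data Catalog permissions is what enables fine-grained Lake Formation control):

```bash
# Register an IAM principal as Lake Formation data lake administrator and clear the
# default database/table permissions so that Lake Formation permissions take effect.
aws lakeformation put-data-lake-settings \
  --data-lake-settings '{
    "DataLakeAdmins": [
      {"DataLakePrincipalIdentifier": "arn:aws:iam::<your-aws-account-id>:user/<your-admin-user>"}
    ],
    "CreateDatabaseDefaultPermissions": [],
    "CreateTableDefaultPermissions": []
  }'
```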
Set up resources with AWS CloudFormation
This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console and the AWS CLI rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.
The CloudFormation template generates the following resources:
- S3 buckets with a sample dataset
- An AWS Lambda function to load the dataset
- AWS Identity and Access Management (IAM) group, users, roles, and policies
- Lake Formation data lake settings and permissions
- SageMaker user profiles
To create your resources, complete the following steps:
- Sign in to the console.
- Choose Launch Stack:
- Choose Next.
- For DataEngineerPwd and DataScientistPwd, enter your own password for the data engineer and data scientist users.
- For GlueDatabaseName, enter demo.
- For GlueTableName, enter web_marketing.
- For S3BucketNameForInput, enter blog-studio-pii-dataset-<your-aws-account-id>.
- For S3BucketNameForOutput, enter blog-studio-output-<your-aws-account-id>.
- For SageMakerDomainId, enter your SageMaker domain ID that you prepared in the prerequisite steps.
- Choose Next.
- On the next page, choose Next.
- Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create.
Stack creation can take up to 10 minutes. The stack creates IAM roles and SageMaker user profiles for two personas: data engineer and data scientist. It also creates a database demo and table web_marketing with a sample dataset.
At the time of stack creation, the data engineer persona has complete access to the table, but the data scientist persona doesn’t have any access to the table yet.
Preprocess the dataset
Let’s start preprocessing data on an AWS Glue interactive session. The data engineer persona wants to verify whether the data contains any sensitive information and grant minimal access permissions to the data scientist persona. You can download the notebook from this location.
- Sign in to the console using the data-engineer user.
- On the SageMaker console, choose Users.
- Select the data-engineer user and choose Open Studio.
- Create a new notebook and choose SparkAnalytics 1.0 for Image and Glue PySpark for Kernel.
- Start an interactive session with the following magic to install a newer version of Boto3 (this is required for using the create_data_cells_filter method; a consolidated sketch of these data engineer steps appears after this list):
- Initialize the session:
- Create an AWS Glue DynamicFrame from the newly created table, and resolve choice types based on the catalog schema, because we want to use the schema defined in the catalog instead of the schema inferred automatically from the data:
- Check whether the table contains any PII data using AWS Glue PII detection:
- Verify whether the columns classified as PII contain sensitive data or not (if not, update classified_map to drop the non-sensitive columns):
- Set up Lake Formation permissions using a data cell filter for the automatically detected columns, and restrict access to those columns for the data scientist persona:
- Log in to Studio as data-scientist to see that the PII columns are not visible. You can download the notebook from this location.
- Create a new notebook and choose SparkAnalytics 1.0 for Image and Glue PySpark for Kernel:
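For reference, the data engineer steps above (session setup, reading the table, PII detection, and the data cell filter) can be sketched as follows. This is a minimal sketch, not the exact notebook code: the EntityDetector call and entity type names are assumptions based on the AWS Glue PII detection transform, the filter name is hypothetical, and you substitute your own account ID and role ARN.

```python
# Glue interactive session magics (run at the top of the notebook, one per line):
# %additional_python_modules boto3==1.24.82   # newer Boto3 for create_data_cells_filter
# %idle_timeout 60

import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Initialize the Glue and Spark contexts for the interactive session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the catalog table and resolve choice types against the catalog schema
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="demo", table_name="web_marketing"
).resolveChoice(choice="match_catalog", database="demo", table_name="web_marketing")

# Detect PII columns (assumed API; check the AWS Glue PII detection docs for the exact signature)
from awsglueml.transforms import EntityDetector
classified_map = EntityDetector().classify_columns(dyf, ["PHONE_NUMBER", "EMAIL"], 1.0, 0.1)
print(classified_map)  # for example: {"phonenumber": ["PHONE_NUMBER"]}

# Create a Lake Formation data cell filter that excludes the detected PII columns,
# then grant SELECT on that filter to the data scientist execution role
account_id = "<your-aws-account-id>"
lf = boto3.client("lakeformation")
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": account_id,
        "DatabaseName": "demo",
        "TableName": "web_marketing",
        "Name": "pii-filter-data-scientist",  # hypothetical filter name
        "RowFilter": {"AllRowsWildcard": {}},
        "ColumnWildcard": {"ExcludedColumnNames": list(classified_map.keys())},
    }
)
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{account_id}:role/SageMakerStudioExecutionRole_data-scientist"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": account_id,
            "DatabaseName": "demo",
            "TableName": "web_marketing",
            "Name": "pii-filter-data-scientist",
        }
    },
    Permissions=["SELECT"],
)
```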
Perform feature engineering
We use the Apache Spark ML library to perform feature engineering as the data-scientist user and then write back the output to Amazon S3.
- In the following cell, we apply feature transformers from the Apache Spark ML library:
  - StringIndexer maps a string column of labels to a column of label indexes.
  - OneHotEncoder maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value that indicates the presence of a specific categorical feature. This transform is used for ML algorithms that expect continuous features.
  - VectorAssembler is a transformer that combines a given list of columns into a single vector column, which is then used in training ML models for algorithms such as logistic regression and decision trees.
- The final transformed DataFrame can be created using the Pipeline library. A pipeline is specified as a sequence of stages. These stages are run in order and the input DataFrame is transformed as it passes through each stage.
- Next, we split the dataset into train, validation, and test DataFrames and save them in the S3 bucket to train the ML model (provide your AWS account ID in the following code):
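Here is a minimal sketch of this feature engineering flow, assuming categorical columns region, jobrole, and usedpromo, numeric columns pageviewspervisit and totalwebvisits, and a binary target column named y (the target column name is an assumption). It flattens the assembled feature vector into label-first CSV, which is the layout the SageMaker built-in XGBoost algorithm expects:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# The data scientist reads the same catalog table; the PII column is filtered out by Lake Formation
df = glueContext.create_dynamic_frame.from_catalog(
    database="demo", table_name="web_marketing"
).toDF()

categorical_cols = ["region", "jobrole", "usedpromo"]    # assumed names
numeric_cols = ["pageviewspervisit", "totalwebvisits"]   # assumed names
label_col = "y"                                          # assumed target column

# StringIndexer: string labels -> label indexes
indexers = [StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
            for c in categorical_cols]
# OneHotEncoder: label indexes -> binary vectors
encoder = OneHotEncoder(inputCols=[f"{c}_idx" for c in categorical_cols],
                        outputCols=[f"{c}_vec" for c in categorical_cols])
# VectorAssembler: combine encoded and numeric columns into a single feature vector
assembler = VectorAssembler(
    inputCols=[f"{c}_vec" for c in categorical_cols] + numeric_cols,
    outputCol="features")

# The stages run in order; the DataFrame is transformed as it passes through each stage
transformed = Pipeline(stages=indexers + [encoder, assembler]).fit(df).transform(df)

# Flatten the feature vector into plain columns, label first, for the built-in XGBoost algorithm
n_features = transformed.select("features").head()["features"].size
flat = (
    transformed
    .withColumn("label", F.col(label_col).cast("double"))
    .withColumn("f", vector_to_array("features"))
    .select(["label"] + [F.col("f")[i].alias(f"f{i}") for i in range(n_features)])
)

# Split into train/validation/test and save to the output bucket
train, val, test = flat.randomSplit([0.7, 0.2, 0.1], seed=42)
output = "s3://blog-studio-output-<your-aws-account-id>"
train.write.mode("overwrite").option("header", "false").csv(f"{output}/train/")
val.write.mode("overwrite").option("header", "false").csv(f"{output}/validation/")
test.write.mode("overwrite").option("header", "false").csv(f"{output}/test/")
```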
Train and deploy an ML model
In the previous section, we completed feature engineering, which included converting string columns such as region, jobrole, and usedpromo into a format that is optimal for ML models. We also included columns such as pageviewspervisit and totalwebvisits, which will help us predict a customer’s propensity to buy a product.
We now train an ML model by reading the train and validation datasets using the SageMaker built-in XGBoost algorithm. Then we deploy the model and run an accuracy check. You can download the notebook from this location.
In the following cell, we’re reading data from the second S3 bucket, which includes the output from our feature engineering operations. Then we use the built-in algorithm XGBoost to train the model.
- Open a new notebook. Choose Data Science for Image and Python 3 for Kernel (provide your AWS account ID in the following code):
- When training is complete, we can deploy the model using SageMaker hosting services:
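A minimal sketch of this training and deployment step, assuming the label-first CSV splits written under train/ and validation/ in the output bucket; the hyperparameters, container version, and instance types are illustrative choices:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
output = "s3://blog-studio-output-<your-aws-account-id>"

# Retrieve the built-in XGBoost container image for the current Region
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"{output}/model/",
    sagemaker_session=session,
)
# Illustrative hyperparameters; tune for your dataset
xgb.set_hyperparameters(objective="binary:logistic", num_round=100, max_depth=5, eta=0.2)

# The CSV files have the label in the first column and no header
xgb.fit({
    "train": TrainingInput(f"{output}/train/", content_type="text/csv"),
    "validation": TrainingInput(f"{output}/validation/", content_type="text/csv"),
})

# Deploy the trained model to a real-time endpoint
predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```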
Evaluate the ML model
We use the test dataset to evaluate the model and delete the inference endpoint when we’re done to avoid any ongoing charges.
- Evaluate the model with the following code:
The accuracy result for the sample run was 84.6%. This could be slightly different for your run due to the random split of the dataset.
- We can delete the inference endpoint with the following code:
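A sketch of the accuracy check and cleanup, assuming the test split written during feature engineering and the predictor returned by deploy in the previous section; the 0.5 decision threshold is illustrative:

```python
import glob
import numpy as np
import pandas as pd
from sagemaker.s3 import S3Downloader
from sagemaker.serializers import CSVSerializer

# Download the test split locally (label first, no header, as written by the feature engineering step)
S3Downloader.download("s3://blog-studio-output-<your-aws-account-id>/test/", "test_data/")
test_df = pd.concat(
    (pd.read_csv(f, header=None) for f in glob.glob("test_data/part-*.csv")),
    ignore_index=True,
)
y_true = test_df.iloc[:, 0].to_numpy()
X_test = test_df.iloc[:, 1:].to_numpy()

# Send the feature rows to the endpoint as CSV (batch the requests for very large test sets)
predictor.serializer = CSVSerializer()
raw = predictor.predict(X_test).decode("utf-8")
y_prob = np.array([float(v) for v in raw.replace("\n", ",").split(",") if v.strip()])
y_pred = (y_prob > 0.5).astype(int)
print("Accuracy:", (y_pred == y_true).mean())

# Delete the inference endpoint to avoid ongoing charges
predictor.delete_endpoint()
```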
Clean up
Now for the final step, cleaning up the resources.
- Empty the two buckets created through the CloudFormation stack.
- Delete the apps associated with the user profiles data-scientist and data-engineer within Studio.
- Delete the CloudFormation stack.
Conclusion
In this post, we demonstrated a solution that enables personas such as data engineers and data scientists to perform feature engineering at scale. With AWS Glue interactive sessions, you can easily achieve feature engineering at scale with automatic PII detection and fine-grained access control without needing to manage any underlying infrastructure. By using Studio as the single entry point, you can get a simplified and integrated experience to build an end-to-end ML workflow: from preparing and securing data to building, training, tuning, and deploying ML models. To learn more, visit Getting started with AWS Glue interactive sessions and Amazon SageMaker Studio.
We are very excited about this new capability and keen to see what you’re going to build with it!
Appendix: Set up resources via the console and the AWS CLI
Complete the instructions in this section to set up resources using the console and AWS CLI instead of the CloudFormation template.
Prerequisites
To complete this tutorial, you must have access to the AWS CLI (see Getting started with the AWS CLI) or use command line access from AWS CloudShell.
Configure IAM group, users, roles, and policies
In this section, we create two IAM users: data-engineer and data-scientist, which belong to the IAM group data-platform-group. Then we add a single IAM policy to the IAM group.
- On the IAM console, create a policy on the JSON tab to create a new IAM managed policy named DataPlatformGroupPolicy. The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their IAM user name. Use the following JSON policy document to provide permissions (an illustrative example appears after this list):
- Create an IAM group called data-platform-group.
- Search for and attach the managed policy DataPlatformGroupPolicy that you created to the group.
- Create IAM users called data-engineer and data-scientist under the IAM group data-platform-group.
- Create a new managed policy named SageMakerExecutionPolicy (provide your Region and account ID in the following code):
- Create a new managed policy named SageMakerAdminPolicy:
- Create an IAM role for SageMaker for the data engineer (data-engineer), which is used as the corresponding user profile’s execution role. On the Attach permissions policy page, AmazonSageMakerFullAccess (AWS managed policy) is attached by default. You remove this policy later to maintain minimum privilege.
- For Role name, use the naming convention introduced at the beginning of this section to name the role SageMakerStudioExecutionRole_data-engineer.
- For Tags, add the key userprofilename and the value data-engineer.
- Choose Create role.
- To add the remaining policies, on the Roles page, choose the role name you just created.
- Under Permissions, remove the policy AmazonSageMakerFullAccess.
- On the Attach permissions policy page, select the AWS managed policy AwsGlueSessionUserRestrictedServiceRole, and the customer managed policies SageMakerExecutionPolicy and SageMakerAdminPolicy that you created.
- Choose Attach policies.
- Modify your role’s trust relationship:
- Create an IAM role for SageMaker for the data scientist (data-scientist), which is used as the corresponding user profile’s execution role.
- For Role name, name the role SageMakerStudioExecutionRole_data-scientist.
- For Tags, add the key userprofilename and the value data-scientist.
- Choose Create role.
- To add the remaining policies, on the Roles page, choose the role name you just created.
- Under Permissions, remove the policy AmazonSageMakerFullAccess.
- On the Attach permissions policy page, select the AWS managed policy AwsGlueSessionUserRestrictedServiceRole, and the customer managed policy SageMakerExecutionPolicy that you created.
- Choose Attach policies.
- Modify your role’s trust relationship:
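An illustrative version of the DataPlatformGroupPolicy statements referenced in the first step, assuming user profiles tagged with studiouserid; it limits presigned domain URL creation to the user profile whose tag matches the caller’s IAM user name (the policy in your environment may include additional statements):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowStudioAccessForOwnProfileOnly",
      "Effect": "Allow",
      "Action": "sagemaker:CreatePresignedDomainUrl",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "sagemaker:ResourceTag/studiouserid": "${aws:username}"
        }
      }
    },
    {
      "Sid": "AllowListingDomainsAndProfiles",
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListDomains",
        "sagemaker:DescribeDomain",
        "sagemaker:ListUserProfiles",
        "sagemaker:DescribeUserProfile"
      ],
      "Resource": "*"
    }
  ]
}
```

Both execution roles use the same trust relationship, which must allow SageMaker and AWS Glue to assume the role (AWS Glue is required for interactive sessions launched from Studio). An example trust policy document:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": ["sagemaker.amazonaws.com", "glue.amazonaws.com"]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```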
Configure SageMaker user profiles
To create your SageMaker user profiles with the studiouserid tag, complete the following steps:
- Use the AWS CLI or CloudShell to create the Studio user profile for the data engineer (provide your account ID and Studio domain ID in the following code):
- Repeat the step to create a user profile for the data scientist, replacing the account ID and Studio domain ID:
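An example create-user-profile call for the data engineer, assuming the SageMakerStudioExecutionRole_data-engineer role created earlier; repeat it with data-scientist substituted throughout to create the data scientist profile:

```bash
aws sagemaker create-user-profile \
  --domain-id <your-studio-domain-id> \
  --user-profile-name data-engineer \
  --tags Key=studiouserid,Value=data-engineer \
  --user-settings ExecutionRole=arn:aws:iam::<your-aws-account-id>:role/SageMakerStudioExecutionRole_data-engineer
```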
Create S3 buckets and upload the sample dataset
In this section, you create two S3 buckets. The first bucket has a sample dataset related to web marketing. The second bucket is used by the data scientist to store output from feature engineering tasks, and this output dataset is used to train the ML model.
First, create the S3 bucket for the input data:
- Download the dataset.
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Choose Create bucket.
- For Region, choose the Region with the SageMaker domain that includes the user profiles you created.
- For Bucket name, enter blog-studio-pii-dataset-<your-aws-account-id>.
- Choose Create bucket.
- Select the bucket you created and choose Upload.
- In the Select files section, choose Add files and upload the dataset you downloaded.
Now you create the bucket for the output data:
- On the Buckets page, choose Create bucket.
- For Region, choose the Region with the SageMaker domain that includes the user profiles you created.
- For Bucket name, enter blog-studio-output-<your-aws-account-id>.
- Choose Create bucket.
Create an AWS Glue database and table
In this section, you create an AWS Glue database and table for the dataset.
- On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
- Choose Add database.
- For Name, enter demo.
- Choose Create database.
- Under Data catalog, choose Tables, and then choose Create table.
- For Name, enter web_marketing.
- For Database, select demo.
- For Classification, choose CSV.
- Under Schema, choose Upload Schema.
- Enter the following JSON array into the text box (an illustrative example appears after this list):
- Choose Upload.
- Choose Submit.
- Under Table details, choose Edit table.
- Under Table properties, choose Add.
- For Key, enter skip.header.line.count, and for Value, enter 1.
- Choose Save.
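An abbreviated, illustrative schema array based on the columns mentioned in this post; the actual dataset contains additional columns, so treat the names and types below as assumptions and use the full schema that ships with the dataset:

```json
[
  { "Name": "leadsource", "Type": "string" },
  { "Name": "jobrole", "Type": "string" },
  { "Name": "region", "Type": "string" },
  { "Name": "usedpromo", "Type": "string" },
  { "Name": "pageviewspervisit", "Type": "double" },
  { "Name": "totalwebvisits", "Type": "bigint" },
  { "Name": "phonenumber", "Type": "string" },
  { "Name": "y", "Type": "bigint" }
]
```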
Configure Lake Formation permissions
In this section, you set up Lake Formation permissions to allow the IAM role SageMakerStudioExecutionRole_data-engineer to create a database and register the S3 location within Lake Formation.
First, register the data lake location to manage tables under the location in Lake Formation permissions:
- Choose Data lake locations.
- Choose Register location.
- For Amazon S3 path, enter s3://blog-studio-pii-dataset-<your-aws-account-id>/ (the bucket that contains the dataset).
- Choose Register location.
Now you grant Lake Formation database and table permissions to the IAM roles SageMakerStudioExecutionRole_data-engineer and SageMakerStudioExecutionRole_data-scientist.
First, grant database permission for SageMakerStudioExecutionRole_data-engineer:
- Under Permissions, choose Data lake permissions.
- Under Data permission, choose Grant.
- For Principals, choose IAM users and roles, and select the role SageMakerStudioExecutionRole_data-engineer.
- For Policy tags or catalog resources, choose Named data catalog resources.
- For Databases, choose demo.
- For Database permissions, select Super.
- Choose Grant.
Next, grant table permission for SageMakerStudioExecutionRole_data-engineer:
- Under Data permission, choose Grant.
- For Principals, choose IAM users and roles, and select the role SageMakerStudioExecutionRole_data-engineer.
- For Policy tags or catalog resources, choose Named data catalog resources.
- For Databases, choose demo.
- For Tables, choose web_marketing.
- For Table permissions, select Super.
- For Grantable permissions, select Super.
- Choose Grant.
Finally, grant database permission for SageMakerStudioExecutionRole_data-scientist:
- Under Data permission, choose Grant.
- For Principals, choose IAM users and roles, and select the role SageMakerStudioExecutionRole_data-scientist.
- For Policy tags or catalog resources, choose Named data catalog resources.
- For Databases, choose demo.
- For Database permissions, select Describe.
- Choose Grant.
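If you prefer the AWS CLI for these grants, the equivalent calls look roughly like the following (Super in the console corresponds to ALL in the API; substitute your own account ID):

```bash
# Database-level Super (ALL) for the data engineer role
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<your-aws-account-id>:role/SageMakerStudioExecutionRole_data-engineer \
  --permissions ALL \
  --resource '{"Database": {"Name": "demo"}}'

# Table-level Super (ALL), with the grant option, for the data engineer role
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<your-aws-account-id>:role/SageMakerStudioExecutionRole_data-engineer \
  --permissions ALL \
  --permissions-with-grant-option ALL \
  --resource '{"Table": {"DatabaseName": "demo", "Name": "web_marketing"}}'

# Database-level Describe for the data scientist role
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<your-aws-account-id>:role/SageMakerStudioExecutionRole_data-scientist \
  --permissions DESCRIBE \
  --resource '{"Database": {"Name": "demo"}}'
```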
About the Authors
Praveen Kumar is an Analytics Solution Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-native services. His areas of interests are serverless technology, modern cloud data warehouses, streaming, and ML applications.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver results like this post. In his spare time, he enjoys playing video games with his family.