AWS Big Data Blog
Create, train, and deploy Amazon Redshift ML model integrating features from Amazon SageMaker Feature Store
Amazon Redshift is a fast, petabyte-scale, cloud data warehouse that tens of thousands of customers rely on to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies. Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using SQL commands familiar to many roles such as executives, business analysts, and data analysts. We covered in a previous post how you can use data in Amazon Redshift to train models in Amazon SageMaker, a fully managed ML service, and then make predictions within your Redshift data warehouse.
Redshift ML currently supports ML algorithms such as XGBoost, multilayer perceptron (MLP), KMEANS, and Linear Learner. Additionally, you can import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.
Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. However, one challenge in training a production-ready ML model using SageMaker Feature Store is access to a diverse set of features that aren’t always owned and maintained by the team that is building the model. For example, an ML model to identify fraudulent financial transactions needs access to both identity-related features (device type, browser) and transaction-related features (amount, credit or debit, and so on). As a data scientist building an ML model, you may have access to the identifying information but not the transaction information; a shared feature store solves this problem.
In this post, we discuss the combined feature store pattern, which allows teams to maintain their own local feature stores using a local Redshift table while still being able to access shared features from the centralized feature store. In a local feature store, you can store sensitive data that can’t be shared across the organization for regulatory and compliance reasons.
We also show you how to use familiar SQL statements to create and train ML models by combining shared features from the centralized store with local features and use these models to make in-database predictions on new data for use cases such as fraud risk scoring.
Overview of solution
For this post, we create an ML model to predict if a transaction is fraudulent or not, given the transaction record. To build this, we need to engineer features that describe an individual credit card’s spending pattern, such as the number of transactions or the average transaction amount, and also information about the merchant, the cardholder, the device used to make the payment, and any other data that may be relevant to detecting fraud.
To get started, we need an Amazon Redshift Serverless data warehouse with the Redshift ML feature enabled and an Amazon SageMaker Studio environment with access to SageMaker Feature Store. For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.
We also need an offline feature store to store features in feature groups. The offline store uses an Amazon Simple Storage Service (Amazon S3) bucket for storage and can also fetch data using Amazon Athena queries. For an introduction to SageMaker Feature Store and instructions on setting it up, see Getting started with Amazon SageMaker Feature Store.
The following diagram illustrates the solution architecture.
The workflow contains the following steps:
- Create the offline feature group in SageMaker Feature Store and ingest data into the feature group.
- Create a Redshift table and load local feature data into the table.
- Create an external schema for Amazon Redshift Spectrum to access the offline store data stored in Amazon S3 using the AWS Glue Data Catalog.
- Train and validate a fraud risk scoring ML model using local feature data and external offline feature store data.
- Use the offline feature store and local store for inference.
Dataset
To demonstrate this use case, we use a synthetic dataset with two tables: identity and transactions. They can both be joined by the TransactionID column. The transaction table contains information about a particular transaction, such as amount, credit or debit card, and so on, and the identity table contains information about the user, such as device type and browser. The transaction must exist in the transaction table, but might not always be available in the identity table.
The following is an example of the transactions dataset.
The following is an example of the identity dataset.
Let’s assume that across the organization, data science teams centrally manage the identity data and process it to extract features in a centralized offline feature store. The data warehouse team ingests and analyzes the transaction data in a Redshift table that they own.
We work through this use case to understand how the data warehouse team can securely retrieve the latest features from the identity feature group and join it with transaction data in Amazon Redshift to create a feature set for training and inferencing a fraud detection model.
Create the offline feature group and ingest data
To start, we set up SageMaker Feature Store, create a feature group for the identity dataset, inspect and process the dataset, and ingest some sample data. We then prepare the transaction features from the transaction data and store them in Amazon S3 for later loading into the Redshift table.
Alternatively, you can author features using Amazon SageMaker Data Wrangler, create feature groups in SageMaker Feature Store, and ingest features in batches using an Amazon SageMaker Processing job with a notebook exported from SageMaker Data Wrangler. This mode allows for batch ingestion into the offline store.
Let’s explore some of the key steps in this section.
- Download the sample notebook.
- On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
- Locate your notebook instance and choose Open Jupyter.
- Choose Upload and upload the notebook you just downloaded.
- Open the notebook sagemaker_featurestore_fraud_redshiftml_python_sdk.ipynb.
- Follow the instructions and run all the cells up to the Cleanup Resources section.
The following are key steps from the notebook:
- We create a Pandas DataFrame with the initial CSV data. We apply feature transformations for this dataset.
- We store the processed and transformed transaction dataset in an S3 bucket. This transaction data will be loaded later in the Redshift table for building the local feature store.
- Next, we need a record identifier name and an event time feature name. In our fraud detection example, the column of interest is TransactionID. EventTime can be appended to your data when no timestamp is available. In the notebook code, you can see how these variables are set, and then EventTime is appended to both features’ data.
- We then create and ingest the data into the feature group using the SageMaker SDK FeatureGroup.ingest API. This is a small dataset and therefore can be loaded into a Pandas DataFrame. When we work with large amounts of data and millions of rows, there are other scalable mechanisms to ingest data into SageMaker Feature Store, such as batch ingestion with Apache Spark.
- We can verify that data has been ingested into the feature group by running Athena queries in the notebook or running queries on the Athena console, as in the sample query after this list.
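For example, you can spot-check the offline store with an Athena query along the following lines. The table name is a placeholder: SageMaker Feature Store generates the Glue table name from the feature group name and your account, so substitute the name shown in your Data Catalog.

-- Illustrative Athena query against the offline store
-- (replace the table name with the one in your Data Catalog)
SELECT *
FROM "sagemaker_featurestore"."identity-feature-group-<table-suffix>"
LIMIT 10;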
At this point, the identity feature group is created in an offline feature store with historical data persisted in Amazon S3. SageMaker Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which enables us to run SQL queries against the offline data using Athena or Redshift Spectrum.
Create a Redshift table and load local feature data
To build a Redshift ML model, we create a training dataset by joining the identity data and transaction data using SQL queries. The identity data is in a centralized feature store where the historical set of records is persisted in Amazon S3. The transaction data is a local feature set for training and needs to be made available in the Redshift table.
Let’s explore how to create the schema and load the processed transaction data from Amazon S3 into a Redshift table.
- Create the customer_transaction table and load daily transaction data into the table, which you’ll use to train the ML model (see the sample statements after this list).
- Load the sample data by using a COPY command like the one shown after this list. Replace your Region and S3 path as appropriate. You will find the S3 path in the S3 Bucket Setup For The OfflineStore section in the notebook or by checking the dataset_uri_prefix in the notebook.
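The following is a minimal sketch of both statements. The column list is an assumption based on the transaction attributes described earlier (transaction ID, relative timestamp, amount, card type, and the fraud label), and the bucket, prefix, Region, and file format are placeholders; match them to the columns and files produced by the notebook.

-- Local feature table for transaction data (illustrative column list)
CREATE TABLE customer_transaction (
    transactionid   BIGINT,          -- join key shared with the identity feature group
    transactiondt   BIGINT,          -- relative transaction timestamp, used for the train/test split
    transactionamt  DECIMAL(18,2),   -- transaction amount
    card_type       VARCHAR(20),     -- credit or debit
    isfraud         INT              -- label: 1 = fraudulent, 0 = legitimate
);

-- Load the processed transaction data from Amazon S3
-- (replace the bucket, prefix, and Region, and adjust format options to your files)
COPY customer_transaction
FROM 's3://<your-bucket>/<dataset_uri_prefix>/transaction/'
IAM_ROLE default
FORMAT AS CSV
IGNOREHEADER 1
REGION '<your-region>';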
Now that we have created a local feature store for the transaction data, we focus on integrating a centralized feature store with Amazon Redshift to access the identity data.
Create an external schema for Redshift Spectrum to access the offline store data
We have created a centralized feature store for identity features, and we can access this offline feature store using services such as Redshift Spectrum. When the identity data is available through the Redshift Spectrum table, we can create a training dataset with feature values from the identity feature group and customer_transaction, joining on the TransactionId column.
This section provides an overview of how to enable Redshift Spectrum to query data directly from files on Amazon S3 through an external database in an AWS Glue Data Catalog.
- First, check that the identity-feature-group table is present in the Data Catalog under the sagemaker_featurestore database.
- Using Redshift Query Editor V2, create an external schema using the following command:
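A sketch of that command follows. It assumes the offline store’s Glue database is named sagemaker_featurestore, as shown in the Data Catalog, and that your default IAM role can access AWS Glue and the offline store’s S3 bucket.

CREATE EXTERNAL SCHEMA sagemaker_featurestore
FROM DATA CATALOG
DATABASE 'sagemaker_featurestore'
IAM_ROLE default;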
All the tables, including identity-feature-group external tables, are visible under the sagemaker_featurestore external schema. In Redshift Query Editor v2, you can check the contents of the external schema.
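For instance, you can list what Redshift Spectrum sees in the new schema by querying the SVV_EXTERNAL_TABLES system view:

SELECT schemaname, tablename
FROM svv_external_tables
WHERE schemaname = 'sagemaker_featurestore';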
- Run the following query to sample a few records—note that your table name may be different:
- Create a view to join the latest data from identity-feature-group and customer_transaction on the TransactionId column. Be sure to change the external table name to match your external table name:
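The following is a minimal sketch of both steps. The external table name and the identity feature columns (devicetype, browser, eventtime) are assumptions based on the dataset description, so substitute the table name from your external schema and the feature columns you intend to use. Because the offline store keeps a full history of feature values, the sketch uses a window function to keep only the latest record per TransactionID, and it creates a late-binding view because the view references a Redshift Spectrum table.

-- Sample a few records from the external table (your table name may differ)
SELECT *
FROM sagemaker_featurestore."identity-feature-group-<table-suffix>"
LIMIT 10;

-- Join the latest identity features with the local transaction features
CREATE VIEW public.credit_fraud_detect_v AS
WITH latest_identity AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY transactionid ORDER BY eventtime DESC) AS rn
    FROM sagemaker_featurestore."identity-feature-group-<table-suffix>"
)
SELECT t.transactionid,
       t.transactiondt,
       t.transactionamt,
       t.card_type,
       i.devicetype,
       i.browser,
       t.isfraud
FROM public.customer_transaction t
JOIN latest_identity i
  ON t.transactionid = i.transactionid
WHERE i.rn = 1
WITH NO SCHEMA BINDING;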
Train and validate the fraud risk scoring ML model
Redshift ML gives you the flexibility to specify your own algorithms and model types and also to provide your own advanced parameters, which can include preprocessors, problem type, and hyperparameters. In this post, we create a custom model by specifying AUTO OFF and the model type of XGBOOST. By turning AUTO OFF and using XGBoost, we provide the necessary inputs for SageMaker to train the model, which can result in faster training times. XGBoost is an open-source implementation of the gradient boosted trees algorithm. For more details on XGBoost, refer to Build XGBoost models with Amazon Redshift ML.
We train the model using 80% of the dataset by filtering on transactiondt < 12517618. The other 20% will be used for inference. A centralized feature store is useful in providing the latest supplemental data for training requests. Note that you will need to provide an S3 bucket name in the CREATE MODEL statement. It will take approximately 10 minutes to create the model.
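The statement below is a sketch of this step. The model, function, and view names, the feature columns, the isfraud target, and the hyperparameter values are illustrative assumptions (the view name matches the sketch from the previous section); adjust them to your own feature set and replace the S3 bucket with a bucket in your account.

-- Train an XGBoost binary classifier on roughly the first 80% of transactions
CREATE MODEL frauddetect_xgboost
FROM (
    SELECT transactionamt,
           card_type,
           devicetype,
           browser,
           isfraud
    FROM public.credit_fraud_detect_v
    WHERE transactiondt < 12517618
)
TARGET isfraud
FUNCTION ml_fn_frauddetect_xgboost
IAM_ROLE default
AUTO OFF
MODEL_TYPE XGBOOST
OBJECTIVE 'binary:logistic'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100')
SETTINGS (S3_BUCKET '<your-s3-bucket>');

You can monitor training with SHOW MODEL frauddetect_xgboost; once the model state is READY, the prediction function can be called in SQL for in-database inference.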