AWS Database Blog
Better Together: Amazon SageMaker Canvas and RDS for SQL Server, a predictive ML model sample use case
Artificial intelligence (AI) and machine learning (ML) solutions are rapidly gaining traction worldwide, capturing the imagination of users and customers across various industries. As businesses strive to integrate AI/ML capabilities into their customer-facing services and solutions, they often face the challenge of leveraging massive amounts of relational data hosted on on-premises SQL Server databases.
This post showcases how Amazon Relational Database Service (Amazon RDS) for SQL Server and Amazon SageMaker Canvas can work together to address this challenge. By leveraging the native integration points between these managed services, you can develop integrated solutions that use existing relational database workloads to source predictive AI/ML models with minimal effort and no coding required.
Solution overview
Amazon RDS for SQL Server simplifies the process of setting up, operating, and scaling SQL Server deployments in the AWS Cloud. You can deploy multiple editions of SQL Server, including Express, Web, Standard, and Enterprise, in minutes with cost-efficient and resizable compute capacity. Amazon RDS frees you from time-consuming database administration tasks, allowing you to focus on application development.
Amazon SageMaker brings together a broad set of tools to enable high-performance, low-cost ML for any use case. With Amazon SageMaker Canvas, you can create highly accurate machine learning models without any machine learning experience or writing a single line of code. These AWS native services are specifically designed to help you develop integrated solutions that can leverage existing relational database workloads as the source for predictive AI/ML models.
To support native integration between SageMaker Canvas and Amazon RDS, we need to deploy the SageMaker domain in a VPC-only mode. When the SageMaker Canvas application runs in this mode, it can interact with other AWS services using VPC endpoints created in a customer-managed Virtual Private Cloud (VPC) without public internet access. SageMaker Canvas applications can access these VPC endpoints through a network interface created in Amazon SageMaker Studio, providing connectivity to the customer-managed VPC.
This architectural approach ensures that the RDS for SQL Server instance and the containers running the SageMaker Canvas application are hosted on the same VPC, with direct connectivity, as illustrated in the following diagram.
Prerequisites
We assume you have familiarity with navigating the AWS Management Console. For this post, you also need the following resources and services enabled in your AWS account:
- An RDS for SQL Server instance.
- A database backup to restore to the RDS for SQL Server instance. For this post, we use the following sample database backup.
- A customer-managed VPC to host the SageMaker domain and the RDS for SQL Server instance. Make sure that this VPC has at least two private subnets on different Availability Zones and no internet access. Refer to Choose an Amazon VPC for additional information about choosing a VPC when you onboard to a SageMaker domain.
- A designated security group within your VPC.
Even though there is a free tier available for SageMaker Canvas that should cover the deployment and implementation of this use case, you should be mindful of ongoing service charges. For additional details, refer to Amazon SageMaker Canvas pricing.
You may refer to Importing and exporting SQL Server databases using native backup and restore documentation to perform a database restore into an Amazon RDS for SQL Server instance.
Configure SageMaker Canvas
Follow the instructions in Configure Amazon SageMaker Canvas in a VPC without internet access to configure your VPC, SageMaker domain, required service endpoints, and AWS Identity and Access Management (IAM) permissions to support running SageMaker Canvas in a VPC without internet access. In this section, we provide detailed instructions on how to onboard a VPC-only SageMaker domain, if needed. Otherwise, skip to the steps for creating the predictive ML model.
Onboard the VPC-only SageMaker domain
Before starting with the onboarding process, verify that the chosen VPC to onboard the SageMaker Domain meets the following requirements:
- The VPC spans across two or more Availability Zones
- The VPC contains no public subnets
- The VPC contains no NAT Gateways
- The VPC contains at least two private subnets
- The VPC has one S3 Gateway fully configured
When the preceding requirements are met, proceed with the following steps:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose Create domain.
- Choose Set up for organizations, then choose Set up.
- For Domain details, provide an appropriate domain name, then choose Next.
- For Users and ML activities, select appropriate authentication method. For this post, we select Login through IAM.
You can’t change the authentication method after you create the domain.
- For Who will use SageMaker?, select Add user.
- In the pop-up, enter a user profile name and choose Add.
- For What ML activities will users perform?, select Create a new role and keep all default selections for the ML activities.
- For S3 Bucket Access, enter a unique name. For this post, we enter
sagemaker-canvas-us-west-2-repository-xxxxxxx
. - Choose Next.
- Validate that the option SageMaker Studio – New is already selected for the SageMaker Studio application.
- Keep all the application configuration settings as default, then choose Next.
- For How do you want to connect to other AWS services?, select Virtual Private Cloud (VPC) Only.
- For Which VPC do you want to use for Studio?, enter your VPC, subnets, and security groups.
- Choose Next.
Make sure the security group you specified is also attached to the RDS for SQL Server instance used as our source in the demo. This security group should also be attached to the service endpoints in the VPC.
- For Storage, keep all the settings as default and choose Next.
- Review all the selections during the onboarding process and then choose Submit to onboard the new SageMaker domain.
Create the SageMaker predictive ML model
After the VPC and SageMaker domain are configured, complete the following steps to create a SageMaker predictive ML model:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose an appropriate SageMaker domain.
- On the Launch menu, choose Canvas.
This will launch the SageMaker Canvas application in a new browser tab. Launching the application for the first time can take a few minutes for new user profiles to complete initializing several required resources.
To address any connectivity issue, make sure that the security group used for all the service endpoints created in the VPC matches the security group used to onboard the SageMaker domain. If all resources were properly initialized and SageMaker Canvas successfully launched, a landing page similar to the one shown in the following screenshot will appear on a new tab.
- Choose My Models in the SageMaker Canvas navigation pane.
- Choose New model.
- For Model name, enter a name (for example, Customer Churn).
- For Problem type, select Predictive analysis.
- Choose Create.
- Choose Create dataset.
- For Dataset name, enter a name (for example,
rds-sql-sample-churn-training
). - Choose Create.
- For the data source, choose RDS.
- Choose Add connection.
- For Connection name, enter a name for the RDS connection.
- For Instance, choose your RDS for SQL Server instance.
- Enter your user credentials.
- Choose Add connection.
- Expand the
sagemaker_ml_sample
database node in the navigation pane, then expand the dbo schema node to visualize available tables in the database schema. - Drag the
customer_churn_train
table and then drop it in the main dataset section.
- Review the provided data sampling, then choose Create dataset.
SageMaker will process the data and create the new dataset. Upon completion, the new dataset will show in the list of available datasets.
- Select the new dataset from the list and choose Select dataset.
As soon as SageMaker Canvas completes importing this dataset into the domain, some data analysis has already been done and is visible on the page: column data type, column feature type (SageMaker Canvas recognizes categorical feature types, which represent distinct categories or groups, as well as binary feature types that have two possible values, such as true/false), column missing or mismatched values (SageMaker Canvas automatically infers the missing or mismatched values), number of unique values, and mode. Next, we need to select the column we are targeting to predict with this model.
- On the Build tab, choose Churn for Target column.
This will populate the value distribution section and automatically select a 2 category prediction model.
- Choose Data visualizer.
The data visualizer module provides a powerful set of data visualization and analysis tools, which can come in handy while we are preparing training (or testing) data for an ML model.
- As a sample visualization scenario, let’s take the following actions:
- Select the bar chart visualization option.
- Drag and drop the
Day_Calls
column on the Y axis of the chart - Drag and drop the State column on the X axis of the chart.
As a result, we should see a bar chart similar to the one in the following screenshot. You can explore many other data visualization and analysis scenarios in this module.
- Choose Preview model and wait for the process to complete.
- Choose the Analytics tab to preview the model analytics.
A correlation matrix is presented on this tab, showcasing how the dataset cells are correlated to each other—particularly (for our use case), Churn compared to other columns in the dataset.
- Choose Quick build and wait for the process to complete.
When the model build is complete (2–15 minutes), an initial report is presented that includes the model accuracy and a selection of data visualization charts grouped on the Overview tab.
- Choose the
Day_Mins
(day minutes) column and choose True for the churn value, or prediction.
You should see a chart similar to the one in the following screenshot. This column is a strong customer churn predictor; the higher the number of day minutes, the higher the probability of experiencing customer churn. You can choose the Scoring tab to dive deeper into how accurate our ML model is when compared against our own training dataset.
- On the Scoring tab, choose the True segment of the chart.
The model accuracy shows how 90.141% of the time the model is correct when the customer churn prediction is true.
- Choose Predict to predict target values.
- Choose Batch prediction to generate predictions based on a test dataset.
- Choose Manual (the Automatic option is designed to launch predictions in the future based on a specific schedule).
- Select the appropriate customer churn testing dataset sourced from the sample RDS for SQL Server database and choose Generate predictions.
To generate the testing dataset, follow the same instructions described earlier to import the training dataset.
- When the ML model predictions generation process is complete, choose the View link displayed in the pop-up message on the bottom of the screen
We can observe the True customer churn predictions on the first (95.5%), seventh (79.6%), eighth (75.8%), and twelfth (86.8%) rows. These are great customer candidates to follow-up on and try to prevent churn. Downloading these predictions and analyzing the results in more detail is also recommended.
Clean up
To avoid future charges and remove the components you created while testing this use case, complete the following steps:
- On the Amazon RDS console, choose Databases in the navigation pane.
- Select the database you set up, and on the Actions menu, choose Delete.
- On the SageMaker console, choose Domains and choose your domain.
- Choose Edit.
- Choose Delete domain.
Conclusion
In this post, we showed how to integrate SageMaker Canvas and Amazon RDS for SQL Server to deliver a predictive ML model with minimal effort and no coding. We were able to source our training and testing datasets from Amazon RDS for SQL Server with minimal configuration, onboard a SageMaker domain, visualize and analyze the data, create a predictive model, build it using our training dataset, and generate some sample predictions using our test dataset, without having to write a single line of code.
This use case highlights how two native AWS services can work cohesively together to streamline the machine learning process.
If you have any feedback or questions, leave it in the comments section.
About the Authors
Camilo Leon is a Principal Solutions Architect at AWS specialized on databases and based off San Francisco, California. He works with AWS customers to provide architectural guidance and technical support for the design, deployment and management of their AWS relational database workloads and business applications. In his spare time, he enjoys mountain biking, photography, and movies.
Minesh Chande is Senior Database Specialist Solutions Architect at Amazon Web Services. He helps customers across different industry verticals design, migrate, modernize and optimize their SQL Server workloads to a managed database platform like Amazon RDS, Amazon RDS Custom and Babelfish for Aurora PostgreSQL.