AWS Big Data Blog
Automate ETL jobs between Amazon RDS for SQL Server and Azure Managed SQL using AWS Glue Studio
Nowadays many customers are following a multi-cloud strategy. They might choose to use various cloud-managed services, such as Amazon Relational Database Service (Amazon RDS) for SQL Server and Azure SQL Managed Instances, to perform data analytics tasks, but still use traditional extract, transform, and load (ETL) tools to integrate and process the data. However, traditional ETL tools may require you to develop custom scripts, which makes ETL automation difficult.
In this post, I show you how to automate ETL jobs between Amazon RDS for SQL Server and Azure SQL Managed Instances using AWS Glue Studio, which is part of AWS Glue, a fully managed, serverless data integration service. AWS Glue Studio has a graphical interface that makes it easy to create, run, and monitor ETL jobs, and you can create a schedule to run your jobs at specific times.
Solution overview
To move data from one database to another, various services are available, either on premises or in the cloud. These services differ in bandwidth limits, support for ongoing changes (change data capture, or CDC), handling of schema and table modifications, and other features. Beyond moving the data, we also need to apply advanced transformations and monitor and automate the ETL jobs. This is where AWS Glue Studio can help us facilitate these activities.
As shown in the following diagram, we use AWS Glue Studio as the middleware to pull data from the source database (in this case an Azure SQL Managed Instance), then create and automate the ETL job using one of the pre-built transformations in AWS Glue Studio. Finally, we load the data to the target database (in this case an RDS for SQL Server instance).
The solution workflow consists of the following steps:
- Create connections for the source and target databases.
- Create and run AWS Glue crawlers.
- Create and run an ETL job that transforms the data and loads it from source to target.
- Schedule the ETL job to run automatically.
- Monitor the ETL job.
Prerequisites
Complete the following prerequisite steps:
- Install SQL Server Management Studio (SSMS) or an equivalent client tool.
- Set up a VPN connection between Amazon Virtual Private Cloud (Amazon VPC) and the Azure private subnet.
- Create a security group for the AWS Glue elastic network interfaces (ENIs) in your VPC.
- Create an AWS Identity and Access Management (IAM) role for AWS Glue. For instructions, refer to Setting up IAM permissions for AWS Glue.
- Open the appropriate firewall ports in the Azure private subnet.
- Create a source database table (Azure SQL Managed Instance). You can deploy the Azure database instance using the following QuickStart. For testing purposes, I import the public AdventureWorks sample database and use the dbo.Employee table. See the following code:
--Query table
SELECT * FROM [AdventureWorksLT2019].[dbo].[Employee]
- Create the target database table (Amazon RDS for SQL Server). To deploy the RDS instance, refer to Create and Connect to a Microsoft SQL Server Database with Amazon RDS. You can create an empty database and table with the following statements. This is the table where the data coming from Azure will be stored.
--Create database
CREATE DATABASE AdventureWorksonRDS;
GO

--Create table
USE AdventureWorksonRDS;
GO
CREATE TABLE Employee
(EmpID INT NOT NULL,
EmpName VARCHAR(50) NOT NULL,
Designation VARCHAR(50) NULL,
Department VARCHAR(50) NULL,
JoiningDate DATETIME NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED (EmpID));
GO

--Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee]
Create connections
The first step is to populate our AWS Glue Data Catalog with the schema information coming from our source and target data sources.
To do that, we first create connections. A connection is a Data Catalog object that stores connection information for a particular data store. Connections store login credentials, URI strings, VPC information, and more. Creating connections in the Data Catalog saves the effort of having to specify the connection details every time you create a crawler or job.
Create a connection for Azure SQL Managed Instance
To create the connection to our source database, complete the following steps:
- On the AWS Glue console, choose AWS Glue Studio.
- In the navigation pane of the AWS Glue Studio console, choose Connectors.
- Choose Create connection.
- For Name, enter AzureSQLManaged.
- For Connection type, choose JDBC.
- For JDBC URL, use the SQL Server syntax jdbc:protocol://host:port;database=db_name.
You can find the host and database name on the Azure SQL Managed Instance service console, on the Overview page. For this specific example, we use the following information for our Azure SQL Managed Instance:
- Protocol – sqlserver
- Host – adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net
- Port – 3342
- Database name – AdventureWorksLT2019
Putting these together gives the JDBC URL jdbc:sqlserver://adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net:3342;database=AdventureWorksLT2019.
- Enter your user name and password.
- Choose Create connection.
Create a connection for Amazon RDS for SQL Server
To create a connection for our target data source, complete the following steps:
- On the AWS Glue Studio console, choose Connectors in the navigation pane.
- Choose Create connection.
- For Name, enter AWSRDSforSQL.
- For Connection type, choose Amazon RDS.
- For Database engine, choose Microsoft SQL Server.
- For Database instances, choose your RDS DB instance.
- For Database name, enter AdventureWorksonRDS.
- Enter your user name and password.
- Choose Create connection.
You can now see the two connections created in the Connections section.
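If you prefer to script this step instead of using the console, you can create the same connections with the AWS SDK. The following is a minimal boto3 sketch for the Azure source connection; the Region, subnet, security group, and credentials are placeholder values you would replace with your own.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

# JDBC connection for the Azure SQL Managed Instance source
glue.create_connection(
    ConnectionInput={
        "Name": "AzureSQLManaged",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": (
                "jdbc:sqlserver://adi-qa-sql-managed-instance-test.public."
                "xxxxxxxxxxxx.database.windows.net:3342;database=AdventureWorksLT2019"
            ),
            "USERNAME": "your_user",      # placeholder
            "PASSWORD": "your_password",  # placeholder
        },
        # VPC details so the AWS Glue ENIs can reach the VPN to Azure
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",           # placeholder
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
            "AvailabilityZone": "us-east-1a",                 # placeholder
        },
    }
)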
Create and run AWS Glue crawlers
You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the most common method AWS Glue users rely on. A crawler can crawl multiple data stores in a single run. Upon completion, it updates the Data Catalog with the tables it found. The ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
Create a crawler for Azure SQL Managed Instance
To create a crawler for our source database, complete the following steps:
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Choose Create crawler.
- If the data hasn’t been mapped into an AWS Glue table, select Not yet and choose Add a data source.
- For Data source, choose JDBC.
- For Connection, choose AzureSQLManaged.
- For Include path, specify the path of the database including the schema: AdventureWorksLT2019/dbo/%.
- Choose Add a JDBC data source.
- Choose Next.
- Choose the IAM role created as part of the prerequisites and choose Next.
- Choose Add database to create the target database in the AWS Glue Data Catalog.
- For Name, enter azuresqlmanaged_db.
- Choose Create database.
- For Target database, choose azuresqlmanaged_db.
- Choose Next.
- Review if everything looks correct and choose Create crawler.
Create a crawler for Amazon RDS for SQL Server
Repeat the crawler creation steps to create the crawler for the target RDS for SQL Server database, using the following information:
- Crawler name – AmazonRDSSQL_Crawler
- Data source – JDBC
- Connection – AWSRDSforSQL
- Include path – AdventureWorksonRDS/dbo/%
- IAM role – AWSGlueServiceRoleDefault
- Database name – amazonrdssql_db
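Crawler creation can also be scripted. The following boto3 sketch creates and starts the target crawler with the values above; the Region is an assumed placeholder.

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

# Equivalent of the console steps: a JDBC crawler over the RDS target database
glue.create_crawler(
    Name="AmazonRDSSQL_Crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="amazonrdssql_db",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "AWSRDSforSQL",
                "Path": "AdventureWorksonRDS/dbo/%",
            }
        ]
    },
)
glue.start_crawler(Name="AmazonRDSSQL_Crawler")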
Run the crawlers
Now it’s time to run the crawlers.
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Select the crawlers you created and choose Run.
- When the crawlers are complete, choose Databases in the navigation pane. Here you can find the databases discovered by the crawlers.
- Choose Tables in the navigation pane and explore the tables discovered by the crawlers, which correctly identified the data type as SQL Server.
- Choose the table adventureworkslt2019_dbo_employee and review the schema created for the data source.
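You can also review the crawled schema programmatically by querying the Data Catalog with boto3, as in this sketch (Region assumed):

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

# Print the columns the crawler registered for the source table
table = glue.get_table(
    DatabaseName="azuresqlmanaged_db",
    Name="adventureworkslt2019_dbo_employee",
)["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])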
Create and run an ETL job
Now that we have crawled our source and target databases, and we have the data in the AWS Glue Data Catalog, we can create an ETL job to load and transform this data.
- On the AWS Glue Studio console, choose Jobs in the navigation pane.
- Select Visual with a blank canvas to use a visual interface to create our ETL jobs.
- Choose Create.
- On the Source menu, choose AWS Glue Data Catalog.
- On the Data source properties tab, specify the database and table (for this post, azuresqlmanaged_db and adventureworkslt2019_dbo_employee).
- On the Transform menu, choose Apply mapping to map the source fields to the target database.
- On the Transform tab, you can see the data fields to be loaded, and you even can drop some of them if needed.
- On the Target menu, choose AWS Glue Data Catalog.
- On the Data target properties tab, choose the database and table where you want to load the transformed data (for this post, amazonrdssql_db and adventureworksrds_dbo_employee).
- On the Job details tab, for Name, enter ETL_Azure_to_AWS.
- For IAM Role, choose the appropriate role.
- Choose Save.
- Choose Run to run the job.
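Behind the visual editor, AWS Glue Studio generates a PySpark script for the job. The following is a simplified sketch of what such a script looks like for this job; the column names come from the Employee table, but the exact mappings and types in your generated script may differ.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the crawled Azure SQL Managed Instance table
source = glueContext.create_dynamic_frame.from_catalog(
    database="azuresqlmanaged_db",
    table_name="adventureworkslt2019_dbo_employee",
)

# Transform: map source fields to the target schema (types assumed)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("EmpID", "int", "EmpID", "int"),
        ("EmpName", "string", "EmpName", "string"),
        ("Designation", "string", "Designation", "string"),
        ("Department", "string", "Department", "string"),
        ("JoiningDate", "timestamp", "JoiningDate", "timestamp"),
    ],
)

# Target: the crawled RDS for SQL Server table
glueContext.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="amazonrdssql_db",
    table_name="adventureworksrds_dbo_employee",
)
job.commit()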
If the ETL job ran successfully, it should have loaded the data from the source database (Azure SQL) into the target database (Amazon RDS for SQL Server). To confirm, connect to the target database using SQL Server Management Studio (SSMS) and query the previously empty AdventureWorksonRDS database and dbo.Employee table. It should now contain the data coming from the Azure SQL Managed Instance.
--Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee]
Schedule your ETL job
In AWS Glue Studio, you can create a schedule to have your jobs run at specific times. Note that each scheduled run reimports the full dataset; to load only new data incrementally, you can use AWS Glue job bookmarks. You can schedule your ETL jobs on an hourly, daily, weekly, monthly, or custom basis, depending on your needs. To schedule a job, complete the following steps:
- On the AWS Glue Studio console, navigate to the job you created.
- On the Schedules tab, choose Create schedule.
- For Name, enter a name (for example, dbo_employee_daily_load).
- Choose your preferred frequency, start hour, and minute of the hour. For this post, we schedule it daily at 3:00 UTC.
- For Description, enter an optional description.
- Choose Create schedule.
Confirm on the Schedules tab that the schedule was created and activated successfully.
You have now automated your ETL job to run at your desired frequency.
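Schedules like this can also be defined programmatically as scheduled triggers. Here is a minimal boto3 sketch, assuming the job and schedule names used in this post and a placeholder Region:

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

# A scheduled trigger that starts the job daily at 3:00 UTC
# (AWS Glue cron syntax: cron(Minutes Hours Day-of-month Month Day-of-week Year))
glue.create_trigger(
    Name="dbo_employee_daily_load",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "ETL_Azure_to_AWS"}],
    StartOnCreation=True,
)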
Monitor your ETL job
The job monitoring dashboard provides an overall summary of the job runs, with totals for the jobs with a status of Running, Canceled, Success, or Failed.
The Runs tab shows the jobs for the specified date range and filters. You can filter the jobs on additional criteria, such as status, worker type, job type, and job name.
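You can also check run status outside the console. For example, this boto3 sketch lists recent runs of the job (Region assumed):

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed Region

# Print the ID, state, and start time of recent runs of the ETL job
for run in glue.get_job_runs(JobName="ETL_Azure_to_AWS", MaxResults=10)["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("StartedOn"))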
Conclusion
In this post, I went through the steps to automate ETL jobs using AWS Glue Studio, which is a user-friendly graphical interface to perform data integration tasks such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes. You can easily find and access this data using the AWS Glue Data Catalog. Data engineers and ETL developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio.
About the author
Daniel Maldonado is an AWS Solutions Architect, specializing in Microsoft workloads and big data technologies, and focused on helping customers migrate their applications and data to AWS. Daniel has over 13 years of experience working with information technologies and enjoys helping clients reap the benefits of running their workloads in the cloud.