AWS Database Blog
How to Migrate Your Data Warehouse to Amazon Redshift Using the AWS Schema Conversion Tool Data Extractors
This blog post gives you a quick overview of how you can use the AWS Schema Conversion Tool (AWS SCT) data extractors to migrate your existing data warehouse to Amazon Redshift. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. Migrating your data warehouse to Amazon Redshift can substantially improve query and data load performance, increase scalability, and save costs. AWS provides the Schema Conversion Tool free of charge to help you migrate your legacy data warehouse to the cloud.
Customers usually use SCT to migrate their schema objects (table definitions, indexes, triggers, and other execution logic) from legacy data warehouse platforms. SCT has also been enhanced to migrate the actual data records. This feature, called the SCT data extractors, can migrate your data from any legacy platform (Oracle, Greenplum, Netezza, Teradata, Microsoft SQL Server, or Vertica) using a parallel, secure, and fault-tolerant architecture.
Typically, there are two stages to migrating a data warehouse. In the first stage, a full copy of the legacy data is replicated to the target system. We call this stage the “full load.” The full load can take a considerable amount of time depending on how much data there is to move. This being the case, any intervening changes are usually captured in a subsequent change data capture stage.
The SCT extractors are designed to facilitate the full load process, and they include two key features to support it:
- Quickly move large numbers of records using multi-threaded data extraction and load processes
- In case of failure, an extraction process can restart from the point of failure, rather than from the beginning of the entire data extract
In the second stage, SCT provides support for point-in-time data extracts so that “change deltas” since the full load can be captured and migrated in a second step.
Let’s take a look at the architecture of the SCT extractors and how they can help you accomplish the full data load process.
Architecture
You can install the SCT extractors on any server machine with connectivity to your source data warehouse. The SCT client application issues commands to the extractors to pull data from the source system, upload the data to Amazon S3, and copy the data into Amazon Redshift. You can install multiple extractors on the same machine, or separate them as needed across many machines. The best arrangement depends on your available server resources and network capacity. In general, a best practice is to locate the data extractors close to your data warehouse to reduce network traffic and latency.
The extractors provide a fully managed framework for your data warehouse migration. The data extractors write their extract files to local storage. The extract files are compressed, using lossless LZO compression, then copied to an S3 bucket. The files are compressed to reduce network transfer time and to reduce storage costs on S3. After the files are uploaded to S3, the extractor triggers Amazon Redshift to copy the data into the target tables using the standard Amazon Redshift COPY command. Each of these steps can be repeated if necessary with no unexpected consequences on the target database.
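SCT generates and runs the COPY command for you, so you don’t write it by hand. For context, the command it issues is conceptually similar to the following sketch; the bucket, key prefix, and IAM role are hypothetical placeholders, and the LZOP option assumes lzop-compressed extract files.

```sql
-- Illustrative sketch only: SCT issues the COPY on your behalf.
-- The bucket, key prefix, and IAM role are placeholder values.
COPY dbmaster.table_a
FROM 's3://your-extract-bucket/dbmaster/table_a/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
LZOP;
```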
For security purposes, SCT supports encrypted communication using Secure Sockets Layer (SSL) between the SCT client and the extractors. If you want to learn more about using SSL with the extractors, a companion blog post, How to Use SSL with the AWS Schema Conversion Tool Data Extractors, explains that process. However, before moving on to SSL, we recommend that you read this post first to understand the overall extractor architecture and setup.
Now that we know what the architecture looks like, let’s take a look at the steps involved in setting up the data extractors.
The scenario we work through in this post is migrating an Oracle data warehouse to Amazon Redshift. We assume that you’ve already converted your source schema to Amazon Redshift and applied it on the target database. We won’t try to cover all possible migration scenarios—our intent is to give you a deep dive into the technology so you can confidently manage your migration projects.
Our Oracle database is hosted in Amazon RDS, but the same migration considerations apply to on-premises sources. We’ll install two extractor agents running on Red Hat Linux on Amazon EC2 instances. We’ll only work through the configuration of one extractor; the process is the same for both. The servers are configured as EC2 t2.medium instances (two virtual CPUs, 4 GB of memory, and 100 GB of local storage). The instance size values can be tuned for your particular data volume, latency, and performance requirements.
The extractors are controlled by an SCT client running on a Mac laptop. Because the client offloads data processing chores to the extractors, a small desktop or laptop is perfectly suited to run the client.
A great feature of the SCT client is that it doesn’t need to be online while the data extractors are running. The SCT client can reconnect to the extractors at any time by opening the specific migration project. One client can manage many extraction projects. The data extractors continue to operate even if the SCT client is unreachable for any reason.
Getting Started
You need to install the following applications if they’re not already on your server. The links provided are for Linux downloads.
Application | Download link |
--- | --- |
unzip | https://oss.oracle.com/el4/unzip/unzip.html |
Oracle JDBC driver | http://www.oracle.com/technetwork/database/features/jdbc/default-2280470.html |
Amazon Redshift JDBC driver | http://docs.thinkwithwp.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver |
AWS CLI | https://aws.amazon.com/cli/ |
yum | http://yum.baseurl.org/ |
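Before moving on, a quick way to confirm that the command-line prerequisites are in place is to check their versions. These are ordinary verification commands, not part of SCT itself.

```bash
# Confirm that the command-line prerequisites are installed on the extractor server.
unzip -v | head -n 1
yum --version | head -n 1
aws --version
```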
You need to set up permissions in AWS Identity and Access Management (IAM) so that the extractors can read, write, and list the S3 bucket. Create two inline policies as follows. The first policy permits the extractor to read, write, and delete objects in the bucket. The second policy permits the extractor to list the contents of the bucket.
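The original post shows these policies as console screenshots. As a rough sketch, the two inline policies could look like the following, assuming a placeholder bucket name of your-extract-bucket; adjust the resources to match your own bucket. The first policy grants object-level access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExtractorObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-extract-bucket/*"
    }
  ]
}
```

The second permits listing the bucket contents:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExtractorBucketList",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::your-extract-bucket"
    }
  ]
}
```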
In the SCT client, you need to create a service profile. The service profile is how SCT inherits permissions you set up in IAM.
In IAM, create an AWS user with access permissions to the S3 bucket, if you don’t have one already. Copy and paste the access key and secret key for this user into the Global Settings section. Then, paste the name of the S3 bucket into the box shown and choose Save.
For the last setup item, we recommend that you have a SQL client available to query your Amazon Redshift target. SQLWorkbench, available on the AWS website, works great for querying your migrated data. You can download it from the following link.
http://docs.thinkwithwp.com/redshift/latest/mgmt/connecting-using-workbench.html.
Be sure to check the Autocommit box when you configure your Amazon Redshift connection.
In the SCT client, you connect to your source and target databases as shown following. We assume you’ve already migrated your source schema to the target database and are preparing to migrate your data.
Setup
Step 1: Install the Extractors
Download SCT from the AWS website to your local machine. Be sure to choose the correct .zip archive for your machine. Because we’re installing on Red Hat Linux, we choose the Fedora Linux download.
Now copy the .zip file to the machine you’re installing the extractors on and unzip the archive. If you can install the package using root (or sudo), as shown following, it simplifies setting file and application permissions.
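As a sketch, copying the archive to the server and unpacking it might look like this; the archive name and host below are placeholders for whatever you actually downloaded.

```bash
# Copy the downloaded SCT archive to the extractor server, then unzip it.
# The archive name and host are placeholders.
scp aws-schema-conversion-tool-1.0.latest.zip ec2-user@extractor-host:~/
ssh ec2-user@extractor-host
sudo unzip aws-schema-conversion-tool-1.0.latest.zip -d sct-install
```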
After unzipping the archive, you see the following files.
The installer files in the agents subfolder contain the data extractors. We are installing on Red Hat Linux, so we use the first (.rpm) file.
We use the yum package manager because it resolves package dependencies. You can use another package manager if one is available in your environment—for example, rpm.
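A minimal sketch of the install step follows; the .rpm file name shown is a placeholder, so use the actual file name from the agents subfolder.

```bash
# Install the extractor package; yum resolves any dependencies.
# The .rpm file name is a placeholder for the one in the agents subfolder.
sudo yum install -y agents/aws-schema-conversion-tool-extractor-1.0.latest.x86_64.rpm
```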
The installation process creates the directory /usr/share/aws/sct-extractor to contain the extractor configuration and executables.
Step 2: Configure the Extractors
Before configuring the extractors, you need to identify the following resources. Make a list of the item names and locations because they’re needed when you set up the extractors.
Shared file system
SCT can restart a task if the extraction server fails for any reason. This restart capability is optional, but we recommend it as a best practice because it enables a task to resume from the point of failure rather than reprocessing from the beginning.
To enable this feature, you need to create a shared file system and mount it on all servers that host extraction agents. Because our servers are running in Amazon EC2, we created an Amazon EFS file system for this purpose and NFS-mounted it on our servers. In the following, our EFS mount is named ffs-384bf071.efs.us-east-1.amazonaws.com.
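As a sketch, NFS-mounting the EFS file system on each extractor server looks something like the following; the mount point is an example, and you repeat the mount on every server that hosts an agent.

```bash
# Mount the shared EFS file system; repeat on every extractor server.
# The mount point /mnt/sct-shared is an example.
sudo mkdir -p /mnt/sct-shared
sudo mount -t nfs4 -o nfsvers=4.1 ffs-384bf071.efs.us-east-1.amazonaws.com:/ /mnt/sct-shared
```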
Now, run the sct-extract-setup.sh program to configure the extractor. The extractor runs as a listener on the port designated in the configuration. You specify this same port number in the SCT client when you register the listener. We use the same source database (Oracle), port number (8192), and trust and key store names and passwords that we specified earlier in the SCT client.
After completing the configuration, inspect the settings.properties file. It should look like this.
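The original post shows the file as a screenshot. The snippet below is only an illustration of the kind of settings it contains; the property names, values, and paths here are assumptions and may differ in your SCT version, so rely on what sct-extract-setup.sh actually writes.

```properties
# Illustration only; property names, values, and paths are assumptions
# and may differ by SCT version. Use the file generated by sct-extract-setup.sh.
vendor=ORACLE
port=8192
driver.jars=/home/ec2-user/oracle/ojdbc8.jar
redshift.driver.jars=/home/ec2-user/redshift/RedshiftJDBC42.jar
working.folder=/home/ec2-user
extractor.private.folder=/home/ec2-user/private
ssl.option=OFF
```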
Finally, we are ready to start the extractor process. Start the process using the systemctl command and verify that it is running.
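For example, on a systemd-based system the start and check might look like this; the unit name sct-extractor is an assumption, so confirm the actual name installed by the package.

```bash
# Start the extractor service and verify that it is running.
# The unit name sct-extractor is an assumption; check the installed unit if it differs.
sudo systemctl start sct-extractor
sudo systemctl status sct-extractor
```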
At this point, the configuration of the extractors is complete. The next step is to establish communication with the SCT client by registering the extractor.
Configure the SCT Client
Return to the SCT client and choose View, Data Migration View. Choose the Agents tab and choose Register to register a new agent. Fill in a name for the extractor, the server name, and the port it is listening on. Don’t check the Use SSL box, because we didn’t configure SSL in the extractor setup.
Verify the connection to the agent by choosing Test Connection. Then, choose Register to establish the connection to the extractor.
If the test is successful, the client can send commands to the extractor. You can corroborate this on the extractor server by inspecting the extractor logs in the Private directory (/home/ec2-user/private for this example).
The extractor is now ready for use, and we can start the actual data warehouse migration process.
Data Migration
At this point, the SCT client is connected to our source and target databases, and we have registered two extractors running on separate EC2 Linux servers. The Status field in the middle pane shows whether the extractor is available or not. A green A indicates the extractor is running and accepting commands. You can select the individual agents to see specific information as shown following.
We are migrating three tables from Oracle to Amazon Redshift. The row counts for each table are shown following. The first two tables are regular, nonpartitioned tables. The third table is range-partitioned into 10 approximately equal-size partitions.
Table | Row count | Partitioned |
--- | --- | --- |
A | 67,108,864 | No |
B | 3,199,999 | No |
C | 67,108,864 | Range-partitioned (10) |
To start the migration, you create a local task. Open the context (right-click) menu for the name of the source table you want to migrate, or right-click the Tables heading to migrate all tables. Because we’re going to migrate all three tables in the DBMASTER schema, we right-click Tables and then select “Create Local Task” to create a migration task.
We give the task a name and then select a migration mode. SCT gives you three migration options:
- Extract to local storage: This option writes the selected data to local storage on the extractor servers. The data files are compressed using LZO compression.
- Extract and upload to S3: This option uploads the LZO-compressed files to S3. Storing the data in S3 can provide a simple backup solution for the source database. It can also allow other teams to use the data outside of the database context.
- Extract, upload to S3, and copy into Amazon Redshift: This option triggers the ingestion of the data into Amazon Redshift using the Amazon Redshift COPY command.
Because we want to migrate the data from source to target, we choose the third option, Extract, Upload and Copy, to perform the full load to Amazon Redshift.
Choose Test Task to verify that the task has the necessary permissions to AWS resources, including the S3 bucket for extract file storage. When you choose Test Task, you might receive an error message that AWS DMS permissions are missing. You can ignore this error message. Otherwise, if this test is successful you’re ready to start the migration. Choose Create to create the task.
To start the migration, choose Start in the middle of the page.
The SCT client shows the progress of each step of the migration. On the extraction servers, the extractor logs show the progress of each agent.
One great thing about the extractors is that they use parallel threads throughout the extract, upload, and copy process. You can see this preceding, where all three phases of the process are running simultaneously.
Under the covers, SCT parallelizes the extracts by table and partition. Each table or partition consumes a single thread. The degree of parallelism is governed only by the number of threads available on the extraction servers. Depending on your server resources, you can scale out to more extraction servers or scale up to bigger ones to increase throughput.
When an extract finishes, SCT immediately triggers the compression and upload of that data to S3, and then the copy of that data into Amazon Redshift. SCT triggers each step on an extract-by-extract basis. In other words, as soon as a nonpartitioned table is extracted, SCT uploads the file to S3, regardless of the status of any other data extracts.
To see details on each subtask (extract, upload, and copy), choose Task, then choose the Subtasks tab. This view shows which tables are being processed by which extractors, and what phases they have completed or are currently processing.
To see detailed status for each task and subtask, see the logs in the Private directory on the extractor servers. The following are relevant snippets from the log file for Table A, showing the extract, upload, and copy processes.
And that’s it for setting up and running a migration. When the migration is done, the client shows all subtasks and tasks as 100 percent complete.
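As a quick sanity check (not part of SCT itself), you can compare row counts on the Amazon Redshift target against the source counts listed earlier; the schema and table names below follow this post’s example.

```sql
-- Compare target row counts against the source counts listed earlier.
-- Schema and table names follow this post's example (DBMASTER schema).
SELECT 'a' AS table_name, COUNT(*) AS row_count FROM dbmaster.a
UNION ALL
SELECT 'b', COUNT(*) FROM dbmaster.b
UNION ALL
SELECT 'c', COUNT(*) FROM dbmaster.c;
```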
Summary
Although this blog post is not intended to outline all possible steps or considerations that might be required for your particular situation, it should give you a good idea of how you can use the SCT data extractors to move your on-premises data warehouse to Amazon Redshift. Using the data extractors breaks down into three steps:
- Install the extractor software
- Configure and register the agents
- Start the migration and monitor its progress
We hope that this information gives you a head start before embarking on a data warehouse migration. Or, if you’re already using the extractors, we hope it gives you a deeper understanding of how the system works under the covers.
Happy migrating!
About the Author
Michael Soo is a database engineer at Amazon Web Services.