A hybrid approach for homogeneous migration to an Amazon DocumentDB elastic cluster

Today, customers use document databases for many different types of applications. For example, gaming clients use them for handling users’ attribute information, while a stock application employs a document-oriented database to store chronological quote data. As the number of documents grows over time, you need more compute and storage than what is traditionally offered through a single cluster, and managing a sizable database becomes challenging. To tackle this issue, a more scalable architecture is required to handle the increased workload in a document database.

In this post, we share a method for migrating a non-sharded document database to an Amazon DocumentDB (with MongoDB compatibility) elastic cluster.

Overview of Amazon DocumentDB elastic clusters

Amazon DocumentDB is a scalable, highly durable, and fully managed database service for operating mission-critical JSON workloads. Amazon DocumentDB elastic clusters support workloads with millions of reads/writes per second and petabytes of storage capacity, as shown in the following diagram. An elastic cluster provides up to 32 shards. With horizontal sharding, you can split a large collection across multiple shards, which increases the total storage and compute capacity available to it. For more details, refer to Amazon DocumentDB elastic clusters: how it works.

Solution overview

Migrating a large database with minimum downtime has its challenges:

You should complete the full load in a shorter time
You must make the incremental replication fast and stable, because the source database’s workload is running

In this post, we use a hybrid approach to address these two challenges. The hybrid approach uses the mongodump and mongorestore tools to migrate your data from your source Amazon DocumentDB instance-based cluster to your Amazon DocumentDB elastic cluster, and uses AWS Database Migration Service (AWS DMS) in change data capture (CDC) mode to replicate changes. If you need an online approach using AWS DMS, refer to the blog. mongodump is a tool that dumps data from the Amazon DocumentDB database and stores it as BSON files on local disk, which you can then restore to the target Amazon DocumentDB cluster with the mongorestore tool.

For the CDC phase, we use AWS DMS to migrate the ongoing changes from the source to the target Amazon DocumentDB elastic cluster. AWS DMS version 3.5.1 began to support bulk applying into Amazon DocumentDB (including instance-based clusters and elastic clusters), which can accelerate the CDC progress. For more details, refer to Target metadata task settings.

The speed of index migration is also important. We recommend migrating Amazon DocumentDB indexes using amazon-documentdb-tools.

The following diagram illustrates the solution architecture.

The solution uses an Amazon DocumentDB instance-based cluster as the source and an Amazon DocumentDB elastic cluster as the target database. The migration steps are as follows:

Use mongodump to dump data from the source database. The database can still be open for reads and writes.
Create the sharding database and collections in an Amazon DocumentDB elastic cluster.
Migrate indexes.
Restore data using mongorestore.
Set up and run an AWS DMS parallel CDC task.
Change the application endpoint to your Amazon DocumentDB elastic cluster.

Prerequisites

To follow along with the examples in this post, you must complete the following prerequisites:

Create an EC2 instance for running database tools.
Install the database tools.
Create an Amazon DocumentDB elastic cluster.
Enable change stream and enlarge the retention duration on the source cluster.

Create an EC2 instance for running database tools

You need to create an EC2 instance for the database tools. It is mainly used to run mongo-database-tools and DocumentDB-tools and to store the BSON files dumped from the source database (we call it EC2-for-tools for the remainder of this post). Because the BSON files take up a lot of disk space, you need to mount a larger data disk on the EC2 instance.

To deploy the EC2 instance, we provide an AWS CloudFormation template. For instructions to create the CloudFormation stack, see Creating a stack on the AWS CloudFormation console.

After the EC2 instance is deployed, make sure the file system’s size is enough to accommodate a backup.
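One way to check the available space before starting the dump is a small script such as the following. It is a minimal sketch: the function name and the 1.5x headroom factor are assumptions made for illustration, not values recommended by the tools.

```python
import shutil


def has_enough_space(path: str, expected_dump_bytes: int, headroom: float = 1.5) -> bool:
    """Return True if the file system holding `path` can accommodate the dump.

    `headroom` is an assumed safety factor, not a documented requirement.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= expected_dump_bytes * headroom


# Example call (hypothetical size; use the storage size of your source collection):
#   has_enough_space("/backup", 500 * 1024**3)
```

Pass the directory you plan to use as the mongodump output directory, so the check runs against the file system that receives the BSON files.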

Install the database tools

To install the database tools, refer to Installing the Database Tools on Linux. The version of the database tools has to be mongodb-database-tools-amazon2-x86_64-100.6.1, because older versions of mongorestore do not support restoring BSON files to an Amazon DocumentDB elastic cluster.

Create an Amazon DocumentDB elastic cluster

You can use an existing Amazon DocumentDB elastic cluster or create an elastic cluster.

Enable the change stream and enlarge the log retention on the source DocumentDB

The change stream on the source Amazon DocumentDB cluster must be enabled, and its log retention duration must be long enough to hold the data changes made during mongodump and mongorestore, because the AWS DMS CDC task relies on the change stream to replicate changes to the Amazon DocumentDB elastic cluster. For instructions, refer to enabling change streams and modifying the change stream log retention duration.

Now you’re ready to start your migration.

Use mongodump to dump data from the Amazon DocumentDB instance-based cluster

Go to the mongo-tools directory and run mongodump to export data using the following code:

mongodump -h ip-172-xxxx.ec2.internal:27017 -u <YourUser> -p <YourPassword> -d <YourDatabase> -c <YourCollection> -o /backup > dump.out 2>&1

The command options and their functions are as follows:

-d – Specifies the database name
-c – Specifies the collection name
-o – Specifies the dump output directory

You can view detailed options through the command mongodump --help.

Keep monitoring the dump.out file for the mongodump progress. When a message like done dumping <database name>.<collection name> appears, it means the backup is complete.
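If you prefer not to watch the file by hand, a short script can scan the log for the completion message. The following is a sketch only: the regular expression is based on the message shown above, and the exact wording of mongodump log lines may differ between tool versions.

```python
import re
from typing import List, Tuple

# Matches "done dumping <database>.<collection>"; database names cannot contain dots.
DONE_PATTERN = re.compile(r"done dumping ([^.\s]+)\.(\S+)")


def finished_collections(log_text: str) -> List[Tuple[str, str]]:
    """Return the (database, collection) pairs that the log reports as fully dumped."""
    return [match.groups() for match in DONE_PATTERN.finditer(log_text)]


if __name__ == "__main__":
    # Sample log lines written for this example, not captured from a real run.
    sample = (
        "2023-04-12T04:27:02.000+0000\twriting mydb.quotes to /backup/mydb/quotes.bson\n"
        "2023-04-12T04:41:10.000+0000\tdone dumping mydb.quotes (1000000 documents)\n"
    )
    print(finished_collections(sample))
```

To use it against a real run, read the contents of dump.out and pass them to finished_collections.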

The source database can still be open for both read and write operations during this step. AWS DMS will replicate any changes made after the export command.

It’s necessary to record the UTC time when the dump was started (for example, 2023-04-12T04:27:01). You set the CDC start time for the AWS DMS CDC task based on this timestamp.
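A simple way to capture that timestamp in the format shown above is to print it right before launching mongodump. This is a minimal sketch; the function name is our own.

```python
from datetime import datetime, timezone


def utc_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime in UTC, for example 2023-04-12T04:27:01."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


if __name__ == "__main__":
    # Run this immediately before starting mongodump and keep the printed value.
    print(utc_timestamp(datetime.now(timezone.utc)))
```

Recording the time slightly before the dump starts is the safer side to err on, because the CDC task then replays a few extra changes instead of missing some.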

Create the sharding database and collections in the Amazon DocumentDB elastic cluster

Because we want to convert a collection from the instance-based cluster into a sharded collection, we must design the shard key and create the sharded collection in the Amazon DocumentDB elastic cluster in advance. Otherwise, mongorestore will import the data into only one of the shards and won’t achieve horizontal scalability as expected. Complete the following steps:

Connect to your Amazon DocumentDB elastic cluster.
Create a database with the following syntax:

use <databaseName>

Create a collection with the following syntax:

sh.shardCollection( "<databaseName>.<collectionName>", { "_id": "hashed" } )

We choose the default _id field (an ObjectId) as the shard key and distribute documents across shards by hashing it. As of this writing, elastic clusters only support hashed shard keys.
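To see why a hashed key spreads sequential values evenly, consider the following illustration. It is a conceptual sketch only: it uses MD5 from the Python standard library and a simple modulo, which is an assumption made for the example and not the hash function or placement logic that Amazon DocumentDB uses internally.

```python
import hashlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard number by hashing it (illustrative, not DocumentDB's algorithm)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    # Sequential keys, similar in spirit to monotonically increasing _id values.
    counts = Counter(shard_for(f"doc-{i:08d}", 3) for i in range(30_000))
    for shard, count in sorted(counts.items()):
        print(f"shard {shard}: {count} documents")
```

Even though the keys are generated in order, each of the three hypothetical shards receives roughly a third of the documents, which is the behavior you want for write-heavy collections.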

Migrate indexes

Before you import any data, you should use the Amazon DocumentDB index tool to migrate indexes to the target database. Before you migrate indexes, you must check whether you are using the following indexes, which are not supported in Amazon DocumentDB elastic clusters:

Sparse indexes
TTL indexes
Geospatial indexes
Background index create

If they’re in the source database, you have to modify the applications. For more details, refer to Limitations.
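The check can be scripted against the index definitions of the source database. The following sketch assumes each index is a dictionary shaped like the documents a driver returns when listing indexes (for example, pymongo's Collection.list_indexes()); the function name and the labels it returns are our own.

```python
from typing import Any, Dict, List

# Index key values that mark a geospatial index.
GEO_TYPES = {"2d", "2dsphere"}


def unsupported_features(index: Dict[str, Any]) -> List[str]:
    """List the features of an index definition that appear in the limitations above."""
    reasons = []
    if index.get("sparse"):
        reasons.append("sparse")
    if "expireAfterSeconds" in index:
        reasons.append("TTL")
    if any(value in GEO_TYPES for value in index.get("key", {}).values()):
        reasons.append("geospatial")
    if index.get("background"):
        reasons.append("background build")
    return reasons


if __name__ == "__main__":
    # Hypothetical index definition used only to demonstrate the check.
    example = {"name": "loc_2dsphere", "key": {"loc": "2dsphere"}, "sparse": True}
    print(example["name"], unsupported_features(example))
```

An empty list means the index uses none of the four features listed above; it does not guarantee compatibility in every other respect.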

For migrating indexes, you can follow the instructions from the README.md file of the Amazon DocumentDB Index Tool GitHub repo. For our example, we migrate the index as follows:

Clone the repo, go to the index-tool directory, and install the requirements:

git clone https://github.com/awslabs/amazon-documentdb-tools.git
cd amazon-documentdb-tools/index-tool
python3 -m pip install -r requirements.txt

Run the command to dump indexes from your source database:

python3 migrationtools/documentdb_index_tool.py --dump-indexes --uri 'mongodb://<YourUser>:<YourPassword>@<SourceIp>:<Port>' --dir index.dir

Run the command to restore indexes to your Amazon DocumentDB elastic cluster:

python3 migrationtools/documentdb_index_tool.py --restore-indexes --skip-incompatible --skip-id-indexes --dir index.dir --uri 'mongodb://<YourUser>:<YourPassword>@<YourClusterEndpoint>/<YourDatabase>?tls=true&retryWrites=false'

When restoring the indexes, use the option --skip-id-indexes to skip the default index on the _id field of each collection.

Restore data using mongorestore

Now you can start a full load and restore the dumped BSON data using mongorestore as follows:

mongorestore -h docdb-cluster1-xxxx.us-east-1.docdb-elastic.amazonaws.com \
--ssl -u <YourUser> -p <YourPassword> -c <CollectionName> -d <databaseName> \
--dir=/backup/<databaseName>/<CollectionName>.bson \
--numInsertionWorkersPerCollection=16 --noIndexRestore > mongorestore_log.out 2>&1

By default, mongorestore automatically migrates indexes after migrating data. Because we completed migrating the indexes in the last step, we choose not to migrate the index during mongorestore (by adding the option --noIndexRestore).

The command includes the following parameters:

--numInsertionWorkersPerCollection – Specifies the number of workers for a concurrent import, which is not directly related to the number of shards in the Amazon DocumentDB elastic cluster
--dir – Specifies the absolute path of the BSON file
--noIndexRestore – Specifies that no index will be migrated during the restore

Monitor the restore progress in mongorestore_log.out. When you see the message document(s) restored successfully, the restore is complete.

Monitor restore metrics

You can use Amazon CloudWatch to view the number of documents inserted per second on each shard.

On the CloudWatch console, choose All metrics in the navigation pane.
Choose DocDB Elastic and search for DocumentsInserted.
Check that the shards show in the dashboard, as shown in the following screenshot.

There are three shards in this case, and the data writing rate of each shard is about 220,000 documents per second.

Set up and run an AWS DMS parallel CDC task

After the full load, you need to set up the CDC task. We recommend enabling parallel apply to improve the replication rate.

Create source and target endpoints

Create the source endpoint for the Amazon DocumentDB instance-based cluster as you normally would for an AWS DMS migration task. For more information, see Working with AWS DMS endpoints.

To create the target endpoint for the Amazon DocumentDB elastic cluster, you must add an attribute named ReplicateShardCollections for the target endpoint (with the --doc-db-settings '{"ReplicateShardCollections": true}' JSON syntax). This allows AWS DMS to replicate data to the target shard collections. For details, refer to Using endpoint settings with Amazon DocumentDB as a target.
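If you create the endpoint programmatically, the same setting can be passed through the AWS SDK for Python (Boto3). The following is a sketch under assumptions: the endpoint identifier is a hypothetical name, and TLS-related parameters (such as SslMode and a certificate) are omitted and depend on your setup.

```python
from typing import Any, Dict


def target_endpoint_params(server: str, username: str, password: str, database: str) -> Dict[str, Any]:
    """Build create_endpoint arguments for an elastic cluster target."""
    return {
        "EndpointIdentifier": "docdb-elastic-target",  # hypothetical name
        "EndpointType": "target",
        "EngineName": "docdb",
        "ServerName": server,
        "Port": 27017,
        "Username": username,
        "Password": password,
        "DatabaseName": database,
        # Lets AWS DMS replicate data into the sharded collections on the target.
        "DocDbSettings": {"ReplicateShardCollections": True},
    }


def create_target_endpoint(params: Dict[str, Any]) -> Dict[str, Any]:
    """Call AWS DMS; requires boto3 and AWS credentials to be configured."""
    import boto3  # imported here so building the parameters has no AWS dependency

    return boto3.client("dms").create_endpoint(**params)
```

Keeping the parameter builder separate from the API call makes it easy to review the settings before any resource is created.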

Create an AWS DMS replication instance

For instructions on creating an AWS DMS replication instance, see Working with an AWS DMS replication instance.

You must use AWS DMS version 3.5.1 or higher in order to use the AWS DMS parallel apply method with the Amazon DocumentDB elastic cluster.

Create an AWS DMS CDC task

Complete the following steps to create an AWS DMS CDC task:

On the AWS DMS console, create a new migration task.
For Task identifier, enter a name.
Choose your source instance, source database endpoint, and target database endpoint.
For Migration type, choose Replicate data changes only.

For CDC start mode for source transactions, select Enable custom CDC start mode to declare the timestamp to start capturing the change stream.
For Specify start time, enter the start time for when mongodump starts, which you collected in an earlier step.
For Target table preparation mode, select Do nothing.

This means that AWS DMS will ignore the sharding-enabled collection you have created. Otherwise, AWS DMS rebuilds it as a normal non-sharding collection, which deviates from our original intention of horizontal sharding.

For Task logs, select Turn on CloudWatch logs to monitor the tasks.
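The console choices above map onto the arguments of the Boto3 create_replication_task call. The following sketch shows one possible mapping; the task identifier is a hypothetical name, the ARNs are placeholders you supply, and the table mapping assumes a single collection.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def cdc_task_params(source_arn: str, target_arn: str, instance_arn: str,
                    database: str, collection: str,
                    dump_started_utc: datetime) -> Dict[str, Any]:
    """Build create_replication_task arguments for a CDC-only task."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {"schema-name": database, "table-name": collection},
            "rule-action": "include",
        }]
    }
    task_settings = {
        # "Do nothing" keeps the sharded collection you created in advance.
        "FullLoadSettings": {"TargetTablePrepMode": "DO_NOTHING"},
        # Turns on CloudWatch logs for the task.
        "Logging": {"EnableLogging": True},
    }
    return {
        "ReplicationTaskIdentifier": "docdb-elastic-cdc",  # hypothetical name
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "cdc",  # Replicate data changes only
        "TableMappings": json.dumps(table_mappings),
        "ReplicationTaskSettings": json.dumps(task_settings),
        "CdcStartTime": dump_started_utc,  # the UTC time recorded when mongodump started
    }


if __name__ == "__main__":
    params = cdc_task_params("source-arn", "target-arn", "instance-arn", "mydb", "quotes",
                             datetime(2023, 4, 12, 4, 27, 1, tzinfo=timezone.utc))
    print(json.dumps(params, indent=2, default=str))
```

The resulting dictionary can be passed to boto3.client("dms").create_replication_task(**params) once real ARNs are filled in.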

Modify parallel CDC parameters

After the task is created, it should be in the Ready state (not started). Now you need to modify the following three parameters of the AWS DMS CDC task to enable parallel apply, which accelerates the replication task:

"ParallelApplyBufferSize": 1000,
"ParallelApplyQueuesPerThread": 200,
"ParallelApplyThreads": 16,

For more information about modifying these parameters, refer to Accelerate migrations to Amazon DocumentDB using AWS DMS.
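These three parameters belong to the TargetMetadata section of the task settings JSON. The following sketch merges them into an existing settings document without touching other sections; the helper name is our own, and the values are the ones used in this post.

```python
import json
from typing import Any, Dict


def with_parallel_apply(task_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the task settings with the parallel apply parameters set."""
    merged = dict(task_settings)
    target_metadata = dict(merged.get("TargetMetadata", {}))
    target_metadata.update({
        "ParallelApplyBufferSize": 1000,
        "ParallelApplyQueuesPerThread": 200,
        "ParallelApplyThreads": 16,
    })
    merged["TargetMetadata"] = target_metadata
    return merged


if __name__ == "__main__":
    current = {"Logging": {"EnableLogging": True}}
    print(json.dumps(with_parallel_apply(current), indent=2))
```

The JSON string produced by json.dumps can be supplied as ReplicationTaskSettings when modifying the task, for example through the Boto3 modify_replication_task call while the task is stopped.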

Run the CDC task

Now you can start the CDC task. Complete the following steps:

On the AWS DMS console, choose Database migration tasks in the navigation pane.
Select the CDC task and on the Actions menu, choose Restart/Resume.

If the CDC task previously failed or was suspended, and you modified cdc-start-time before starting it again, select Restart. If you select Resume, the CDC process pulls the change stream from the point of the last suspension or failure, and the cdc-start-time you specified has no effect.
Choose Start task.

This starts the data synchronization from the source database to the Amazon DocumentDB elastic cluster.

Monitor the CDC task

After the task starts, if you need to know the status of the CDC task, you have two ways to monitor the task.

First, you can monitor the CloudWatch metrics of AWS DMS. For more details, refer to Monitoring replication tasks using Amazon CloudWatch. As shown in the following screenshot, we can see the indicators of the CDC latency source and CDC latency target. If the gap isn’t reduced, there may be a CDC problem. If the gap is gradually shrinking, it means that the target is gradually catching up.
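If you export the latency datapoints (for example, the CDCLatencyTarget metric values in seconds), a simple comparison can tell you whether the gap is shrinking. This is a rough sketch: the function name, the minimum of four samples, and the half-versus-half comparison are assumptions chosen for illustration.

```python
from typing import Sequence


def lag_is_shrinking(latency_seconds: Sequence[float], min_drop: float = 0.0) -> bool:
    """Compare the average latency of the earlier half of the samples with the later half."""
    if len(latency_seconds) < 4:
        raise ValueError("need at least four samples")
    half = len(latency_seconds) // 2
    earlier = sum(latency_seconds[:half]) / half
    later = sum(latency_seconds[half:]) / (len(latency_seconds) - half)
    return earlier - later > min_drop


if __name__ == "__main__":
    # Hypothetical samples in chronological order, one per minute.
    print(lag_is_shrinking([900, 700, 400, 100]))
    print(lag_is_shrinking([300, 300, 310, 305]))
```

A result of False over a long window is a signal to look at the task logs and the parallel apply settings.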

Second, you can follow the CDC progress in the CloudWatch logs. For more details, refer to View the logs of a DMS task. The CloudWatch logs show the details of the specific CDC task run. For example, if the message [TARGET_APPLY]I: Working in bulk apply mode is listed, it means that parallel apply has been enabled successfully.

Now you have set up an end-to-end migration task to migrate data from your source cluster to an Amazon DocumentDB elastic cluster.

Change the application endpoint to your Amazon DocumentDB elastic cluster

After the full load is complete and the CDC replicating lag is small, stop writing on the source, monitor the lag until it reaches zero, and then reroute the application to use your Amazon DocumentDB elastic cluster.

Clean up

To avoid unnecessary cost, delete the resources you created as part of this post:

To delete the EC2 instance, refer to Terminate your instance.
To delete the AWS DMS replication instance, refer to Deleting a replication instance.
To delete the Amazon DocumentDB cluster, refer to Deleting an Amazon DocumentDB cluster.

Conclusion

Amazon DocumentDB elastic clusters offer a horizontal scaling solution for writes and reads on document-oriented databases. In this post, we showed how to use the hybrid migration approach to migrate to an Amazon DocumentDB elastic cluster. We also introduced best practices to improve the speed of this migration method. If the storage size of your document database is approaching the storage limit of a single cluster, you should split it or shard it as soon as possible.

If you have any questions, leave them in the comments section.

About the Author

Chuan Jin is a Senior Database Solutions Architect in the AWS Greater China region, dedicated to building technical solutions based on AWS databases. He has been working in the database domain for more than 10 years. He is familiar with MySQL, PostgreSQL, and Amazon DocumentDB. He specializes in database architecture, performance tuning, data migration, and data analysis.

AWS Database Blog