AWS Database Blog
Comparison of test_decoding and pglogical plugins in Amazon Aurora PostgreSQL for data migration using AWS DMS
AWS Database Migration Service (AWS DMS) is a popular service used for homogeneous and heterogeneous database migrations. For homogeneous migrations such as PostgreSQL to PostgreSQL, AWS DMS provides an option to use native database replication features. These replication features are essential for implementing change data capture (CDC), which is necessary to reduce downtime during database migration.
In this post, we provide details on two PostgreSQL plugins available for use by AWS DMS. We compare these plugin options and share test results to help database administrators understand the best practices and benefits of each plugin when working on migrations.
Prerequisites
This post is intended for PostgreSQL and migration professionals. We assume that readers have the following prerequisites:
- Knowledge of AWS DMS
- Familiarity with PostgreSQL and logical replication
- Knowledge of AWS services such as Amazon CloudWatch and Amazon Aurora PostgreSQL-Compatible Edition
Overview
During CDC migration tasks (full load and CDC, or CDC only), AWS DMS uses logical replication slots to retain WAL logs for replication until the logs are decoded. It primarily supports two logical decoding plugins:
- test_decoding: An output plugin that is part of the postgresql-contrib package.
- pglogical: A PostgreSQL extension developed by 2ndQuadrant (now part of EnterpriseDB), based on a publisher/subscriber approach.
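If you want to confirm which decoding plugin a given replication slot uses on the source database, you can query the pg_replication_slots catalog view. This is a minimal example; the slot names returned depend on what AWS DMS created for your tasks:

```sql
-- List replication slots and the logical decoding plugin each one uses
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots;
```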
One of the key parameters that impacts the performance of CDC operations is the PostgreSQL parameter logical_decoding_work_mem. In the first section of this blog post, we demonstrate how to tune this parameter by tracking disk spills at the query level. Disk spills can significantly impact replication and overall database engine performance.
In the second section, we show an in-depth performance impact comparison between the test_decoding and pglogical decoding plugins.
Section 1: Tracking disk spill during logical replication
In scenarios that involve long-running transactions, AWS DMS waits for the transactions to commit before consuming any other transactions from the replication slot. Throughout this process, both the test_decoding and pglogical plugins are actively engaged in decoding all transactions in the write-ahead log (WAL). If the logical_decoding_work_mem buffer fills up, the excess is written to disk. You can monitor the cumulative effect of these disk spills using the pg_stat_replication_slots view.
The following query shows spill file details such as spill_count and spill_bytes.
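A minimal version of such a query, assuming PostgreSQL 14 or later (which exposes these cumulative counters in the pg_stat_replication_slots view):

```sql
-- Cumulative spill statistics per logical replication slot
SELECT slot_name, spill_txns, spill_count, spill_bytes
FROM pg_stat_replication_slots;
```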
The parameter logical_decoding_work_mem specifies the memory allowance for the logical decoding process, governing the decoding and streaming mechanisms. Within the logical decoding process, WAL records are translated into SQL statements and then transmitted to the DMS task. It's essential to ensure that the entire transaction fits within the memory allocated by logical_decoding_work_mem, because any overflow results in records spilling to disk (similar to the effect of work_mem and temporary files on a query), ultimately slowing down the overall process.
Optimizing and adjusting the logical_decoding_work_mem parameter in line with the workload's demands can significantly alleviate the strain caused by disk spills during batch loads, ensuring a smoother and more efficient replication process. Understanding these intricacies is crucial for fine-tuning the configuration and resource allocation, thereby enhancing the performance and resilience of the CDC process within AWS DMS.
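As a quick illustration, you can inspect the current setting from any session; the default is 64MB, and on Aurora PostgreSQL the value is changed through the DB cluster or instance parameter group rather than with ALTER SYSTEM:

```sql
-- Check the current value of logical_decoding_work_mem (default 64MB)
SHOW logical_decoding_work_mem;

-- Or query pg_settings for the value, unit, and where it was set from
SELECT name, setting, unit, source
FROM pg_settings
WHERE name = 'logical_decoding_work_mem';
```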
To adjust the logical_decoding_work_mem parameter during long-running workloads, we recommend capturing the spill file information from the aurora_stat_file() function, as described in Tuning memory parameters for Aurora PostgreSQL.
To implement this solution, the high-level steps are:
- Connect to the source Aurora PostgreSQL cluster using the database credentials.
- Create a tracking table to gather spill file information (see the sketch after this list).
- Capture the result set from the aurora_stat_file() function into the tracking table.
- Use \watch to continuously capture spill file information at the required interval; for example, \watch 1 repeats the current query buffer every 1 second.
- Open another PostgreSQL session to the database and run a query against the tracking table to retrieve information about spill file usage (also shown in the sketch below).
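The following psql sketch covers steps 2 through 5. The table name spill_file_tracking, the column names returned by aurora_stat_file() (assumed here to be filename, allocated_bytes, and used_bytes), and the filter on the pg_replslot directory are illustrative assumptions; adjust them to match the output of aurora_stat_file() on your Aurora PostgreSQL version.

```sql
-- Step 2 (illustrative): tracking table for spill file samples
CREATE TABLE spill_file_tracking (
    collected_at    timestamptz DEFAULT now(),
    filename        text,
    allocated_bytes bigint,
    used_bytes      bigint
);

-- Steps 3 and 4 (illustrative): capture spill files produced by logical decoding
-- (files under pg_replslot), then repeat the capture every second with \watch
INSERT INTO spill_file_tracking (filename, allocated_bytes, used_bytes)
SELECT filename, allocated_bytes, used_bytes
FROM aurora_stat_file()
WHERE filename LIKE '%pg_replslot%';
\watch 1

-- Step 5 (run from a second session): summarize spill file usage per sample
SELECT collected_at,
       count(*)        AS spill_files,
       sum(used_bytes) AS total_spill_bytes
FROM spill_file_tracking
GROUP BY collected_at
ORDER BY collected_at;
```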
After you see the spill file details, go back to the \watch 1 session and press Ctrl + C to stop monitoring.
This information helps you understand the number of spills and the total size of spills happening at a particular slot, so that you can fine-tune the logical_decoding_work_mem parameter to reduce these disk spills and improve the decoding process at the replication slot.
Section 2: Performance impact comparison between the test_decoding and pglogical decoding plugins
In this section, we review different test scenarios to compare the advantages and disadvantages of the two PostgreSQL plugins primarily used in AWS DMS.
Solution overview
The following diagram shows the architectural overview of the resources created for comparing and testing test_decoding and pglogical.
Configuration
- Provision Aurora PostgreSQL
- Set up AWS DMS with PostgreSQL as a source
- Configure the pglogical plugin for logical replication on Aurora PostgreSQL with AWS DMS
- Create a replication slot using the test_decoding output plugin (a sketch of the plugin setup follows this list)
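The following is a minimal sketch of the plugin setup on the source, assuming rds.logical_replication is set to 1 in the cluster parameter group and, for pglogical, that shared_preload_libraries includes pglogical. AWS DMS normally creates its own replication slot; the manual slot creation shown here is only for illustration, and the slot name is arbitrary.

```sql
-- Enable the pglogical extension on the source database
-- (requires shared_preload_libraries = 'pglogical' in the cluster parameter group)
CREATE EXTENSION IF NOT EXISTS pglogical;

-- Illustrative only: create a logical replication slot that uses the
-- test_decoding output plugin; AWS DMS creates an equivalent slot for its CDC task
SELECT pg_create_logical_replication_slot('dms_slot_test_decoding', 'test_decoding');
```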
Notes:
- For this comparison, we used Aurora PostgreSQL 14.9 on an r6g.2xlarge instance and AWS DMS 3.5.2. Other components such as storage, network, and DB parameters were the same in both cases, except for the logical decoding plugin.
- AWS DMS uses either the test_decoding or the pglogical plugin for logical decoding. If the pglogical plugin is available on the source PostgreSQL database, AWS DMS creates a replication slot using pglogical; otherwise, the test_decoding plugin is used.
For our tests, we use two tables:
- seat
- ticket_purchase_hist
Note: These tables are populated and the DML load is generated using a pgbench custom script and PostgreSQL functions such as random() and generate_series() (see the sketch below).
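As a rough illustration of the load generation, the following statement (which could be wrapped in a pgbench custom script) inserts synthetic rows using generate_series() and random(); the column names for dms_sample.ticket_purchase_hist are assumptions and should be adapted to your copy of the schema.

```sql
-- Illustrative bulk insert into the CDC table; column names are assumed
INSERT INTO dms_sample.ticket_purchase_hist
    (sporting_event_ticket_id, purchased_by_id, transaction_date_time, purchase_price)
SELECT gs,
       (random() * 10000)::int + 1,
       now(),
       round((random() * 100)::numeric, 2)
FROM generate_series(1, 100000) AS gs;
```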
We conduct two tests on these tables with both decoding plugins.
- Test 1: We run transactions against the seat table. This table is not replicated because it is not included in the AWS DMS CDC task. We refer to this table as the non-CDC table.
- Test 2: We run transactions against the ticket_purchase_hist table. This table is replicated and part of the AWS DMS CDC task. We refer to this table as the CDC table.
For each test case, we observe and analyze the performance of each plugin using the following Amazon CloudWatch metrics:
- ReplicationSlotDiskUsage: A CloudWatch metric for Amazon Relational Database Service (Amazon RDS) that provides information about the disk space used by replication slots on Amazon RDS and Aurora PostgreSQL instances.
- TransactionLogsDiskUsage: A CloudWatch metric for Amazon RDS that provides information about the disk space used by transaction logs on an Amazon RDS or Aurora instance.
- CDCLatencySource: A CloudWatch metric for AWS DMS that provides information about the latency in replicating data changes (CDC) from the source database to the replication instance.
- CDCLatencyTarget: A CloudWatch metric for AWS DMS that provides information about the latency in replicating data changes (CDC) from the replication instance to the target database.
- CDCThroughputRowsSource: A CloudWatch metric for the AWS DMS replication task that provides information about the data changes (CDC) captured from the source database, in rows per second.
- CDCIncomingChanges: A CloudWatch metric for AWS DMS that provides information about the number of incoming data changes (CDC) captured by AWS DMS from the source database.
Test 1: Load on non-CDC tables
There are two test cases for non-CDC tables:
- Test 1-A: Load on the non-CDC table using a replication slot with test_decoding
- Test 1-B: Load on the non-CDC table using a replication slot with pglogical
Replication slot with test_decoding
In Test 1-A, we initiated a long-running transaction on the seat table to understand the impact on the replication slot. The following table describes the configuration used for the replication slot with test_decoding.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-1 | testdecodingOnNonCDCTable | Source | dms_sample.seat | Transactions created on non-CDC table | test_decoding |
| Database-2 | testdecodingOnNonCDCTable | Target | dms_sample.seat | Transactions created on non-CDC table | test_decoding |
test_decoding: Migrating database-1 to database-2
The following two graphs show the CloudWatch metrics of the writer instance on the source cluster: ReplicationSlotDiskUsage and TransactionLogsDiskUsage.
Metric name: ReplicationSlotDiskUsage
Based on the following graph, we observe that disk usage increased on the source database during the long-running transaction when using the test_decoding plugin.
Metric name: TransactionLogsDiskUsage
Based on the following graph, we observe that disk usage for transaction logs increased on the source database during the long-running transaction.
The following graphs show the Amazon CloudWatch metrics of the AWS DMS task (testdecodingOnNonCDCTable) for CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC latency source
Metric name: CDC latency target
In the following graph, we observe that latency increased with the increased disk usage at the replication slot on the target.
Metric name: CDC throughput rows source
In the following graph, we observe zero throughput because transactions are happening on tables that are not part of CDC.
We observe latency in the AWS DMS task CloudWatch metrics for the preceding long-running transaction. The main reason for the latency is that all of the logically decoded data is sent to AWS DMS, which then filters the data based on the defined rule set. In our case, we created load on tables that aren't part of CDC. When using the test_decoding plugin, CDC table filters aren't applied at the replication slot but within AWS DMS, which creates latency.
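To observe this behavior directly, you can peek at the decoded stream of a test_decoding slot without consuming it. This is a minimal sketch, assuming the manually created slot dms_slot_test_decoding from the configuration sketch (an idle slot, not the one actively used by a running AWS DMS task):

```sql
-- Peek at (without consuming) the next 10 decoded changes on a test_decoding slot;
-- changes for every table appear here, including tables that are not in the DMS task
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('dms_slot_test_decoding', NULL, 10);
```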
Replication slot with pglogical
In Test 1-B, we initiated a long-running transaction on the seat table to understand the impact of the pglogical plugin. The following table shows the configuration used for the replication slot with pglogical.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-3 | pglogicalOnNonCDCTable | Source | dms_sample.seat | Transactions created on non-CDC table | pglogical |
| Database-4 | pglogicalOnNonCDCTable | Target | dms_sample.seat | Transactions created on non-CDC table | pglogical |
The following CloudWatch graphs show ReplicationSlotDiskUsage and TransactionLogsDiskUsage of the writer instance on the source cluster.
pglogical: Migrating database-3 to database-4
Metric name: ReplicationSlotDiskUsage
Metric name: TransactionLogsDiskUsage
The following graphs show metrics of the AWS DMS task (pglogicalOnNonCDCTable) for CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC latency source
Metric name: CDC latency target
Metric name: CDC throughput rows source
No latency is observed in the AWS DMS task metrics for replication using the pglogical plugin for the same long-running transaction. This is because the logically decoded data is filtered for tables that are not being replicated (not part of the CDC task) before AWS DMS consumes these changes.
Comparison of test_decoding and pglogical for Test 1
The following graphs show a comparison of CloudWatch metrics of the source instance in both use cases.
The orange line represents data sourced from database-1 using the test_decoding plugin, while the blue line depicts data from database-3 using the pglogical plugin. We observe a sudden dip in the blue line when compared to the relatively flat orange line. This behavior can be attributed to the filtering mechanism used by the pglogical plugin at the replication slot level.
Transactions involving tables that are not part of the CDC process are filtered out, resulting in the sudden dip observed in the blue line. On the other hand, the test_decoding plugin does not filter transactions based on CDC participation, leading to a flat line despite transactions occurring on tables that are not part of the CDC process.
Test 1 Observations
Looking at the comparison metrics, we observed that the PostgreSQL replication slot with the test_decoding plugin (database-1-instance-1) waited for the DMS task to consume the entire transaction after decoding. The same didn't happen with the pglogical plugin (database-3-instance-1). Also, latency was observed in the DMS task metrics for the test_decoding replication, but not for the pglogical replication. The reason for both observations is that filtering of non-CDC tables happened at the replication slot when using the pglogical plugin; this didn't happen with the test_decoding plugin.
During the long-running transaction, when a delayed parallel insert statement runs on the CDC table, we observed longer delays in processing these insert statements with test_decoding than with pglogical. This is because test_decoding can't filter any changes at the replication slot, and the entire WAL is decoded and sent to the AWS DMS task for filtering. To preserve the order of transactions, the delayed insert statement is decoded along with the long-running transaction, but filtering is performed and applied only after the AWS DMS task consumes the replication slot changes. The DMS task consumes these changes only after the commit of the long-running transaction is seen, and therefore latency is observed. On the other hand, because filtering is done at the replication slot for pglogical, any parallel data manipulation language (DML) activity on CDC tables is consumed and applied by the AWS DMS task. Even though the long-running transaction on the non-CDC table is being decoded by the pglogical replication slot, other DML activity on CDC tables isn't hindered. The pglogical replication slot already knows that the long-running transaction on non-CDC tables doesn't need to be consumed by the DMS task, so the order of transactions is preserved only for CDC tables, and their changes are consumed and applied by the DMS task.
Test 2: Load on CDC tables
In this test scenario, we create a new replication slot on database-1 (source) with the test_decoding plugin, along with a new AWS DMS endpoint and a new DMS task that includes the CDC table where the load will be initiated. We also create a new replication slot on database-3 (source) with the pglogical plugin, along with a new DMS endpoint and a new DMS task that includes the CDC table.
Similar to the earlier test, we test two cases:
- Test 2-A: Load on the CDC table using a replication slot with test_decoding
- Test 2-B: Load on the CDC table using a replication slot with pglogical
Replication slot with test_decoding
In Test 2-A, we initiated a long-running transaction on the ticket_purchase_hist table, which is replicated (part of the CDC task), to understand the impact on the replication slot. The following table shows the configuration used for the replication slot with test_decoding.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-1 | testdecodingOnCDCTable | Source | dms_sample.ticket_purchase_hist | Transactions created on CDC table | test_decoding |
| Database-2 | testdecodingOnCDCTable | Target | dms_sample.ticket_purchase_hist | Transactions created on CDC table | test_decoding |
test_decoding: Migrating database-1 to database-2
The following graphs show the CloudWatch metrics of the writer instance on the source cluster for ReplicationSlotDiskUsage and TransactionLogsDiskUsage.
Metric name: ReplicationSlotDiskUsage
As shown in the following graph, we observe that disk usage increased on the source database during the long-running transaction when using the test_decoding plugin.
Metric name: TransactionLogsDiskUsage
As shown in the following graph, we observe that disk usage for transaction logs increased on the source database during the long-running transaction.
The following graphs show the CloudWatch metrics of the AWS DMS task (testdecodingOnCDCTable) for CDC incoming changes, CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC incoming changes
As shown in the following graph, we observe an increased number of incoming changes along with the increased disk usage at the replication slot on the source because the transactions occur on tables that are part of CDC.
Metric name: CDC latency source
As shown in the following graph, we observe increased latency with the increased disk usage at the replication slot on the source.
Metric name: CDC latency target
As shown in the following graph, we observe increased latency on the target with the increased disk usage at the replication slot.
Metric name: CDC throughput rows source
As shown in the following graph, we observe increased throughput because transactions are happening on tables that are part of CDC.
The following screenshot from the AWS DMS console shows INSERT statistics for the ticket_purchase_hist table.
The CloudWatch metrics for the AWS DMS task reveal latency associated with the long-running transaction. Because we have incorporated the tables involved in the transactions into the CDC process, all of the decoded data is transmitted to AWS DMS regardless of whether the plugin filters data at the replication slot or at the AWS DMS level. Consequently, we expect to observe latency in the following graphs even when using the pglogical plugin.
Replication slot with pglogical
In Test 2-B, we wanted to observe the latency created with a similar configuration using the pglogical plugin. The following table shows the configuration used for the replication slot with pglogical.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-3 | pglogicalOnCDCTable | Source | dms_sample.ticket_purchase_hist | Transactions created on CDC table | pglogical |
| Database-4 | pglogicalOnCDCTable | Target | dms_sample.ticket_purchase_hist | Transactions created on CDC table | pglogical |
pglogical: Migrating database-3 to database-4
The following graphs show the CloudWatch metrics of the writer instance on the source cluster.
Metric name: ReplicationSlotDiskUsage
Based on the following graph, we observe that disk usage increased on the source database during the long-running transaction when using the pglogical plugin.
Metric name: TransactionLogsDiskUsage
Based on the following graph, we observe that disk usage for transaction logs increased on the source database during the long-running transaction.
The following graphs show the CloudWatch metrics of the AWS DMS task (pglogicalOnCDCTable) for CDC incoming changes, CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC incoming changes
As shown in the following graph, we observe an increased number of incoming changes with the increased disk usage at the replication slot on the source because transactions are happening on tables that are part of CDC.
Metric name: CDC latency source
As shown in the following graph, we observe increased latency with the increased disk usage at the replication slot on the source because transactions are happening on tables that are part of CDC.
Metric name: CDC latency target
As shown in the following graph, we observe increased latency on the target with the increased disk usage at the replication slot because transactions are happening on tables that are part of CDC.
Metric name: CDC throughput rows source
As shown in the following graph, we observe increased throughput because transactions are happening on tables that are part of CDC.
The following screenshot shows INSERT statistics for the ticket_purchase_hist table in the AWS DMS console.
As anticipated, we observe similar trends in CloudWatch metrics even for the pglogical plugin.
Comparison of test_decoding and pglogical for Test 2
The following CloudWatch metric graphs illustrate the resource utilization patterns for ReplicationSlotDiskUsage and TransactionLogsDiskUsage when using the test_decoding and pglogical plugins during a data migration task. Recapping the configuration, we used test_decoding for the database-1 to database-2 migration and pglogical for the database-3 to database-4 migration. By analyzing these graphs, we can compare the resource consumption and identify the more efficient plugin for our specific migration scenario.
The orange line represents data sourced from database-1 using the test_decoding plugin, while the blue line shows data from database-3 using the pglogical plugin. Our analysis suggests that the test_decoding and pglogical graphs should exhibit similar patterns. However, the sudden dip in the blue line can be attributed to the presence of two tables in the DMS task, where one table is part of the CDC process and the other is not. Because pglogical filters data at the replication slot level, there is a sudden dip in the graph.
Test 2 Observations
We observed no major advantage in Test 2 when comparing test_decoding and pglogical with the load initiated on CDC tables. Both decoding plugins resulted in similar trends in the various CloudWatch metrics.
Increasing the number of DMS tasks leads to an increase in ReplicationSlotDiskUsage and more disk spills. This can exacerbate latency issues caused by long-running transactions.
Conclusion
In this post, in Section 1 we demonstrated how to efficiently capture spill file information using the aurora_stat_file() function and a tracking table, which helps in tuning the logical_decoding_work_mem parameter and improving the performance of the AWS DMS CDC process. In Section 2, we compared the test_decoding and pglogical plugins when replicating data between Amazon Aurora PostgreSQL-Compatible instances with AWS DMS using a specific workload. Based on our observations, each of the two plugins has certain advantages that can help improve AWS DMS CDC performance in the following use cases:
- When AWS DMS encounters bottlenecks (such as stalled tasks or high latency), pglogical offers an advantage over test_decoding by filtering non-CDC tables at the replication slot.
- Because the pglogical plugin filters data at the source replication slot, it comparatively reduces the amount of network throughput needed, especially when migrating from on-premises environments or other public clouds where higher network bandwidth is required.
- The pglogical plugin is well suited to workloads that don't require all tables to be part of the CDC process because of the way it filters non-CDC tables.
- If the bottleneck is observed at the replication slot during the decoding process, adjusting the logical_decoding_work_mem parameter can help reduce the spills to disk. Also, chunking the long-running transaction helps to reduce the disk spills at the replication slot.
- If no filtering of tables is required, meaning all the tables in the source PostgreSQL database need to be part of the CDC process, test_decoding could be used.
- test_decoding is preferred if you have a dependency on the entire row being captured and transferred for any update or delete transaction.
- test_decoding is preferred if most of the tables defined in the database don't have primary key constraints, because the entire row will be migrated during update or delete operations.
Along with choosing the right logical decoding plugin, it is crucial to adhere to migration best practices, including:
- Choosing appropriate instance sizes (AWS DMS replication instance, Aurora PostgreSQL)
- Designing migration tasks effectively
- Selecting suitable storage and network components
- Setting up monitoring with alerts
If you have any questions or suggestions, leave them in the comments section.
About the authors
Viswanatha Shastry Medipalli is a Senior Architect with the AWS ProServe team. His background spans a wide depth and breadth of expertise and experience in database migrations. He has architected and designed many successful database solutions that address challenging business requirements. He has built solutions using Oracle, SQL Server, and PostgreSQL databases for Reporting, Business intelligence (BI) applications and development support. He also has a good knowledge of automation and orchestration. His focus area is homogeneous and heterogeneous migrations of on-premises databases to Amazon.
Swanand Kshirsagar is an Architect within the Professional Services Expert Services division at AWS. He specializes in collaborating with clients to architect and implement scalable, robust, and security-compliant solutions in the AWS Cloud. His primary expertise lies in orchestrating seamless migrations, encompassing both homogenous and heterogeneous transitions, facilitating the relocation of on-premises databases to Amazon RDS and Amazon Aurora PostgreSQL with efficiency. Swanand has also delivered sessions and conducted workshops at conferences.
Abhilash Sajja is a Database Consultant on the Professional Services team at AWS. He works with customers and partners in their journey to the AWS Cloud with a focus on database migration and modernization programs. He uses his experience in different database technologies to offer guidance and technical support to customers migrating their on-premises database environment to AWS data stores.