AWS Database Blog
Comparison of test_decoding and pglogical plugins in Amazon Aurora PostgreSQL for data migration using AWS DMS
AWS Database Migration Service (AWS DMS) is a popular service used for homogeneous and heterogeneous database migrations. For homogeneous migrations such as PostgreSQL to PostgreSQL, AWS DMS provides an option to use native database replication features. These replication features are essential for implementing change data capture (CDC), which is necessary to reduce downtime during database migration.
In this post, we provide details on two PostgreSQL plugins available for use by AWS DMS. We compare these plugin options and share test results to help database administrators understand the best practices and benefits of each plugin when working on migrations.
Prerequisites
This post is intended for PostgreSQL and migration professionals. We assume that readers have the following prerequisites:
- Knowledge of AWS DMS
- Familiarity with PostgreSQL and logical replication
- Knowledge of AWS services such as Amazon CloudWatch and Amazon Aurora PostgreSQL-Compatible Edition
Overview
During CDC migration tasks (full load and CDC, or CDC only), AWS DMS uses logical replication slots to retain WAL logs for replication until the logs are decoded. It primarily supports two logical decoding plugins:
- test_decoding: An output plugin that is part of the postgresql-contrib package.
- pglogical: A PostgreSQL extension developed by 2ndQuadrant (now part of EnterpriseDB), based on a publisher/subscriber approach.
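If you want to confirm which decoding plugin a given replication slot uses on the source database, you can query the pg_replication_slots catalog view. This is a minimal example; the slot names returned depend on what AWS DMS created for your tasks:

```sql
-- List replication slots and the logical decoding plugin each one uses
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots;
```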
One of the key parameters that impacts the performance of CDC operations is the PostgreSQL parameter logical_decoding_work_mem. In the first section of this blog post, we demonstrate how to tune this parameter by tracking disk spills at the query level. Disk spills can significantly impact replication and overall database engine performance.
In the second section, we show an in-depth performance impact comparison between the test_decoding and pglogical decoding plugins.
Section 1: Tracking disk spill during logical replication
In scenarios that involve long-running transactions, AWS DMS waits for the transactions to commit before consuming any other transactions from the replication slot. Throughout this process, both the test_decoding and pglogical plugins are actively engaged in decoding all transactions in the write-ahead log (WAL). If the logical_decoding_work_mem buffer fills up, the excess is written to disk. You can monitor the cumulative effect of these disk spills using the pg_stat_replication_slots view.
The following query shows spill file details such as spill_count and spill_bytes.
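A minimal version of such a query, assuming PostgreSQL 14 or later (which exposes these cumulative counters in the pg_stat_replication_slots view):

```sql
-- Cumulative spill statistics per logical replication slot
SELECT slot_name, spill_txns, spill_count, spill_bytes
FROM pg_stat_replication_slots;
```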
The parameter logical_decoding_work_mem specifies the memory allowance for the logical decoding process, governing the decoding and streaming mechanisms. Within the logical decoding process, WAL records are translated into SQL statements and then transmitted to the DMS task. It's essential to ensure that the entire transaction fits within the memory allocated by logical_decoding_work_mem, because any overflow results in records spilling to disk (similar to the effect of work_mem and temporary files on a query), ultimately slowing down the overall process.
Optimizing and adjusting the logical_decoding_work_mem parameter in line with the workload's demands can significantly alleviate the strain caused by disk spills during batch loads, ensuring a smoother and more efficient replication process. Understanding these intricacies is crucial for fine-tuning the configuration and resource allocation, thereby enhancing the performance and resilience of the CDC process within AWS DMS.
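As a quick illustration, you can inspect the current setting from any session; the default is 64MB, and on Aurora PostgreSQL the value is changed through the DB cluster or instance parameter group rather than with ALTER SYSTEM:

```sql
-- Check the current value of logical_decoding_work_mem (default 64MB)
SHOW logical_decoding_work_mem;

-- Or query pg_settings for the value, unit, and where it was set from
SELECT name, setting, unit, source
FROM pg_settings
WHERE name = 'logical_decoding_work_mem';
```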
To adjust the logical_decoding_work_mem parameter during long-running workloads, we recommend capturing the spill file information from the aurora_stat_file() function, as described in Tuning memory parameters for Aurora PostgreSQL.
To implement this solution, the high-level steps are:
- Connect to the source Aurora PostgreSQL cluster using the database credentials.
- Create a tracking table to gather spill file information (see the sketch after this list).
- Capture the result set from the aurora_stat_file() function into the tracking table.
- Use \watch to continuously capture spill file information at the required interval; for example, \watch 1 repeats the current query buffer every 1 second.
- Open another PostgreSQL session to the database and run a query against the tracking table to retrieve information about spill file usage (also shown in the sketch below).
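The following psql sketch covers steps 2 through 5. The table name spill_file_tracking, the column names returned by aurora_stat_file() (assumed here to be filename, allocated_bytes, and used_bytes), and the filter on the pg_replslot directory are illustrative assumptions; adjust them to match the output of aurora_stat_file() on your Aurora PostgreSQL version.

```sql
-- Step 2 (illustrative): tracking table for spill file samples
CREATE TABLE spill_file_tracking (
    collected_at    timestamptz DEFAULT now(),
    filename        text,
    allocated_bytes bigint,
    used_bytes      bigint
);

-- Steps 3 and 4 (illustrative): capture spill files produced by logical decoding
-- (files under pg_replslot), then repeat the capture every second with \watch
INSERT INTO spill_file_tracking (filename, allocated_bytes, used_bytes)
SELECT filename, allocated_bytes, used_bytes
FROM aurora_stat_file()
WHERE filename LIKE '%pg_replslot%';
\watch 1

-- Step 5 (run from a second session): summarize spill file usage per sample
SELECT collected_at,
       count(*)        AS spill_files,
       sum(used_bytes) AS total_spill_bytes
FROM spill_file_tracking
GROUP BY collected_at
ORDER BY collected_at;
```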
After you see the spill file details, go back to the \watch 1 session and press Ctrl + C to stop monitoring.
This information helps you understand the number of spills and the total size of spills happening at a particular slot, so that you can fine-tune the logical_decoding_work_mem parameter to reduce these disk spills and improve the decoding process at the replication slot.
Section 2: Performance impact comparison between the test_decoding and pglogical decoding plugins
In this section, we review different test scenarios to compare the advantages and disadvantages of the two PostgreSQL plugins primarily used in AWS DMS.
Solution overview
The following diagram shows the architectural overview of the resources created for comparing and testing test_decoding and pglogical.
Configuration
- Provision Aurora PostgreSQL
- Set up AWS DMS with PostgreSQL as a source
- Configure the pglogical plugin for logical replication on Aurora PostgreSQL with AWS DMS
- Create a replication slot using the test_decoding output plugin (a sketch of the plugin setup follows this list)
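The following is a minimal sketch of the plugin setup on the source, assuming rds.logical_replication is set to 1 in the cluster parameter group and, for pglogical, that shared_preload_libraries includes pglogical. AWS DMS normally creates its own replication slot; the manual slot creation shown here is only for illustration, and the slot name is arbitrary.

```sql
-- Enable the pglogical extension on the source database
-- (requires shared_preload_libraries = 'pglogical' in the cluster parameter group)
CREATE EXTENSION IF NOT EXISTS pglogical;

-- Illustrative only: create a logical replication slot that uses the
-- test_decoding output plugin; AWS DMS creates an equivalent slot for its CDC task
SELECT pg_create_logical_replication_slot('dms_slot_test_decoding', 'test_decoding');
```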
Notes:
- For this comparison, we used Aurora PostgreSQL 14.9 on an r6g.2xlarge instance and AWS DMS 3.5.2. Other components such as storage, network, and DB parameters were the same in both cases, except for the logical decoding plugin.
- AWS DMS uses either the test_decoding or the pglogical plugin for logical decoding. If the pglogical plugin is available on the source PostgreSQL database, AWS DMS creates a replication slot using pglogical; otherwise, the test_decoding plugin is used.
For our tests, we use two tables:
- seat
- ticket_purchase_hist
Note: These tables are populated and the DML load is generated using a pgbench custom script and PostgreSQL functions such as random() and generate_series() (see the sketch below).
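As a rough illustration of the load generation, the following statement (which could be wrapped in a pgbench custom script) inserts synthetic rows using generate_series() and random(); the column names for dms_sample.ticket_purchase_hist are assumptions and should be adapted to your copy of the schema.

```sql
-- Illustrative bulk insert into the CDC table; column names are assumed
INSERT INTO dms_sample.ticket_purchase_hist
    (sporting_event_ticket_id, purchased_by_id, transaction_date_time, purchase_price)
SELECT gs,
       (random() * 10000)::int + 1,
       now(),
       round((random() * 100)::numeric, 2)
FROM generate_series(1, 100000) AS gs;
```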
We conduct two tests on these tables with both decoding plugins.
- Test 1: We run transactions against the seat table. This table is not replicated because it is not included in the AWS DMS CDC task. We refer to this table as the non-CDC table.
- Test 2: We run transactions against the ticket_purchase_hist table. This table is replicated and part of the AWS DMS CDC task. We refer to this table as the CDC table.
For each test case, we observe and analyze the performance of each plugin using the following Amazon CloudWatch metrics:
- ReplicationSlotDiskUsage: A CloudWatch metric for Amazon Relational Database Service (Amazon RDS) that provides information about the disk space used by replication slots on Amazon RDS and Aurora PostgreSQL instances.
- TransactionLogsDiskUsage: A CloudWatch metric for Amazon RDS that provides information about the disk space used by transaction logs on an Amazon RDS or Aurora instance.
- CDCLatencySource: A CloudWatch metric for AWS DMS that provides information about the latency in replicating data changes (CDC) from the source database to the replication instance.
- CDCLatencyTarget: A CloudWatch metric for AWS DMS that provides information about the latency in replicating data changes (CDC) from the replication instance to the target database.
- CDCThroughputRowsSource: A CloudWatch metric for the AWS DMS replication task that provides information about the data changes (CDC) captured from the source database, in rows per second.
- CDCIncomingChanges: A CloudWatch metric for AWS DMS that provides information about the number of incoming data changes (CDC) captured by AWS DMS from the source database.
Test 1: Load on non-CDC tables
There are two test cases for non-CDC tables:
- Test 1-A: Load on the non-CDC table using a replication slot with test_decoding
- Test 1-B: Load on the non-CDC table using a replication slot with pglogical
Replication slot with test_decoding
In Test 1-A, we initiated a long-running transaction on the seat table to understand the impact on the replication slot. The following table describes the configuration used for the replication slot with test_decoding.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-1 | testdecodingOnNonCDCTable | Source | dms_sample.seat | Transactions created on non-CDC table | test_decoding |
| Database-2 | testdecodingOnNonCDCTable | Target | dms_sample.seat | Transactions created on non-CDC table | test_decoding |
test_decoding: Migrating database-1 to database-2
The following two graphs show the CloudWatch metrics of the writer instance on the source cluster: ReplicationSlotDiskUsage and TransactionLogsDiskUsage.
Metric name: ReplicationSlotDiskUsage
Based on the following graph, we observe that disk usage increased on the source database during the long-running transaction when using the test_decoding plugin.
Metric name: TransactionLogsDiskUsage
Based on the following graph, we observe that disk usage for transaction logs increased on the source database during the long-running transaction.
The following graphs show the Amazon CloudWatch metrics of the AWS DMS task (testdecodingOnNonCDCTable) for CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC latency source
Metric name: CDC latency target
In the following graph, we observe that latency increased with the increased disk usage at the replication slot on the target.
Metric name: CDC throughput rows source
In the following graph, we observe zero throughput because transactions are happening on tables that are not part of CDC.
We observe latency in the AWS DMS task CloudWatch metrics for the preceding long-running transaction. The main reason for the latency is that all of the logically decoded data is sent to AWS DMS, which then filters the data based on the defined rule set. In our case, we created load on tables that aren't part of CDC. When using the test_decoding plugin, CDC table filters aren't applied at the replication slot but within AWS DMS, which creates latency.
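To observe this behavior directly, you can peek at the decoded stream of a test_decoding slot without consuming it. This is a minimal sketch, assuming the manually created slot dms_slot_test_decoding from the configuration sketch (an idle slot, not the one actively used by a running AWS DMS task):

```sql
-- Peek at (without consuming) the next 10 decoded changes on a test_decoding slot;
-- changes for every table appear here, including tables that are not in the DMS task
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('dms_slot_test_decoding', NULL, 10);
```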
Replication slot with pglogical
In Test 1-B, we initiated a long-running transaction on the seat table to understand the impact of the pglogical plugin. The following table shows the configuration used for the replication slot with pglogical.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-3 | pglogicalOnNonCDCTable | Source | dms_sample.seat | Transactions created on non-CDC table | pglogical |
| Database-4 | pglogicalOnNonCDCTable | Target | dms_sample.seat | Transactions created on non-CDC table | pglogical |
The following CloudWatch graphs show ReplicationSlotDiskUsage and TransactionLogsDiskUsage of the writer instance on the source cluster.
pglogical: Migrating database-3 to database-4
Metric name: ReplicationSlotDiskUsage
Metric name: TransactionLogsDiskUsage
The following graphs show metrics of the AWS DMS task (pglogicalOnNonCDCTable) for CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC latency source
Metric name: CDC latency target
Metric name: CDC throughput rows source
No latency is observed in the AWS DMS task metrics for replication using the pglogical plugin for the same long-running transaction. This is because the logically decoded data is filtered for tables that are not being replicated (not part of the CDC task) before AWS DMS consumes these changes.
Comparison of test_decoding and pglogical for Test 1
The following graphs show a comparison of CloudWatch metrics of the source instance in both use cases.
The orange line represents data sourced from database-1 using the test_decoding plugin, while the blue line depicts data from database-3 using the pglogical plugin. We observe a sudden dip in the blue line when compared to the relatively flat orange line. This behavior can be attributed to the filtering mechanism used by the pglogical plugin at the replication slot level.
Transactions involving tables that are not part of the CDC process are filtered out, resulting in the sudden dip observed in the blue line. On the other hand, the test_decoding plugin does not filter transactions based on CDC participation, leading to a flat line despite transactions occurring on tables that are not part of the CDC process.
Test 1 Observations
Looking at the comparison metrics, we observed that the PostgreSQL replication slot with the test_decoding plugin (database-1-instance-1) waited for the DMS task to consume the entire transaction after decoding. The same didn't happen with the pglogical plugin (database-3-instance-1). Also, latency was observed in the DMS task metrics for the test_decoding replication, but not for the pglogical replication. The reason for both observations is that filtering of non-CDC tables happened at the replication slot when using the pglogical plugin; this didn't happen with the test_decoding plugin.
During the long-running transaction, when a delayed parallel insert statement runs on the CDC table, we observed longer delays in processing these insert statements with test_decoding than with pglogical. This is because test_decoding can't filter any changes at the replication slot, and the entire WAL is decoded and sent to the AWS DMS task for filtering. To preserve the order of transactions, the delayed insert statement is decoded along with the long-running transaction, but filtering is performed and applied only after the AWS DMS task consumes the replication slot changes. The DMS task consumes these changes only after the commit of the long-running transaction is seen, and therefore latency is observed. On the other hand, because filtering is done at the replication slot for pglogical, any parallel data manipulation language (DML) activity on CDC tables is consumed and applied by the AWS DMS task. Even though the long-running transaction on the non-CDC table is being decoded by the pglogical replication slot, other DML activity on CDC tables isn't hindered. The pglogical replication slot already knows that the long-running transaction on non-CDC tables doesn't need to be consumed by the DMS task, so the order of transactions is preserved only for CDC tables, and their changes are consumed and applied by the DMS task.
Test 2: Load on CDC tables
In this test scenario, we create a new replication slot on database-1 (source) with the test_decoding plugin, along with a new AWS DMS endpoint and a new DMS task that includes the CDC table where the load will be initiated. We also create a new replication slot on database-3 (source) with the pglogical plugin, along with a new DMS endpoint and a new DMS task that includes the CDC table.
Similar to the earlier test, we test two cases:
- Test 2-A: Load on the CDC table using a replication slot with test_decoding
- Test 2-B: Load on the CDC table using a replication slot with pglogical
Replication slot with test_decoding
In Test 2-A, we initiated a long-running transaction on the ticket_purchase_hist table, which is replicated (part of the CDC task), to understand the impact on the replication slot. The following table shows the configuration used for the replication slot with test_decoding.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-1 | testdecodingOnCDCTable | Source | dms_sample.ticket_purchase_hist | Transactions created on CDC table | test_decoding |
| Database-2 | testdecodingOnCDCTable | Target | dms_sample.ticket_purchase_hist | Transactions created on CDC table | test_decoding |
test_decoding: Migrating database-1 to database-2
The following graphs show the CloudWatch metrics of the writer instance on the source cluster for ReplicationSlotDiskUsage and TransactionLogsDiskUsage.
Metric name: ReplicationSlotDiskUsage
As shown in the following graph, we observe that disk usage increased on the source database during the long-running transaction when using the test_decoding plugin.
Metric name: TransactionLogsDiskUsage
As shown in the following graph, we observe that disk usage for transaction logs increased on the source database during the long-running transaction.
The following graphs show the CloudWatch metrics of the AWS DMS task (testdecodingOnCDCTable) for CDC incoming changes, CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC incoming changes
As shown in the following graph, we observe an increased number of incoming changes along with the increased disk usage at the replication slot on the source because the transactions occur on tables that are part of CDC.
Metric name: CDC latency source
As shown in the following graph, we observe increased latency with the increased disk usage at the replication slot on the source.
Metric name: CDC latency target
As shown in the following graph, we observe increased latency on the target with the increased disk usage at the replication slot.
Metric name: CDC throughput rows source
As shown in the following graph, we observe increased throughput because transactions are happening on tables that are part of CDC.
The following screenshot from the AWS DMS console shows INSERT statistics for the ticket_purchase_hist table.
The CloudWatch metrics for the AWS DMS task reveal latency associated with the long-running transaction. Because we have incorporated the tables involved in the transactions into the CDC process, all of the decoded data is transmitted to AWS DMS regardless of whether the plugin filters data at the replication slot or at the AWS DMS level. Consequently, we expect to observe latency in the following graphs even when using the pglogical plugin.
Replication slot with pglogical
In Test 2-B, we wanted to observe the latency created with a similar configuration using the pglogical plugin. The following table shows the configuration used for the replication slot with pglogical.
| Database | Migration task | Role | Table part of migration task | Type of load | Plugin |
| --- | --- | --- | --- | --- | --- |
| Database-3 | pglogicalOnCDCTable | Source | dms_sample.ticket_purchase_hist | Transactions created on CDC table | pglogical |
| Database-4 | pglogicalOnCDCTable | Target | dms_sample.ticket_purchase_hist | Transactions created on CDC table | pglogical |
pglogical: Migrating database-3 to database-4
The following graphs show the CloudWatch metrics of the writer instance on the source cluster.
Metric name: ReplicationSlotDiskUsage
Based on the following graph, we observe that disk usage increased on the source database during the long-running transaction when using the pglogical plugin.
Metric name: TransactionLogsDiskUsage
Based on the following graph, we observe that disk usage for transaction logs increased on the source database during the long-running transaction.
The following graphs show the CloudWatch metrics of the AWS DMS task (pglogicalOnCDCTable) for CDC incoming changes, CDC latency source, CDC latency target, and CDC throughput rows source.
Metric name: CDC incoming changes
As shown in the following graph, we observe an increased number of incoming changes with the increased disk usage at the replication slot on the source because transactions are happening on tables that are part of CDC.
Metric name: CDC latency source
As shown in the following graph, we observe increased latency with the increased disk usage at the replication slot on the source because transactions are happening on tables that are part of CDC.
Metric name: CDC latency target
As shown in the following graph, we observe increased latency on the target with the increased disk usage at the replication slot because transactions are happening on tables that are part of CDC.
Metric name: CDC throughput rows source
As shown in the following graph, we observe increased throughput because transactions are happening on tables that are part of CDC.
The following screenshot shows INSERT statistics for the ticket_purchase_hist table in the AWS DMS console.
As anticipated, we observe similar trends in CloudWatch metrics even for the pglogical plugin.
Comparison of test_decoding and pglogical for Test 2
The following CloudWatch metric graphs illustrate the resource utilization patterns for ReplicationSlotDiskUsage and TransactionLogsDiskUsage when using the test_decoding and pglogical plugins during a data migration task. Recapping the configuration, we used test_decoding for the database-1 to database-2 migration and pglogical for the database-3 to database-4 migration. By analyzing these graphs, we can compare the resource consumption and identify the more efficient plugin for our specific migration scenario.
The orange line represents data sourced from database-1 using the test_decoding plugin, while the blue line shows data from database-3 using the pglogical plugin. Our analysis suggests that the test_decoding and pglogical graphs should exhibit similar patterns. However, the sudden dip in the blue line can be attributed to the presence of two tables in the DMS task, where one table is part of the CDC process and the other is not. Because pglogical filters data at the replication slot level, there is a sudden dip in the graph.
Test 2 Observations
We observed no major advantage in Test 2 when comparing test_decoding and pglogical with the load initiated on CDC tables. Both decoding plugins resulted in similar trends in the various CloudWatch metrics.
Increasing the number of DMS tasks leads to an increase in ReplicationSlotDiskUsage and more disk spills. This can exacerbate latency issues caused by long-running transactions.
Conclusion
In this post, in Section 1 we demonstrated how to efficiently capture spill file information using the aurora_stat_file() function and a tracking table, which helps in tuning the logical_decoding_work_mem parameter and improving the performance of the AWS DMS CDC process. In Section 2, we compared the test_decoding and pglogical plugins when replicating data between Amazon Aurora PostgreSQL-Compatible instances with AWS DMS using a specific workload. Based on our observations, each of the two plugins has certain advantages that can help improve AWS DMS CDC performance in the following use cases:
- When AWS DMS encounters bottlenecks (such as stalled tasks or high latency), pglogical offers an advantage over test_decoding by filtering non-CDC tables at the replication slot.
- Because the pglogical plugin filters data at the source replication slot, it comparatively reduces the amount of network throughput needed, especially when migrating from on-premises environments or other public clouds where higher network bandwidth is required.
- The pglogical plugin is well suited to workloads that don't require all tables to be part of the CDC process because of the way it filters non-CDC tables.
- If the bottleneck is observed at the replication slot during the decoding process, adjusting the logical_decoding_work_mem parameter can help reduce the spills to disk. Also, chunking the long-running transaction helps to reduce the disk spills at the replication slot.
- If no filtering of tables is required, meaning all the tables in the source PostgreSQL database need to be part of the CDC process, test_decoding could be used.
- test_decoding is preferred if you have a dependency on the entire row being captured and transferred for any update or delete transaction.
- test_decoding is preferred if most of the tables defined in the database don't have primary key constraints, because the entire row will be migrated during update or delete operations.
Along with choosing the right logical decoding plugin, it is crucial to adhere to migration best practices, including:
- Choosing appropriate instance sizes (AWS DMS replication instance, Aurora PostgreSQL)
- Designing migration tasks effectively
- Selecting suitable storage and network components
- Setting up monitoring with alerts
If you have any questions or suggestions, leave them in the comments section.
About the authors
Viswanatha Shastry Medipalli is a Senior Architect with the AWS ProServe team. His background spans a wide depth and breadth of expertise and experience in database migrations. He has architected and designed many successful database solutions that address challenging business requirements. He has built solutions using Oracle, SQL Server, and PostgreSQL databases for Reporting, Business intelligence (BI) applications and development support. He also has a good knowledge of automation and orchestration. His focus area is homogeneous and heterogeneous migrations of on-premises databases to Amazon.
Swanand Kshirsagar is an Architect within the Professional Services Expert Services division at AWS. He specializes in collaborating with clients to architect and implement scalable, robust, and security-compliant solutions in the AWS Cloud. His primary expertise lies in orchestrating seamless migrations, encompassing both homogenous and heterogeneous transitions, facilitating the relocation of on-premises databases to Amazon RDS and Amazon Aurora PostgreSQL with efficiency. Swanand has also delivered sessions and conducted workshops at conferences.
Abhilash Sajja is a Database Consultant on the Professional Services team at AWS. He works with customers and partners in their journey to the AWS Cloud with a focus on database migration and modernization programs. He uses his experience in different database technologies to offer guidance and technical support to customers migrating their on-premises database environment to AWS data stores.