AWS Database Blog
Using RDS Proxy with Amazon RDS Multi-AZ DB instance deployment to improve planned failover time
Amazon Relational Database Service (Amazon RDS) Multi-AZ deployments provide a simple and effective solution for achieving high availability (HA) for databases. Amazon RDS Multi-AZ deployments can have one or two standby DB instances. A deployment with one standby DB instance is called a Multi-AZ DB instance deployment, which is the focus of this post.
When you enable the Multi-AZ DB instance deployment configuration, Amazon RDS creates a fully synchronized, redundant standby instance in another Availability Zone (AZ) to maintain business continuity if an AZ fails. If your primary DB instance loses network connectivity or suffers a compute unit or storage failure, Amazon RDS detects the failure and automatically promotes the standby instance to the primary role. This process, known as a failover, helps maintain availability.
Failovers can be categorized as planned or unplanned:
- Planned failovers occur during administrative actions such as upgrading the operating system (OS) or modifying the instance class. You can also manually invoke a planned failover with the AWS CLI command reboot-db-instance --force-failover or through the Amazon RDS console, for example for disaster recovery purposes.
- Unplanned failovers are invoked by unexpected issues such as loss of network connectivity, compute unit failure or storage failure on the primary.
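As a minimal sketch, the reboot with forced failover described above can also be requested from Python with boto3, whose `reboot_db_instance` call accepts a `ForceFailover` flag. The helper names and the identifier `my-multi-az-db` are placeholders chosen for illustration, and the sketch assumes boto3 is installed and AWS credentials are configured.

```python
def build_failover_request(db_instance_identifier: str) -> dict:
    """Return the keyword arguments for a reboot that forces a Multi-AZ failover."""
    if not db_instance_identifier:
        raise ValueError("db_instance_identifier must not be empty")
    return {
        "DBInstanceIdentifier": db_instance_identifier,
        "ForceFailover": True,  # fail over to the standby as part of the reboot
    }


def force_planned_failover(db_instance_identifier: str) -> str:
    """Request the reboot and return the instance status reported by RDS."""
    import boto3  # imported lazily so the helper above works without boto3

    rds = boto3.client("rds")
    response = rds.reboot_db_instance(**build_failover_request(db_instance_identifier))
    return response["DBInstance"]["DBInstanceStatus"]
```

Calling `force_planned_failover("my-multi-az-db")` (a hypothetical identifier) would start the reboot. Keeping the request in a separate helper makes it easy to inspect the parameters before sending anything to a real instance.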
In this post, we demonstrate improvements in planned failover downtime of Multi-AZ instance deployment with Amazon RDS Proxy, a result of several optimizations made by RDS.
Achieving HA through Amazon RDS Multi-AZ DB instance deployment with RDS Proxy
In an Amazon RDS Multi-AZ DB instance deployment, the primary instance (shown in yellow in the following figure) handles read/write traffic, and the standby instance (shown in red) remains on standby, ready to take over if needed.
The following diagram illustrates an Amazon RDS Multi-AZ DB instance deployment operating in its normal connected state. In this configuration, two active Amazon Elastic Compute Cloud (Amazon EC2) instances run in separate Availability Zones. Each instance manages a set of Amazon Elastic Block Store (Amazon EBS) volumes containing a full copy of the data, and a storage-level replication layer keeps the primary instance’s volumes synchronized with the standby instance’s volumes.
The database application (DB APP, shown in green in the preceding figure) uses DNS (shown in orange) to retrieve the address of the current external endpoint providing access to the data. In this example, DNS is directing the application (DB APP) to the primary instance, serving the primary copy of the data that is available in Availability Zone 1.
In the event of a failure, Amazon RDS automatically switches the roles of the primary and standby instances and updates the IP address associated with the database’s DNS name (hostname). Because the hostname stays the same, client applications keep their connection settings across the failover. However, clients reach the new primary only after the updated DNS record propagates, which can take up to 35 seconds.
RDS Proxy avoids this DNS propagation delay of up to 35 seconds by continuously monitoring both instances, so it does not depend on the DNS update to find the new primary. This allows RDS Proxy to deliver a faster failover response for client applications, improving availability during failovers. To set up RDS Proxy with your Amazon RDS Multi-AZ DB instance deployment, refer to Connecting to a database through RDS Proxy.
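To make the DNS step concrete, the following sketch polls a hostname until it resolves to a different IP address and reports how long that took from the client's point of view. It is only an illustration: the function name, the polling defaults, and the injectable `resolve`, `sleep`, and `clock` parameters are assumptions made here, and local resolver caching can change what a real client observes.

```python
import socket
import time
from typing import Callable, Optional


def wait_for_address_change(
    hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
    poll_interval: float = 1.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll DNS until the hostname resolves to a new IP address.

    Returns the elapsed seconds when a change is seen, or None on timeout.
    """
    start = clock()
    original = resolve(hostname)
    while clock() - start < timeout:
        sleep(poll_interval)
        try:
            current = resolve(hostname)
        except OSError:
            continue  # resolution can fail briefly; keep polling
        if current != original:
            return clock() - start
    return None
```

Started just before a failover against the database endpoint, this gives a rough client-side view of when the hostname begins pointing at the new primary.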
In a Multi-AZ DB instance deployment, Amazon RDS first carries out maintenance operations such as instance class modifications and OS upgrades on the standby instance (step 1 of the following figure). Once the standby catches up with the primary, Amazon RDS performs a planned failover (step 2), making the standby the new primary, and then finishes maintenance on the old primary, which is now the standby (step 3). When complete, Amazon RDS reconnects the primary and standby to resume storage-level replication for high availability. This approach reduces downtime because the only interruption to your application happens during the brief planned failover, which affects database connections and write operations. The following figure depicts the high-level process that Amazon RDS uses for most maintenance operations on a Multi-AZ DB instance deployment.
We have implemented several improvements to the planned failover process (step 2) and to database restart times for RDS for MySQL, MariaDB, and PostgreSQL. Combined with RDS Proxy, these optimizations reduce downtime and lessen the impact on applications during maintenance operations such as instance class modifications, OS upgrades, and reboots with forced failover for disaster recovery requirements.
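For illustration, an instance class modification, one of the maintenance operations that goes through the planned failover described above, can be requested with boto3's `modify_db_instance`. This is a sketch under assumptions: the helper names and the identifier `my-multi-az-db` are hypothetical, and boto3 with configured AWS credentials is required to actually send the request.

```python
def build_instance_class_change(
    db_instance_identifier: str,
    new_instance_class: str,
    apply_immediately: bool = True,
) -> dict:
    """Return the keyword arguments for an instance class modification."""
    return {
        "DBInstanceIdentifier": db_instance_identifier,
        "DBInstanceClass": new_instance_class,
        # False defers the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


def modify_instance_class(db_instance_identifier: str, new_instance_class: str) -> str:
    """Send the modification and return the instance status reported by RDS."""
    import boto3  # imported lazily so the helper above works without boto3

    rds = boto3.client("rds")
    response = rds.modify_db_instance(
        **build_instance_class_change(db_instance_identifier, new_instance_class)
    )
    return response["DBInstance"]["DBInstanceStatus"]
```

For example, `modify_instance_class("my-multi-az-db", "db.r5.large")` would request a change to db.r5.large for the placeholder instance.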
Benchmarking
To assess the impact of these optimizations, we conducted 100 tests on an Amazon RDS Multi-AZ DB instance deployment integrated with RDS Proxy under a minimal write workload, and averaged the write downtime before and after the optimizations. We tracked this downtime with an application that measures the period between the first write failure and the next successful write. In our testing, we observed up to a 4.9x reduction in downtime during the ‘instance modify’ operation, up to a 4.8x reduction during ‘OS upgrade’, and up to a 3x reduction during reboots with forced (planned) failovers. The results for each of the three engines (RDS for MySQL, MariaDB, and PostgreSQL) are shown in the following figures. These results are not absolute and may vary depending on your specific workloads.
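The measurement described above, the time between the first write failure and the next successful write, can be sketched as a small probe. This is not the benchmarking application used for these results; `try_write` is a hypothetical callable that performs one write through the RDS Proxy endpoint and raises an exception on failure, and the polling defaults are arbitrary.

```python
import time
from typing import Callable, Optional


def measure_write_downtime(
    try_write: Callable[[], None],
    poll_interval: float = 0.1,
    max_duration: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds between the first failed write and the next successful one.

    Returns None if no outage starts and ends within max_duration.
    """
    start = clock()
    first_failure: Optional[float] = None
    while clock() - start < max_duration:
        attempt_time = clock()
        try:
            try_write()
        except Exception:
            # A real probe would catch the database driver's error types.
            if first_failure is None:
                first_failure = attempt_time
        else:
            if first_failure is not None:
                return clock() - first_failure
        sleep(poll_interval)
    return None
```

The resolution of the result is bounded by `poll_interval`, so a shorter interval gives a finer measurement at the cost of more write attempts.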
The following graph compares the write downtime during the modify instance class operation from db.r5.xlarge to db.r5.large before and after optimizations using the default parameter group.
The following graph compares the write downtime during the OS upgrade operation before and after optimizations on instance class db.r5.xlarge using the default parameter group.
The following graph compares the write downtime during the reboot-with-force-failover operation before and after optimizations on instance class db.r5.xlarge using the default parameter group.
Note: Although Amazon RDS has optimized planned failover downtime, including database start times, the overall failover can still take longer when the database engine needs an extended crash recovery, because crash recovery delays the database restart.
Conclusion
In this post, we showed you the downtime reductions possible when you integrate an Amazon RDS for MySQL, MariaDB, or PostgreSQL Multi-AZ DB instance deployment with RDS Proxy. The three areas where these improvements have the most impact are:
- Modify instance class – Downtime reduced by up to 4.9 times for Amazon RDS for MariaDB, 4.3 times for Amazon RDS for MySQL, and 3.3 times for Amazon RDS for PostgreSQL
- OS upgrades – Downtime reduced by up to 4 times for Amazon RDS for MariaDB, 4.8 times for Amazon RDS for MySQL, and 3.4 times for Amazon RDS for PostgreSQL
- Reboot with force failover – Downtime reduced by up to 3 times for Amazon RDS for MariaDB, 2.5 times for Amazon RDS for MySQL, and 1.5 times for Amazon RDS for PostgreSQL
These improvements are now available across all Amazon RDS for MySQL, MariaDB, and PostgreSQL DB instances. You do not need to make any changes to your workload or DB instance to receive these benefits. We invite you to try out these operations on your DB instances to observe the impact of these improvements. If you have any questions or feedback, share them with us in the comments section.
About the author
Rajat Jain is a Software Development Engineer on the Amazon RDS Open Source Engines team. He specializes in architecting and implementing robust control plane components for open-source database engines. His expertise spans performance optimization, scalability enhancements, and high availability for RDS Open Source database services.