AWS Cloud Operations Blog
Leverage the AWS Resilience Lifecycle Framework to assess and improve the resilience of applications using AWS Resilience Hub
As more customers advance in their cloud adoption journey, they recognize that simply migrating applications to the cloud does not automatically ensure resilience. To be resilient, applications need to be designed to withstand disruptions from infrastructure failures, dependent services, misconfigurations, and intermittent network connectivity issues. While many organizations understand the importance of building resilient applications, some struggle with where to start. To address this customer challenge, AWS released the Resilience Lifecycle Framework in October 2023. The framework provides prescriptive guidance on building resilience into every stage of an application’s lifecycle. It enables customers to design new applications with improved resilience and to evaluate already deployed applications to identify optimization opportunities.
In this blog post, we will show how you can leverage the AWS Resilience Lifecycle Framework to efficiently assess and improve the resilience of your applications using AWS Resilience Hub. Resilience Hub is an AWS service that enables you to define, track, and manage the resilience of your applications, and it helps you improve their resilience using AWS Well-Architected best practices. To demonstrate how Resilience Hub simplifies the resilience evaluation of an application using the Resilience Lifecycle Framework, we executed the post-deployment activities within the Evaluate & Test stage of the framework on a sample application.
We created a 3-tier application using this CloudFormation template on AWS. Figure 3 below illustrates our solution architecture.
The application contained the following components:
- An Application Load Balancer spanning two Availability Zones
- An Amazon EC2 Auto Scaling group of EC2 instances
- An Amazon Relational Database Service (Amazon RDS) instance
- An Amazon Elastic File System (Amazon EFS) file system for static content
Our objective was to evaluate whether this application meets the desired business resilience goals in the event of a service disruption. The recovery point objective (RPO) goal was 1 hour (the maximum amount of acceptable data loss), and the recovery time objective (RTO) goal was 4 hours (the maximum acceptable time to restore the application to its primary operating state after a failure). Next, we used AWS Resilience Hub to perform the post-deployment activities prescribed in the AWS Resilience Lifecycle Framework.
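As a reference, the following minimal boto3 sketch shows how these RTO and RPO targets could be captured as a Resilience Hub resiliency policy. The policy name and application tier are illustrative assumptions, not values from our deployment.

```python
import boto3

resiliencehub = boto3.client("resiliencehub")

FOUR_HOURS = 4 * 60 * 60  # RTO goal in seconds
ONE_HOUR = 60 * 60        # RPO goal in seconds

# Illustrative policy: apply the same RTO/RPO goals to every disruption type.
# The policy name and tier below are assumptions for this sketch.
response = resiliencehub.create_resiliency_policy(
    policyName="three-tier-app-policy",
    tier="Important",
    policy={
        disruption: {"rtoInSecs": FOUR_HOURS, "rpoInSecs": ONE_HOUR}
        for disruption in ("Software", "Hardware", "AZ", "Region")
    },
)
policy_arn = response["policy"]["policyArn"]
print("Created resiliency policy:", policy_arn)
```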
Resilience Assessment
We imported the application into Resilience Hub and ran an assessment against a pre-defined resilience policy that reflects the business objectives for the application: do not lose data beyond the RPO and do not disrupt the user experience beyond the RTO. The steps to run an application resilience assessment in Resilience Hub can be found here. The assessment report showed that our resilience targets (RTO/RPO) could not be met with the current application architecture. Figure 4 shows the application’s compliance status as ‘Policy breached’, with the estimated RTO and RPO reported as unrecoverable because the application did not meet the resiliency policy’s objectives.
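If you prefer to script this step, here is a hedged boto3 sketch that registers the application from its CloudFormation stack and starts an assessment. The stack, application, and assessment names are placeholders, and policy_arn refers to the policy created in the earlier sketch.

```python
import boto3

resiliencehub = boto3.client("resiliencehub")
cloudformation = boto3.client("cloudformation")

policy_arn = "<resiliency-policy-arn>"  # from the create_resiliency_policy sketch above

# Resolve the CloudFormation stack ARN for the sample application (stack name assumed).
stack_arn = cloudformation.describe_stacks(StackName="three-tier-app")["Stacks"][0]["StackId"]

# Register the application in Resilience Hub and attach the resiliency policy.
app = resiliencehub.create_app(name="three-tier-app", policyArn=policy_arn)["app"]

# Import the stack's resources into the draft app version. In practice, wait for the
# import to finish (describe_draft_app_version_resources_import_status) before publishing.
resiliencehub.import_resources_to_draft_app_version(appArn=app["appArn"], sourceArns=[stack_arn])
resiliencehub.publish_app_version(appArn=app["appArn"])

# Start an assessment against the published ("release") application version.
assessment = resiliencehub.start_app_assessment(
    appArn=app["appArn"],
    appVersion="release",
    assessmentName="initial-assessment",
)["assessment"]
print("Assessment status:", assessment["assessmentStatus"])
```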
Upon expanding each unrecoverable condition, we found that Resilience Hub identified the following resilience gaps in our application:
- The file system lacked a backup plan to protect against accidental deletion or data corruption.
- The RDS database was not designed to withstand an Availability Zone (AZ) or Region failure.
- Amazon EFS was not designed to withstand a Region failure.
AWS Resilience Hub has a recommendations tab that provides recommendations based on the AWS Well-Architected Framework to address the identified gaps. Here is an example of Amazon EFS resilience improvement recommendations:
To improve the resilience of our application, we implemented the AWS Resilience Hub recommendations by updating the CloudFormation stack using an updated template. We then ran a new Resilience Hub assessment and found the compliance status changed to ‘Policy met’, indicating that the architecture improvements satisfied the resilience policy goals.
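For reference, a stack update like this can also be applied programmatically. The following sketch assumes placeholder values for the stack name and the S3 location of the updated template.

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Apply the updated template to the existing stack (stack name and URL are placeholders).
cloudformation.update_stack(
    StackName="three-tier-app",
    TemplateURL="https://example-bucket.s3.amazonaws.com/updated-template.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],  # only required if the template manages IAM resources
)

# Wait for the update to finish before re-running the Resilience Hub assessment.
cloudformation.get_waiter("stack_update_complete").wait(StackName="three-tier-app")
print("Stack update complete")
```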
Drift detection
To detect changes to our application’s configuration that could impact its resilience targets, we simulated drift by changing the EFS backup frequency from hourly to daily. We switched the backup frequency using the console steps listed below (a scripted equivalent is sketched after the list):
- Open the AWS Backup console at https://console.thinkwithwp.com/backup
- In the navigation pane, choose Backup plans.
- Choose the backup plan ‘default’ and choose Edit.
- Choose the backup rule ‘daily-backups’ and choose Edit. Change the Backup frequency to Daily, and then choose Save.
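The same change can be made through the AWS Backup API. This is a hedged sketch: the plan name ‘default’ and rule name ‘daily-backups’ follow the console steps above, and the daily cron expression is an illustrative choice.

```python
import boto3

backup = boto3.client("backup")

# Find the backup plan named 'default' (name taken from the console steps above).
plan_id = next(
    p["BackupPlanId"]
    for p in backup.list_backup_plans()["BackupPlansList"]
    if p["BackupPlanName"] == "default"
)
plan = backup.get_backup_plan(BackupPlanId=plan_id)["BackupPlan"]

rules = []
for rule in plan["Rules"]:
    # The update API takes rule *inputs*, so drop the read-only RuleId field.
    rule = {k: v for k, v in rule.items() if k != "RuleId"}
    if rule["RuleName"] == "daily-backups":
        rule["ScheduleExpression"] = "cron(0 5 * * ? *)"  # once a day (05:00 UTC) instead of hourly
    rules.append(rule)

backup.update_backup_plan(
    BackupPlanId=plan_id,
    BackupPlan={"BackupPlanName": plan["BackupPlanName"], "Rules": rules},
)
```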
When we ran the resilience assessment again, we found the application resiliency status had ‘drifted’.
The compliance status changed to ‘Policy breached’, as the estimated RPO had increased from one hour to a day. This exceeded the defined RPO threshold, indicating our application could no longer meet the RPO objectives.
With Resilience Hub drift detection, customers can detect application drift daily and get notified when drift is detected, which helps them maintain their desired resiliency objectives.
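As an illustration, drift status can also be checked programmatically. The sketch below assumes a placeholder application ARN and reads the drift and compliance fields Resilience Hub reports for an application; treat the field names as assumptions if your SDK version differs.

```python
import boto3

resiliencehub = boto3.client("resiliencehub")

app_arn = "<resilience-hub-app-arn>"  # placeholder for your application's ARN
app = resiliencehub.describe_app(appArn=app_arn)["app"]

print("Compliance status:", app.get("complianceStatus"))
print("Drift status:", app.get("driftStatus"))  # e.g. NotDetected / Detected

if app.get("driftStatus") == "Detected":
    # Re-run an assessment to see which component caused the drift.
    resiliencehub.start_app_assessment(
        appArn=app_arn, appVersion="release", assessmentName="post-drift-check"
    )
```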
Synthetic testing
The Resilience Lifecycle Framework recommends creating configurable software, often referred to as canaries, that runs on a schedule and exercises your application APIs to simulate the end-user experience. When you run a resilience assessment in Resilience Hub, the service provides many recommendations to improve your application’s resilience. One of these is to create alarms that monitor key metrics, so you are notified and can take corrective action when configured thresholds are breached. Resilience Hub also allows us to set up alarms that monitor synthetic canaries.
To demonstrate, we used Amazon CloudWatch Synthetics to create canaries to monitor our application endpoint every minute and get notifications via SNS in the event of an issue.
To deploy the resources required for synthetic monitoring, we created a new CloudFormation stack using this template. For the parameter ‘ResHubStackName’, we entered the name of the CloudFormation stack that was created to deploy our application. Once the stack was created, our synthetic canary was visible in the Amazon CloudWatch console.
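The following sketch shows one way this stack could be created with boto3, passing the ResHubStackName parameter described above. The stack name and template URL are placeholders.

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Deploy the synthetic-monitoring template, pointing ResHubStackName at the
# application's stack (stack names and template URL are placeholders).
cloudformation.create_stack(
    StackName="synthetic-monitoring",
    TemplateURL="https://example-bucket.s3.amazonaws.com/synthetics-canary.yaml",
    Parameters=[
        {"ParameterKey": "ResHubStackName", "ParameterValue": "three-tier-app"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the canary needs an execution role
)
cloudformation.get_waiter("stack_create_complete").wait(StackName="synthetic-monitoring")
```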
Next, we navigated to the previously conducted Resilience Hub assessment and clicked on Operational recommendations. We clicked on the Alarms tab and used the filter ‘Canary’ to fetch Alarms recommendations for synthetic monitoring.
To deploy the synthetic canary alarm, we followed the steps listed below:
- Selected the filtered Alarm and clicked on Create CloudFormation template button.
- Navigated to the Templates tab and clicked on the template we had just created.
- Clicked on the link under Templates S3 Path and navigated to the S3 location where the template was stored. Navigated into the Alarm folder to find the CloudFormation template (a JSON file).
- Selected the file and clicked on Copy URL to get the Amazon S3 URL of the template.
- Navigated to CloudFormation console and created a new Stack using the copied S3 URL as template path.
- Entered a stack name and provided the following input parameters:
- CanaryName – ‘api-canary’
- SNSTopicARN – The ARN of the SNS topic to which alarm state changes are published (the topic must be in the same Region as the application resources).
Once the stack creation completed successfully, we ran a new assessment and reviewed the updated operational recommendations to confirm whether the alarms had been created. As you can see in the visual below, we were able to confirm that the alarm had been created.
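You can also confirm the alarm’s presence and state with a quick script. This hedged sketch assumes the generated alarm names share a ‘ResilienceHub-’ prefix; adjust the prefix to match the names in your account.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# List the canary alarms created from the Resilience Hub template and show their state.
# The name prefix is an assumption; adjust it to match your alarm names.
alarms = cloudwatch.describe_alarms(AlarmNamePrefix="ResilienceHub-")["MetricAlarms"]
for alarm in alarms:
    print(alarm["AlarmName"], alarm["StateValue"], alarm["MetricName"])
```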
Chaos engineering
Chaos engineering is the practice of intentionally injecting failures into a system, in a controlled way, to test its resiliency and fault tolerance. Within Resilience Hub, you can create and run AWS Fault Injection Service (AWS FIS) experiments, which mimic real-life disruptions to your application to help you better understand dependencies and uncover potential weaknesses. Resilience Hub makes it easy to incorporate chaos experiments into your cloud applications. To set up chaos experiments in AWS Resilience Hub, read the AWS Chaos Engineering blog post.
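As a small illustration, once an FIS experiment template exists (for example, one created from a Resilience Hub recommendation), it can be run and monitored with a few API calls. The template ID below is a placeholder.

```python
import time
import uuid

import boto3

fis = boto3.client("fis")

# Start an experiment from an existing template (ID is a placeholder) and poll its state.
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),  # idempotency token
    experimentTemplateId="<fis-experiment-template-id>",
)["experiment"]

while True:
    state = fis.get_experiment(id=experiment["id"])["experiment"]["state"]
    print("Experiment state:", state["status"])
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```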
Resilience Hub centralizes and automates much of the complexity associated with running effective chaos experiments. Rather than relying on manual, ad hoc experiments, you gain an automated mechanism to identify weaknesses, validate improvements to your system’s fault tolerance, and establish a mature chaos engineering practice across your critical applications. Chaos engineering is a best practice for modern cloud applications, and Resilience Hub streamlines its implementation as part of a comprehensive resiliency validation strategy.
Disaster Recovery (DR) Testing
Regularly exercising your DR strategy is the only way to ensure that the workload will operate as designed and deliver the resilience you expect. Customers should regularly test DR procedures so their teams are fully prepared to handle disaster scenarios. For example, if your application uses two AWS Regions as primary and backup locations, you should test the failover and failback mechanisms between Regions to ensure the recovery mechanisms work as designed. The Reliability Pillar whitepaper covers AWS best practices for testing the resilience of your workload.
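As one hedged example of such a drill, assuming the database tier keeps a cross-Region RDS read replica, the promotion step of a failover test could be scripted as follows. The Region and instance identifier are placeholders.

```python
import boto3

# Promote the cross-Region read replica in the secondary Region so the rest of the
# failover runbook (DNS cutover, application validation) can be exercised.
# The Region and DB instance identifier are placeholders.
rds_secondary = boto3.client("rds", region_name="us-west-2")
rds_secondary.promote_read_replica(DBInstanceIdentifier="app-db-replica")

# Wait until the promoted instance is available before continuing the drill.
rds_secondary.get_waiter("db_instance_available").wait(DBInstanceIdentifier="app-db-replica")
print("Replica promoted; continue with DNS failover and application validation.")
```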
Conclusion
In this blog post, we showed you how you can use AWS Resilience Hub to assess and improve the resilience of an application using the Resilience Lifecycle Framework recommendations. By following the steps outlined in this post, you can improve the resilience of your applications, leading to increased customer satisfaction and improved business outcomes.