AWS Cloud Operations Blog
How to use the AWS Resilience Hub score
Time to read | 10 minutes |
Time to complete | 1 hour |
Cost to complete | $15 per day (WordPress Multi-AZ application, AWS Resilience Hub application and recommendations) |
Learning level | 200 – Intermediate |
Services used | AWS Resilience Hub, AWS CloudFormation, Amazon CloudWatch, AWS Fault Injection Simulator |
AWS Resilience Hub provides a central place to define, validate, and track the resiliency of your AWS applications using AWS Well-Architected best practices. Customers can get a comprehensive view of their overall application portfolio resilience status, their associated resilience scores, and actionable recommendations.
This resilience score is designed to assess the readiness of your environment for resilience, not only from a technical perspective with resiliency policies and alarms, but also by validating the successful completion of your recommended Standard Operating Procedures (SOPs) and fault injection experiments with AWS Fault Injection Simulator (FIS).
In this post, I will show you how to leverage the resilience score to improve the resilience of your applications.
You will learn how to reach and maintain a resilience score of 100%, how to customize the recommendations that Resilience Hub makes from your assessments, and how to investigate unexpected drops in your score. Note: getting a score of 100% is not mandatory for every customer. For example, if you do not wish to use FIS experiments, then your maximum reachable score will be 80%.
We will use a multi-tier, scalable WordPress solution to illustrate how the resilience score works. The associated AWS CloudFormation templates are available in the AWS CloudFormation documentation. Choose your AWS Region (today I will be using us-west-2, Oregon) and deploy the WordPress scalable and durable template.
Prerequisites
- An AWS account.
- An application monitored by AWS Resilience Hub (you can either use the AWS CloudFormation template provided below or use your own application). Refer to Measure and Improve Your Application Resilience with AWS Resilience Hub to get started with AWS Resilience Hub.
- Basic understanding of AWS Fault Injection Simulator, Amazon CloudWatch and AWS Systems Manager.
How the AWS Resilience Hub score works
Before diving into the resilience score itself, let us review the main steps of the Resilience Hub workflow. After you add an application to Resilience Hub, you can run an assessment on the supported application components and receive resiliency and operational recommendations.
Resiliency recommendations help you meet or optimize the RPO and RTO targets defined in your resiliency policy at multiple levels, called disruption types: Application, Infrastructure, Availability Zone (AZ) and Region.
Operational recommendations will provide you with Alarms, SOPs and FIS experiments, all deployable within minutes with AWS CloudFormation.
The resilience score is an Amazon CloudWatch metric ranging from 0 (minimum) to 100 (maximum) that is calculated every time a new assessment is run. Assessments can be run manually or on a daily basis with scheduled assessments.
To get a resilience score of 100%, your application must:
- Be fully compliant with your configured resiliency policy for all the disruption types. If your application is deployed within a single region, the optional region disruption type will be ignored and will not impact your score.
- Have its recommended alarms both implemented and in the ‘OK’ or ‘Alarm – not missing data’ state.
- Have its recommended SOPs both implemented and successfully executed within the past 30 days.
- Have its recommended FIS experiments both implemented and successfully executed within the past 30 days.
As mentioned earlier, getting a score of 100% may not be a requirement for your organization. If you do not wish to implement all the recommendations provided by Resilience Hub, your actual target will be lower.
The following table shows the weight for each of these recommendations:
Recommendation type | Weight |
Meeting resiliency policy | 40 percent |
Alarms | 20 percent |
SOPs | 20 percent |
FIS experiments | 20 percent |
The resiliency policy recommendation, which accounts for 40% of the total resilience score, is calculated based on the disruption types with the following weights:
Disruption type | Weight |
Region | 10 percent |
Availability Zone | 20 percent |
Infrastructure | 30 percent |
Application | 40 percent |
Any non-compliant disruption type, triggered alarm or failed SOP/FIS experiment will result in partial points being granted during the assessment.
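To make the weighting concrete, here is a minimal Python sketch that estimates a score from the two tables above. It is not the formula Resilience Hub uses internally. It assumes that partial points are proportional to the fraction of items completed in each category, and that for a single-Region application the Region weight is dropped and the remaining disruption weights are renormalized. The function and variable names are hypothetical.

```python
# Weights copied from the two tables above (percent of the total score).
RECOMMENDATION_WEIGHTS = {"policy": 40, "alarms": 20, "sops": 20, "fis": 20}
DISRUPTION_WEIGHTS = {
    "Region": 10,
    "Availability Zone": 20,
    "Infrastructure": 30,
    "Application": 40,
}


def policy_fraction(compliant, multi_region=False):
    """Return the fraction (0..1) of the policy share that is earned.

    `compliant` maps a disruption type to True or False. For a single-Region
    application the Region type is ignored and the remaining weights are
    renormalized (an assumption of this sketch).
    """
    weights = {
        name: weight
        for name, weight in DISRUPTION_WEIGHTS.items()
        if multi_region or name != "Region"
    }
    earned = sum(w for name, w in weights.items() if compliant.get(name, False))
    return earned / sum(weights.values())


def estimate_score(compliant, alarms=0.0, sops=0.0, fis=0.0, multi_region=False):
    """Estimate the score from per-category completion fractions (0..1)."""
    fractions = {
        "policy": policy_fraction(compliant, multi_region),
        "alarms": alarms,
        "sops": sops,
        "fis": fis,
    }
    return sum(RECOMMENDATION_WEIGHTS[k] * fractions[k] for k in RECOMMENDATION_WEIGHTS)
```

With every disruption type compliant and nothing else implemented, `estimate_score` returns 40. Setting `alarms=1` and `sops=1` while leaving `fis=0` gives 80, which matches the maximum mentioned earlier for customers who do not use FIS experiments.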
For more information on the resilience score, please refer to the Resilience Hub documentation.
Example: Improving the resilience of a multi-tier application using the AWS Resilience Hub score
In this example I have deployed a multi-tier WordPress application through CloudFormation and added the resulting stack to Resilience Hub. For this scenario I am using a suggested resiliency policy named Critical Application, which is a single-Region policy with a 1h RPO/RTO for the Infrastructure and Availability Zone disruption types, and a 1h RPO / 4h RTO for the Application disruption type.
Refer to Measure and Improve Your Application Resilience with AWS Resilience Hub to get started with Resilience Hub.
Step 1: Run your first assessment
Our first step is to run our very first assessment. This assessment will look at your application components (in our case the database instance and web server group), validate our resiliency policy, and come up with actionable recommendations.
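If you prefer to script this step rather than use the console, the following sketch shows one way to start an assessment and wait for it to finish. It assumes a client that behaves like `boto3.client("resiliencehub")` and an application version that has already been published. The function name, the polling interval and the timeout are choices made for this example only.

```python
import time


def run_assessment(client, app_arn, name, poll_seconds=15, max_polls=80):
    """Start an assessment and poll until it succeeds or fails.

    `client` is expected to behave like boto3.client("resiliencehub").
    Returns the final assessment dictionary.
    """
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # assumes the application version was published
        assessmentName=name,
    )
    arn = started["assessment"]["assessmentArn"]
    for _ in range(max_polls):
        assessment = client.describe_app_assessment(assessmentArn=arn)["assessment"]
        if assessment["assessmentStatus"] in ("Success", "Failed"):
            return assessment
        time.sleep(poll_seconds)
    raise TimeoutError(f"assessment {arn} did not finish in time")


# Example usage (requires boto3 and AWS credentials; the ARN is a placeholder):
# import boto3
# hub = boto3.client("resiliencehub", region_name="us-west-2")
# result = run_assessment(hub, "<your application ARN>", "first-assessment")
# print(result["assessmentStatus"], result.get("resiliencyScore"))
```

Passing the client as an argument keeps the function easy to exercise with a stub before pointing it at a real account.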
Because this is our first resiliency assessment, I am not expecting to get any points for the alarms, SOPs and FIS experiments (20% each): the tool is only about to give me its first recommendations. If your application meets your resiliency policy for all the disruption types (Application, Infrastructure, Availability Zone and, optionally, Region), you can expect to get a 40% score for now.
Note: If any of the disruption types displayed in Figure 4 did not satisfy the requirements of our resiliency policy, you would only have received partial points for the resiliency policy recommendation.
Step 2: Implementing alarms, SOPs and FIS experiment templates
The assessment report includes the operational recommendations that are now deployable with CloudFormation. I recommend that you start with the alarms first, as CloudWatch alarms will be used by FIS experiment templates to validate the tests later.
The recommendations provided by Resilience Hub will be specific to your environment. Here you will notice that Resilience Hub has provided several FIS experiments to test our Amazon Relational Database Service (Amazon RDS) database, Amazon EC2 Auto Scaling group and Multi-AZ design. I have also received 10 recommended alarms and 3 SOPs (not shown here).
Step 3: Implementing prerequisites for the alarms
Some alarms require manual configuration to work properly. For example, specific alarms may need operational metrics from your Amazon Elastic Compute Cloud (Amazon EC2) instances, such as memory utilization, which in turn requires a specific CloudWatch agent configuration.
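As an illustration of what such a configuration can look like, the sketch below builds a small CloudWatch agent configuration that collects memory utilization and prints it as JSON. The collection interval and the use of the instance ID as a dimension are assumptions for this example. The setup instructions attached to each recommended alarm remain the reference for the exact metric names and dimensions that alarm expects.

```python
import json

# Minimal CloudWatch agent configuration that publishes memory utilization.
agent_config = {
    "metrics": {
        # Adds the instance ID as a dimension so alarms can target one instance.
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": 60,  # seconds, chosen for this example
            }
        },
    }
}

# Print the JSON document so it can be saved as the agent configuration file.
print(json.dumps(agent_config, indent=2))
```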
You can access the setup instructions by clicking on the red “Configuration” warning sign.
Step 4: Customizing alarms, SOPs and FIS experiment templates
You may need to customize your recommendation settings to get a proper resilience strategy that fits your environment. Take some time to review and customize your alarms and FIS experiment templates based on your requirements.
For example, you may want to extend the duration of your stress tests, terminate a specific process, or update the expected recovery time in your FIS experiment template.
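As a sketch of the first customization, the helper below rewrites the duration of one action in an FIS experiment template. FIS expresses durations in ISO 8601 format (for example, `PT5M` for five minutes). The `actions` dictionary and the action name `stressCpu` are hypothetical stand-ins for whatever your generated template contains.

```python
import copy


def iso8601_duration(seconds):
    """Format a positive whole number of seconds as an ISO 8601 duration."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    parts = [
        f"{hours}H" if hours else "",
        f"{minutes}M" if minutes else "",
        f"{secs}S" if secs else "",
    ]
    return "PT" + "".join(parts)


def extend_action_duration(actions, action_name, seconds):
    """Return a copy of `actions` with a new duration for one action."""
    updated = copy.deepcopy(actions)
    updated[action_name]["parameters"]["duration"] = iso8601_duration(seconds)
    return updated


# Hypothetical fragment of an experiment template's actions.
actions = {
    "stressCpu": {
        "actionId": "aws:ssm:send-command",
        "parameters": {"duration": "PT5M"},
    }
}
longer = extend_action_duration(actions, "stressCpu", 15 * 60)
print(longer["stressCpu"]["parameters"]["duration"])
```

Working on a copy leaves the original template untouched, which makes it easier to compare the two versions before redeploying the stack.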
Step 5: Validating alarms, SOPs and FIS experiments
Now that you have deployed and configured all the recommendations provided by Resilience Hub, you will need to successfully run your SOPs and FIS experiments to increase your score. Note that your CloudWatch alarms must also be in the ‘OK’ or ‘Alarm – not missing data’ state to receive the maximum resilience score for your application.
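To check alarm states in bulk before your next assessment, you could summarize them with a short script such as the sketch below. It assumes a client that behaves like `boto3.client("cloudwatch")` and that your recommended alarms share a common name prefix, which is an assumption about how you named them. The function names are illustrative.

```python
from collections import Counter


def summarize_alarm_states(metric_alarms):
    """Count alarms per state and list the ones that are missing data."""
    counts = Counter(alarm["StateValue"] for alarm in metric_alarms)
    missing = [
        alarm["AlarmName"]
        for alarm in metric_alarms
        if alarm["StateValue"] == "INSUFFICIENT_DATA"
    ]
    return dict(counts), missing


def fetch_metric_alarms(cloudwatch, name_prefix):
    """Collect all metric alarms whose name starts with `name_prefix`."""
    alarms = []
    paginator = cloudwatch.get_paginator("describe_alarms")
    for page in paginator.paginate(AlarmNamePrefix=name_prefix):
        alarms.extend(page["MetricAlarms"])
    return alarms


# Example usage (requires boto3 and AWS credentials; the prefix is a placeholder):
# import boto3
# cw = boto3.client("cloudwatch", region_name="us-west-2")
# counts, missing = summarize_alarm_states(fetch_metric_alarms(cw, "<your prefix>"))
```

Alarms reported in `missing` are the ones to revisit first, since they usually point to a metric that is not being published yet.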
Your resilience score will update on the main dashboard after your next assessment.
You will need to run your SOPs and FIS experiments at least every 30 days to keep your resilience score from drifting.
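A simple way to stay ahead of the 30-day window is to track the last successful run of each SOP and experiment and flag the ones that are about to expire. The sketch below assumes you record those timestamps yourself. The dictionary layout and the item names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)


def stale_items(last_success, now=None):
    """Return names whose last successful run is missing or older than 30 days.

    `last_success` maps an SOP or experiment name to a timezone-aware
    datetime, or to None if it has never completed successfully.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, finished in last_success.items()
        if finished is None or now - finished > MAX_AGE
    )


# Hypothetical records of the last successful runs.
now = datetime(2024, 3, 31, tzinfo=timezone.utc)
runs = {
    "rds-failover-experiment": datetime(2024, 3, 20, tzinfo=timezone.utc),
    "asg-scale-out-sop": datetime(2024, 2, 1, tzinfo=timezone.utc),
    "az-outage-experiment": None,
}
print(stale_items(runs, now=now))
```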
Troubleshooting a drifting resilience score
Resilience Hub is a service that you can use to frequently assess the resilience of your infrastructure and the status of your SOPs and FIS experiments. Achieving a score of 100% is an important first step, but remember that without proper maintenance your score may decrease over time.
Here are some of the common explanations for a drifting resilience score:
- Your application is no longer meeting your resiliency policy: check the resiliency recommendations section of your latest assessment to learn more or verify that your resiliency policy was not updated by another administrator.
- One or more of your SOPs or FIS experiments have failed to complete: it is crucial for an application to continue to operate after unexpected events. If your application takes too long to scale out or recover, or stops operating during the test campaign, your experiments will fail and your score will decrease.
- You have not run one or more SOPs or FIS experiments in the past 30 days: it is important to periodically test your resiliency strategy to confirm that your resilience mechanisms still work as expected and remain up to date.
- One or more of your alarms have been triggered: you will need to investigate your application or potentially customize your alarm settings to make them relevant to your environment.
- New recommendations are available in your latest assessment: Resilience Hub may recommend new alarms, SOPs or FIS experiments as your application evolves and grows. Check the operational recommendations section of your latest assessment and confirm that nothing is in the “Not implemented” state.
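When a score drops, comparing the per-category points of two assessments narrows down which of the explanations above applies. The following sketch assumes you have noted the points for each category from the previous and the latest assessment. The dictionary layout and the hint texts are illustrative, not an official Resilience Hub output.

```python
# Hints derived from the list of common explanations above.
HINTS = {
    "policy": "Review the resiliency recommendations and check whether the policy changed.",
    "alarms": "Look for triggered alarms or alarms that are not implemented.",
    "sops": "Check for failed SOP runs or runs older than 30 days.",
    "fis": "Check for failed FIS experiments or runs older than 30 days.",
}


def explain_drop(previous, latest):
    """Return (category, lost_points, hint) for each category that lost points."""
    findings = []
    for category, old_points in previous.items():
        new_points = latest.get(category, 0)
        if new_points < old_points:
            hint = HINTS.get(category, "Check the latest assessment.")
            findings.append((category, old_points - new_points, hint))
    return findings


# Hypothetical points taken from two consecutive assessments.
previous = {"policy": 40, "alarms": 20, "sops": 20, "fis": 20}
latest = {"policy": 40, "alarms": 20, "sops": 10, "fis": 20}
for category, lost, hint in explain_drop(previous, latest):
    print(f"{category}: -{lost} points. {hint}")
```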
Cleanup
If you deployed a test application to discover Resilience Hub, do not forget to delete any existing resources to avoid unnecessary charges.
- Remove your application from the Resilience Hub dashboard.
- Delete the CloudFormation stacks (alarms, SOPs, FIS experiments) deployed from Resilience Hub.
- If you used the multi-tier WordPress infrastructure, delete the CloudFormation stack that deployed your application.
- Delete any remaining AWS resources that you created to implement the recommendations: Amazon Simple Notification Service (Amazon SNS) topics, Amazon CloudWatch Synthetics canaries, etc.
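The first three cleanup steps can also be scripted. The sketch below assumes clients that behave like `boto3.client("resiliencehub")` and `boto3.client("cloudformation")`. The stack names are placeholders that you would replace with your own, and the function name is hypothetical. Remaining resources such as SNS topics and canaries still need to be removed separately.

```python
def clean_up(resiliencehub, cloudformation, app_arn, recommendation_stacks, app_stack=None):
    """Remove the application from Resilience Hub, then delete the stacks.

    The recommendation stacks (alarms, SOPs, FIS experiments) are deleted
    before the stack that deployed the application itself.
    """
    # forceDelete also removes the assessments attached to the application.
    resiliencehub.delete_app(appArn=app_arn, forceDelete=True)

    waiter = cloudformation.get_waiter("stack_delete_complete")
    stacks = list(recommendation_stacks)
    if app_stack:
        stacks.append(app_stack)
    for stack_name in stacks:
        cloudformation.delete_stack(StackName=stack_name)
        waiter.wait(StackName=stack_name)  # block until the stack is gone


# Example usage (requires boto3 and AWS credentials; all names are placeholders):
# import boto3
# clean_up(
#     boto3.client("resiliencehub", region_name="us-west-2"),
#     boto3.client("cloudformation", region_name="us-west-2"),
#     "<your application ARN>",
#     ["<alarms stack>", "<sops stack>", "<fis stack>"],
#     app_stack="<wordpress stack>",
# )
```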
Conclusion
Having good visibility on your application resilience mechanisms and actionable tools to validate your strategy is critical to keep your services operational over time. Assessing your applications and testing your Standard Operating Procedures (SOPs) periodically will help you keep your resilience posture up-to-date and validated.
In this blog post we saw how the resilience score can help you quickly understand the status of your resilience strategy. We learnt how the score is calculated, how to maximize it and troubleshoot drifting scores.
Let us know your feedback and get started with AWS Resilience Hub today.
About the author: