AWS Cloud Operations Blog

Shared Responsibility with AWS Resilience Hub

AWS Resilience Hub is an AWS service designed to help you define, track, and manage the resilience of your applications. This service helps you understand and improve the resilience of your workloads using AWS Well-Architected best practices, and offers both resilience and operational recommendations to enable you, the customer, to consistently meet your organizational and workload-based requirements for Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

In this blog post, we are going to look at the shared responsibility model for resilience, and how this affects considerations when using the AWS Resilience Hub service.

The Shared Responsibility Model

For a deeper dive into the shared responsibility model for resiliency, I recommend reading the whitepaper Disaster Recovery of Workloads on AWS: Recovery in the Cloud. The diagram below illustrates the Customer and AWS shared responsibilities, where AWS is responsible for resilience of the cloud, and the Customer is responsible for resilience in the cloud.

Figure 1 – Resilience is a shared responsibility between AWS and the customer

This model reflects the considerations that you need to have when using the AWS Resilience Hub service, for both operational and resilience-based recommendations. Below we will go into these different aspects with specific examples taken from the architecture used in the Resilience Hub Workshop. For more information and a self-paced walkthrough of the architecture and AWS Resilience Hub usage, you can follow the workshop guide.

Figure 2 – The example architecture from the AWS Resilience Hub Workshop

The architecture used here is a three-tier architecture consisting of an Application Load Balancer, a fleet of EC2 instances in an EC2 Auto Scaling group, and an RDS database. The EC2 instances have outbound connectivity through a NAT Gateway, and static assets for the application are stored in an S3 bucket.

Resilience Recommendations

After assessing the application, AWS Resilience Hub gives a number of resilience recommendations across the workload components in our architecture to help us meet the RTO/RPO goals that we set. To meet the full RTO/RPO policy, we need to address the recommendations for every component. A full list of the disruption types that can be managed within AWS Resilience Hub can be found in the Managing resiliency policies documentation. For this blog post, we will focus on the two recommendation options for the database component of the workload. Option 1 optimizes for both minimal changes and cost, whereas Option 2 optimizes for the best AZ RTO/RPO. Customers must make an architectural decision on which option best meets the RTO/RPO needs of their organization and workload. Both options meet the policy currently defined for the application in AWS Resilience Hub.
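To make the comparison concrete, here is a minimal sketch of checking estimated RTO/RPO values against a policy target. The Objective class, the option names, and all the numbers are hypothetical and are not taken from the assessment shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """An RTO/RPO pair, expressed in seconds."""
    rto_seconds: int
    rpo_seconds: int


def meets_policy(estimate: Objective, target: Objective) -> bool:
    """An estimate is compliant only if both RTO and RPO are within the target."""
    return (estimate.rto_seconds <= target.rto_seconds
            and estimate.rpo_seconds <= target.rpo_seconds)


# Hypothetical AZ-disruption target from a resilience policy.
az_target = Objective(rto_seconds=3600, rpo_seconds=900)

# Hypothetical estimates for the two database options.
options = {
    "option_1_minimal_change_and_cost": Objective(rto_seconds=1800, rpo_seconds=300),
    "option_2_best_az_rto_rpo": Objective(rto_seconds=120, rpo_seconds=0),
}

# Both options satisfy the target; they differ in how much headroom they leave.
compliant = {name: meets_policy(est, az_target) for name, est in options.items()}
```

With these numbers both options are compliant, which mirrors the situation above: the choice between them is an architectural trade-off, not a compliance question.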

Figure 3 – The resilience recommendations for the database component

In this example, we decided to implement Option 1. As you can see, even though the database component is already meeting the resilience policy for RTO and RPO, the new assessment still includes the recommendation to optimize further for AZ RTO/RPO, meaning you can improve the RTO/RPO even further if you wish.

Figure 4 – The resilience recommendations for the database component after implementing recommendations

Operational Recommendations

Alarms

Here we will take a look at two areas of customer responsibility when looking at the Alarms section of the service under operational recommendations.

  • Additional configuration required for recommended alarms
  • Alarm requirements not covered by the recommendation engine

Additional configuration required for recommended alarms

Some of the alarms that AWS Resilience Hub recommends require additional configuration. In this example, the recommended Amazon CloudWatch alarm requires CloudWatch Synthetics, which you can use to create canaries: configurable scripts that run on a schedule to monitor your endpoints and APIs. Details of the exact requirements can be found in the Prerequisites section, as shown in Figure 5.
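As an illustration of what an alarm on a canary can look like, the sketch below builds the arguments for the CloudWatch PutMetricAlarm API against the SuccessPercent metric that CloudWatch Synthetics publishes. The canary name, alarm name, threshold, and period are assumptions made for this example; the alarm that AWS Resilience Hub generates for your application may be defined differently.

```python
def canary_alarm_params(canary_name: str, threshold_percent: float = 90.0) -> dict:
    """Build keyword arguments for CloudWatch PutMetricAlarm on a canary.

    The alarm fires when the canary's average success rate over one
    5-minute period drops below threshold_percent.
    """
    return {
        "AlarmName": f"{canary_name}-success-percent-low",  # hypothetical naming scheme
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        # Assumption: a canary that stops reporting is treated as failing.
        "TreatMissingData": "breaching",
    }


# "my-endpoint-canary" is a hypothetical canary that must already exist.
params = canary_alarm_params("my-endpoint-canary")

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

The point of the sketch is the dependency: the alarm is only meaningful once the canary exists and is publishing metrics, which is the additional configuration that remains your responsibility.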

Alarm requirements not covered by the recommendation engine

When you run a resilience assessment, AWS Resilience Hub recommends setting up Amazon CloudWatch alarms to monitor your application resilience. These alarms are not exhaustive, and you should always perform a full review of your application monitoring needs to make sure you have full monitoring coverage of your application. You can use the AWS Well-Architected Framework as a guide to best practices, in particular REL 6: How do you monitor workload resources?

Figure 5 – The prerequisites for the Alarm

Standard Operating Procedures (SOPs)

An SOP is a prescriptive set of steps designed to efficiently recover your application in the event of an outage or alarm. Based on your application components, AWS Resilience Hub recommends the SOPs you should prepare.

Because all applications have differing requirements, the default list of AWS Systems Manager (SSM) documents that AWS Resilience Hub provides will not be sufficient for all of your needs. You can, however, copy the default SSM documents and use them as a basis to create your own custom documents tailored for your application.

By adding the documents directly to your code base and making all changes there, you can ensure that the latest SOPs are deployed along with your infrastructure.

By connecting AWS Fault Injection Simulator (FIS) experiments to the SSM document, and running these in your CI/CD pipeline, you will know that your SOPs are being continually tested against your workload.
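One way to connect the two is the FIS action aws:ssm:start-automation-execution, which runs an SSM Automation document as a step of an experiment. The sketch below builds the arguments for the FIS CreateExperimentTemplate API. All ARNs, the action name, the document parameter, and the 15-minute duration are hypothetical placeholders, and this is an outline under those assumptions and not the template that AWS Resilience Hub generates.

```python
import json


def sop_experiment_template(document_arn: str, role_arn: str,
                            document_parameters: dict) -> dict:
    """Build arguments for FIS CreateExperimentTemplate that run an SOP document."""
    return {
        "description": "Run an SOP automation document as an FIS action",
        "roleArn": role_arn,
        # No stop condition in this sketch; a real experiment should use an alarm.
        "stopConditions": [{"source": "none"}],
        "actions": {
            "run-sop": {  # hypothetical action name
                "actionId": "aws:ssm:start-automation-execution",
                "parameters": {
                    "documentArn": document_arn,
                    # FIS expects the document parameters as a JSON string.
                    "documentParameters": json.dumps(document_parameters),
                    "maxDuration": "PT15M",  # ISO 8601 duration, assumed value
                },
            }
        },
    }


# Hypothetical ARNs and parameters, for illustration only.
template = sop_experiment_template(
    document_arn="arn:aws:ssm:us-east-1:111122223333:document/my-custom-sop",
    role_arn="arn:aws:iam::111122223333:role/my-fis-role",
    document_parameters={"DbInstanceIdentifier": "my-database-1"},
)

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("fis").create_experiment_template(**template)
```

A pipeline stage could then start the experiment and fail the build if it does not complete, so that a broken SOP is caught before it is needed in an incident.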

Your SOPs should be reviewed as part of your Operational Readiness Reviews (ORR) to make sure that the latest procedures are in place for your application needs. Review the ORR whitepaper for a more detailed overview of what that entails. You can also use the ORR custom lens with the AWS Well-Architected Tool, as described in the ORR custom lens blog. You can read more on how this fits with the Well-Architected Framework under the Operational Excellence pillar in OPS07-BP02 Ensure a consistent review of operational readiness.

Fault Injection Simulator (FIS) Experiments

Here we will take a look at three areas of customer responsibility when looking at the FIS section of the service under operational recommendations.

  • Additional configuration required for recommended FIS experiments
  • FIS requirements not covered by the recommendation engine
  • Dependent system coverage

Additional configuration required for recommended FIS experiments

AWS Fault Injection Simulator (FIS) is a fully managed service for running fault injection experiments to improve an application’s performance, observability, and resilience. You can run fault injection experiments to measure the resilience of your AWS resources and the amount of time it takes to recover from application, infrastructure, Availability Zone, and AWS Region impairments. To measure resilience, these fault injection experiments simulate interruptions to your AWS resources. Examples of interruptions include network unavailable errors, failovers, stopped processes on EC2/ASG, boot recovery in Amazon RDS, and problems with your Availability Zone. When the fault injection experiment concludes, you can determine whether an application can recover from the interruption types defined in the RTO target of the resilience policy.

Some of the FIS experiments that AWS Resilience Hub recommends require additional configuration. In the example below, the FIS experiment requires an existing CloudWatch alarm. Details of the exact requirements can be found in the notice given by AWS Resilience Hub.
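In an FIS experiment template, an existing CloudWatch alarm is typically referenced as a stop condition, so that the experiment halts if the alarm goes into the ALARM state. A minimal sketch of building that entry follows; the helper name, the ARN check, and the alarm ARN are assumptions made for illustration.

```python
def alarm_stop_condition(alarm_arn: str) -> dict:
    """Build an FIS stop condition that halts the experiment when the alarm fires."""
    # Lightweight sanity check; it does not confirm that the alarm exists.
    if not alarm_arn.startswith("arn:") or ":alarm:" not in alarm_arn:
        raise ValueError(f"not a CloudWatch alarm ARN: {alarm_arn!r}")
    return {"source": "aws:cloudwatch:alarm", "value": alarm_arn}


# Hypothetical alarm ARN; the alarm must be created before the experiment runs.
stop_conditions = [
    alarm_stop_condition(
        "arn:aws:cloudwatch:us-east-1:111122223333:alarm:my-app-availability"
    )
]
```

The resulting list can be supplied as the stopConditions of an experiment template. Creating and maintaining the alarm itself remains a customer task.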

Figure 6 – Additional FIS configuration

FIS requirements not covered by the recommendation engine

The experiments listed by AWS Resilience Hub are not exhaustive. You will need to assess the available experiments against your workload requirements. In this example, we have experiment recommendations for S3, ASG, and RDS, and a network experiment against the load balancer. There may be other experiments you want to perform. For example, you may wish to see how your application deals with an Amazon EBS I/O pause.
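A simple way to keep track of this gap is to compare the scenarios your workload needs against the ones already recommended. The sketch below uses hypothetical scenario labels; the only service identifier in it is aws:ebs:pause-volume-io, the FIS action ID for pausing EBS volume I/O.

```python
# Scenarios covered by the recommended experiments (labels are hypothetical).
recommended = {"s3", "asg", "rds", "load-balancer-network"}

# Scenarios the workload owner has decided must be tested (hypothetical).
required = {"s3", "asg", "rds", "load-balancer-network", "ebs-io-pause"}

# Anything required but not recommended has to be built by the customer.
gaps = sorted(required - recommended)

# Map each gap to an FIS action that could implement it, where one is known.
candidate_actions = {"ebs-io-pause": "aws:ebs:pause-volume-io"}
plan = {scenario: candidate_actions.get(scenario) for scenario in gaps}
```

Here plan contains a single entry for the EBS I/O pause scenario. A scenario mapped to None would signal that you still need to decide how to inject that fault.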

Figure 7 – FIS experiments

Dependent system coverage

Lastly, your workload may rely on other systems within your organization. You should agree with the teams that own those dependencies on which experiments are required for you to build resilient systems. AWS Resilience Hub can recommend experiments on the individual workloads involved, but the dependency aspects of your experiments will need to be properly assessed and implemented by the customer. A good example of this can be found in OPS04 of the Well-Architected Operational Excellence pillar.

Operational Integration

Beyond the considerations discussed above, there are a number of others that you must take into account.

Additional Operational Requirements

As discussed in previous sections, the AWS Resilience Hub operational recommendations are not exhaustive. If additional Alarms, SOPs and FIS experiments are required, it is your responsibility as the customer to create and maintain these outside of AWS Resilience Hub.

Use the templates to integrate into the workload, not standalone

The operational recommendations from AWS Resilience Hub should be integrated into your application. Your teams can replace hardcoded resource references in the templates with dynamic ones. The AWS Resilience Hub documentation has CloudFormation examples outlining how this can be done: Integrating operational recommendations into your application with AWS CloudFormation
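As a rough illustration of the idea, the sketch below takes a CloudFormation template in its JSON (dictionary) form and swaps a hardcoded value for a Ref to a template parameter. The parameterize helper, the resource name, and the identifier are hypothetical, and the alarm resource is trimmed to the one property that matters here. The documentation linked above describes the supported approach.

```python
import copy


def parameterize(template: dict, hardcoded_value: str, parameter_name: str,
                 parameter_type: str = "String") -> dict:
    """Return a copy of template with exact matches of hardcoded_value
    under Resources replaced by {"Ref": parameter_name}."""
    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        # Only whole-value matches are replaced, not substrings.
        return {"Ref": parameter_name} if node == hardcoded_value else node

    result = copy.deepcopy(template)
    result["Resources"] = walk(result.get("Resources", {}))
    result.setdefault("Parameters", {})[parameter_name] = {"Type": parameter_type}
    return result


# Trimmed, hypothetical alarm resource with a hardcoded database identifier.
template = {
    "Resources": {
        "DbCpuAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "Dimensions": [
                    {"Name": "DBInstanceIdentifier", "Value": "my-database-1"}
                ]
            },
        }
    }
}

dynamic = parameterize(template, "my-database-1", "DatabaseIdentifier")
```

After this step the identifier is supplied when the stack is deployed, so the same alarm definition can follow the workload across environments. If the database is defined in the same template, a Ref to that resource would remove the need for a parameter altogether.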

Use templates as a start for standardized strategy to resilience

If you are using AWS Resilience Hub at the start of your resilience journey, it’s important to standardize practices. Alarms, SOPs and FIS Experiments should form a part of your overall resilience strategy within both individual teams and the wider organization. Including AWS Resilience Hub in your existing or developing ORR can help you define strategies for resilience.

Continually check for new recommendations

New recommendations, both for resilience and operations, are added to AWS Resilience Hub periodically as the service grows and adds support for additional AWS services. It is your responsibility as the customer to continually assess and review the requirements of your workloads as part of periodic ORRs. This includes additional AWS Resilience Hub functionality. Tracking the AWS Resilience Hub application resilience score can also inform you whether your Alarms, SOPs and FIS Experiments are being periodically tested, as detailed in this blog post, How to use the AWS Resilience Hub score.
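If you record the resilience score after each assessment, a small check like the one below can flag drops between consecutive assessments for discussion in your ORR. The history values and the score_regressions helper are hypothetical, and how you collect the scores is left open.

```python
def score_regressions(history, tolerance=0.0):
    """Return (previous_label, current_label, change) for every drop in score.

    history is a chronological list of (label, score) pairs. A drop is
    reported only when it is larger than tolerance.
    """
    return [
        (prev_label, cur_label, cur_score - prev_score)
        for (prev_label, prev_score), (cur_label, cur_score)
        in zip(history, history[1:])
        if cur_score < prev_score - tolerance
    ]


# Hypothetical scores recorded after three monthly assessments.
history = [("2024-01", 62), ("2024-02", 70), ("2024-03", 65)]

regressions = score_regressions(history)
```

For this history the only regression is the five-point drop between the second and third assessments. A falling score is a prompt to check, for example, whether newly recommended Alarms, SOPs or FIS Experiments have not yet been implemented or tested.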

Conclusion

AWS Resilience Hub enables customers at different stages of their resilience journey to define, track and manage the resilience of their workloads and applications. Customers need to define what requirements and resources must be implemented in addition to the AWS Resilience Hub recommendations, both to meet the expectations of their organization and workload and to cover their responsibilities for resilience in the cloud.

Jamie Ibbs

Jamie Ibbs is a Specialist Technical Account Manager with AWS, where he helps customers to operate at scale, with a particular interest in management, governance, and resiliency.