Best Practices from Eviden for Leveraging AWS Resilience Hub to Protect Applications

By Justin Cook, Lead Cloud Architect – Eviden, an Atos Business
By Hakan Korkmaz, Sr. Partner Solutions Architect – AWS

Resiliency in the cloud is the ability to maintain availability and recover from software and operational disruption in a designated time frame. Sounds simple, right? It’s not that easy. This is why AWS Resilience Hub was created.

AWS Resilience Hub offers resiliency assessment and validation that integrate into your software development lifecycle (SDLC) to uncover weaknesses, helping to estimate whether or not the recovery time objective (RTO) and recovery point objective (RPO) for your applications can be met.

After you deploy an Amazon Web Services (AWS) application into production, you can use AWS Resilience Hub to continue tracking the resiliency posture of your application.

Following best practices from the AWS Well-Architected Framework, AWS Resilience Hub provides visibility into applications deployed by AWS CloudFormation, including cross-region and cross-account stacks. It also supports defining an application through AWS Service Catalog AppRegistry, resource groups, or Terraform state files, as well as applications that use Amazon Elastic Kubernetes Service (Amazon EKS).

Eviden, an Atos business, is an AWS Premier Tier Services Partner and AWS Marketplace Seller that’s a leading independent multi-cloud services company specializing in AWS cloud architecture, security, and resiliency, as well as being a global leader in data-driven, trusted, and sustainable digital transformation.

In this post, we will highlight the benefits of AWS Resilience Hub and cover the steps involved in setting up the service.

Solution Overview

AWS Resilience Hub helps improve the resiliency of your AWS applications and reduce the recovery time in the event of outages.

With Resilience Hub, you can describe your AWS applications, create effective resiliency policies, and manage assessments that indicate the resiliency of your applications. You can also manage alarms, standard operating procedures (SOPs), and AWS Fault Injection Simulator (AWS FIS) experiments to test the estimated recovery times.

Figure 1 – General AWS Resilience Hub overview.

AWS Resilience Hub gives you a central place to define, validate, and track the resiliency of your applications. It helps optimize business continuity in addition to meeting compliance and regulatory requirements by evaluating your infrastructure and suggesting recommendations to improve resiliency. It also provides code recommendations for implementing tests, alarms, and SOPs you can deploy and run in your continuous integration and continuous delivery (CI/CD) pipeline.

AWS users can also test RTO and RPO targets under different conditions using AWS Fault Injection Simulator to find issues before they occur in production.

Prerequisites

For this walkthrough, you should have the following prerequisites:

AWS account
AWS resources
Any third-party software or hardware
Any specialized knowledge

Augment Your Insights with AWS Resilience Hub

AWS Resilience Hub provides a comprehensive view of your overall application portfolio’s resilience status through its dashboard. It aggregates and organizes resilience events, alerts, and insights from services like Amazon CloudWatch, Amazon Route 53 Application Recovery Controller, and AWS FIS.

It also generates a resilience score, which is a scale that indicates the level of implementation for recommended configuration improvements, resilience tests, alarms, and recovery SOPs. This score can be used to measure resilience improvements over time.

Note that after you deploy an application into production, you can add Resilience Hub to your CI/CD pipeline to validate every build before it’s released into production. Take a look at AWS Resilience Hub Tools on GitHub for examples of this.
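
As a rough illustration of such a pipeline check, the sketch below decides whether a build should proceed based on an assessment result. It assumes the result has the shape returned by the boto3 `resiliencehub` client's `describe_app_assessment` call. The function names `assessment_passes` and `gate_exit_code` are hypothetical, and the examples in AWS Resilience Hub Tools may take a different approach.

```python
"""Pipeline gate sketch: stop the release when the assessment breaches policy."""


def assessment_passes(describe_response: dict) -> bool:
    """Return True only for a finished assessment whose policy is met.

    Assumption: `describe_response` looks like the output of the boto3
    `resiliencehub` client's `describe_app_assessment` call.
    """
    assessment = describe_response.get("assessment", {})
    finished = assessment.get("assessmentStatus") == "Success"
    policy_met = assessment.get("complianceStatus") == "PolicyMet"
    return finished and policy_met


def gate_exit_code(describe_response: dict) -> int:
    """Map the result to a process exit code that a CI job can act on."""
    return 0 if assessment_passes(describe_response) else 1


if __name__ == "__main__":
    sample = {
        "assessment": {
            "assessmentStatus": "Success",
            "complianceStatus": "PolicyBreached",
        }
    }
    print(gate_exit_code(sample))  # prints 1
```

Treating a missing or unfinished assessment as a failure keeps the gate conservative: a build is released only when there is positive evidence that the policy is met.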

AWS Resilience Hub concepts include the application and its components, compliance status (policy met, policy breached, not assessed, and changes detected), resiliency assessments, the resiliency score, and more. The Resilience Hub documentation provides information about these concepts.

Another crucial aspect of resiliency is gaining an understanding of disruption types and ensuring resilience against outages. This includes assessing factors like estimated workload RTO and RPO within your application and infrastructure, as well as preparing for systemic failures within AWS regions or Availability Zones (AZs).

Fault injection experiments let you inject a failure, verify that alarms can detect an outage, and work to recover the application from the outage. You can test different application configurations and measure whether the output RTO and RPO meet the objectives defined in your policy.

Walkthrough

An AWS Resilience Hub application is a collection of AWS resources structured to track workloads or a set of resources, and it provides guidance to improve resiliency. To describe a Resilience Hub application, provide an application name, resources from one or more (up to five) AWS CloudFormation stacks, and an appropriate resiliency policy. You can use any existing Resilience Hub application as a template to describe your application.

Next, publish your application so you can run a resiliency assessment on it, receiving recommendations from the assessment to improve resiliency. Then, repeat the process until your estimated workload RTO and RPO meet your goals.

Let’s break AWS Resilience Hub into the following actions:

Describe your applications using CloudFormation, including cross-region and cross-account stacks. Applications can also be defined using resource groups or chosen from applications that are already defined in the Service Catalog AppRegistry.
Define the resilience policies for your applications. These policies include RTO and RPO targets for applications, infrastructure, Availability Zone, and region disruptions.
Assess using best practices from the AWS Well-Architected Framework to analyze the components of an application and uncover potential resilience weaknesses. These can be caused by incomplete infrastructure setup or misconfigurations.
Recommend improvements to resiliency. The assessment generates code snippets that help you create recovery procedures for your applications as AWS Systems Manager documents, referred to as SOPs. Resilience Hub also generates a list of recommended CloudWatch monitors and alarms to help the operator identify any change to the application’s resilience posture once deployed.
Validate that your application can meet its resilience targets before releasing it into production, once the application and SOPs have been updated to incorporate recommendations from the assessment. Validation includes AWS FIS, a chaos engineering service that provides fault-injection simulations of real-world failures.
Resilience Hub also provides APIs so you can integrate its resilience assessment and testing into your CI/CD pipelines for ongoing resilience validation, ensuring changes to the application’s underlying infrastructure do not compromise resilience.
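
The describe, publish, and assess actions above could be scripted roughly as follows. This is a minimal sketch, not a reference implementation: it assumes `client` is a boto3 `resiliencehub` client (or a stand-in with the same methods), that a resiliency policy already exists, and that the response shapes match what boto3 returns. The function name `run_first_assessment` and the assessment naming scheme are hypothetical.

```python
import time


def run_first_assessment(client, app_name, policy_arn, stack_arns,
                         poll_seconds=5.0, sleep=time.sleep):
    """Describe, publish, and assess an application; return the assessment ARN.

    Assumption: `client` is boto3.client("resiliencehub") or an object that
    exposes the same methods and response shapes.
    """
    # Describe: create the application and attach an existing resiliency policy.
    app_arn = client.create_app(name=app_name, policyArn=policy_arn)["app"]["appArn"]

    # Import resources from the CloudFormation stacks into the draft version.
    client.import_resources_to_draft_app_version(
        appArn=app_arn, sourceArns=list(stack_arns))

    # The import runs asynchronously, so wait for it before publishing.
    while True:
        status = client.describe_draft_app_version_resources_import_status(
            appArn=app_arn)["status"]
        if status == "Success":
            break
        if status == "Failed":
            raise RuntimeError("resource import failed")
        sleep(poll_seconds)

    # Publish the draft, then assess the published version.
    version = client.publish_app_version(appArn=app_arn)["appVersion"]
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion=version,
        assessmentName=f"{app_name}-initial",
    )
    return started["assessment"]["assessmentArn"]
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.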

Note that AWS Trusted Advisor now inspects and provides a resilience score and indications of meeting or breaching an application’s resilience policy (RTO/RPO). With the resiliency checks from AWS Trusted Advisor, you can see which applications have risks and address them in Resilience Hub.

Remember to review the resiliency and operational recommendations for the application you published from the Review page. This page displays the application assessment overview, RTO and RPO summary, and disruption type details, as follows:

Overview: The overview section contains information such as application name, attached policy name, assessment Amazon Resource Name (ARN), and the assessment creation date.
RTO and RPO summaries: The summaries display the targeted time against estimated time assessed.
Details: The details section lists the disruption type, application component, and estimated RTO and RPO times tested against the attached policy configurations.
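
The target-versus-estimate comparison shown on this page can also be reproduced from API output. The sketch below assumes `policy` has the shape used by the boto3 `create_resiliency_policy` call and `compliance` has the shape of the `compliance` field returned by `describe_app_assessment`. The function name `rto_rpo_summary` and the row keys are hypothetical.

```python
def rto_rpo_summary(policy: dict, compliance: dict) -> list:
    """Pair each disruption type's targets with its estimated RTO and RPO.

    Assumptions: `policy` maps a disruption type to
    {"rtoInSecs": ..., "rpoInSecs": ...}, and `compliance` maps the same
    keys to {"currentRtoInSecs": ..., "currentRpoInSecs": ...}.
    """
    rows = []
    for disruption, target in policy.items():
        estimate = compliance.get(disruption)
        if estimate is None:
            continue  # this disruption type was not assessed
        rto_ok = estimate["currentRtoInSecs"] <= target["rtoInSecs"]
        rpo_ok = estimate["currentRpoInSecs"] <= target["rpoInSecs"]
        rows.append({
            "disruption": disruption,
            "target_rto_secs": target["rtoInSecs"],
            "estimated_rto_secs": estimate["currentRtoInSecs"],
            "target_rpo_secs": target["rpoInSecs"],
            "estimated_rpo_secs": estimate["currentRpoInSecs"],
            "meets_target": rto_ok and rpo_ok,
        })
    return rows
```

A row with `meets_target` set to `False` points at the disruption type to work on before the next assessment.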

Setting Up a Standard Operating Procedure

The SOP manages recovery procedures that are based on the outage type and application components in the application.

Note that Resilience Hub does not evaluate the following types of resources:

Resources that don’t affect RTO or RPO: Resources such as AWS::RDS::DBParameterGroup never affect RTO or RPO and are always ignored by Resilience Hub.
Non-top-level resources: Resilience Hub only imports top-level resources, because it can derive other properties by querying the properties of top-level resources and their subcomponents, such as an Amazon Relational Database Service (Amazon RDS) cluster and its instances.

See the documentation for supported AWS Resilience Hub resources.

Managing Alarms

When you run a resiliency assessment, Resilience Hub recommends setting up Amazon CloudWatch alarms to monitor your application resiliency. If the resources and components in your application change, you should run a resiliency assessment to ensure you have the correct alarms for your updated application.

To view recommended alarms:

From the resiliency assessment, select the alarms you’d like to set up for your application, and choose Create CloudFormation template.
Resilience Hub creates a CloudFormation template that contains the details needed to create the selected alarms in CloudWatch. Once the template is generated, you can access it through an Amazon Simple Storage Service (Amazon S3) URL. You can download it, make any updates, and then place it into your code pipeline or create a stack through the CloudFormation console.
RTO and RPO are metrics you should always monitor with alarms.
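
The same template can be requested programmatically. The sketch below only builds the keyword arguments for the boto3 `resiliencehub` client's `create_recommendation_template` call, so it can be checked without AWS credentials. The function name `alarm_template_request` is hypothetical, and the parameter names reflect my understanding of the API rather than anything stated in this post, so verify them against the current boto3 documentation.

```python
def alarm_template_request(assessment_arn, template_name,
                           recommendation_ids=None, bucket_name=None,
                           template_format="CfnYaml"):
    """Build kwargs for `create_recommendation_template` covering alarms.

    Assumption: the API accepts either specific recommendation IDs or a
    list of recommendation types; this sketch sends one or the other.
    """
    if template_format not in ("CfnYaml", "CfnJson"):
        raise ValueError("template_format must be 'CfnYaml' or 'CfnJson'")
    request = {
        "assessmentArn": assessment_arn,
        "name": template_name,
        "format": template_format,
    }
    if recommendation_ids:
        # Only the alarms selected from the assessment.
        request["recommendationIds"] = list(recommendation_ids)
    else:
        # No selection given: ask for every alarm recommendation.
        request["recommendationTypes"] = ["Alarm"]
    if bucket_name:
        # Optional S3 bucket for the generated template.
        request["bucketName"] = bucket_name
    return request
```

With a real client, the result would be passed as `client.create_recommendation_template(**request)`.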

Figure 2 – Customer application estimated RTO and RPO.

Viewing a Resilience Hub Application Summary

The application summary’s Details section shows a summary of the selections for the application:

Resiliency policy: Shows the name of the resiliency policy attached to your application.
Description: The description of the application.
Status: Indicates if the policy is active or inactive.
Creation time: The date and time the application was created.
Version: Indicates whether the application is released or in draft.
Scheduled assessment: Indicates whether the daily assessment is active or inactive.

Getting Started

Here are the recommended steps for getting started with AWS Resilience Hub:

Describing Your Application

When you describe the application, import resources through one of the resource collection methods (CloudFormation stacks, Terraform state files, resource groups, or AppRegistry) to form the structural basis of an application in Resilience Hub.

For applications using Amazon EKS, you have the option to define an application as an EKS cluster or to include an EKS cluster as part of a larger application in conjunction with a resource collection. Then, you attach a resiliency policy to the application.
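
To make the resource collection options concrete, the sketch below assembles the keyword arguments for the boto3 `resiliencehub` client's `import_resources_to_draft_app_version` call from any mix of sources. The function name `build_import_request` is hypothetical, and the field names (`sourceArns`, `terraformSources`, `eksSources`) reflect my understanding of the API, so check them against the boto3 documentation before relying on them.

```python
def build_import_request(app_arn, source_arns=(), terraform_state_urls=(),
                         eks_clusters=None):
    """Build kwargs for `import_resources_to_draft_app_version`.

    Assumptions: `source_arns` holds ARNs of CloudFormation stacks, resource
    groups, or AppRegistry applications; `terraform_state_urls` holds S3 URLs
    of Terraform state files; `eks_clusters` maps an EKS cluster ARN to the
    namespaces to include.
    """
    request = {"appArn": app_arn}
    if source_arns:
        request["sourceArns"] = list(source_arns)
    if terraform_state_urls:
        request["terraformSources"] = [
            {"s3StateFileUrl": url} for url in terraform_state_urls
        ]
    if eks_clusters:
        request["eksSources"] = [
            {"eksClusterArn": arn, "namespaces": list(namespaces)}
            for arn, namespaces in eks_clusters.items()
        ]
    if len(request) == 1:
        raise ValueError("at least one resource source is required")
    return request
```

An EKS-only application would pass just `eks_clusters`, while a larger application could combine a cluster with CloudFormation stacks in one request.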

A Resilience Hub resiliency policy contains the information and objectives that are used to assess whether your application can recover from a disruption type, such as software or hardware disruption. When you create a resiliency policy, you define RTO and RPO.
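
A policy's targets can be expressed as a small data structure. The sketch below converts targets given in minutes into the per-disruption structure that, as far as I know, the boto3 `create_resiliency_policy` call expects in its `policy` argument. The helper name `build_policy` and the example targets are hypothetical. The keys `Software`, `Hardware`, `AZ`, and `Region` are assumed to correspond to application, infrastructure, Availability Zone, and region disruptions.

```python
DISRUPTION_TYPES = ("Software", "Hardware", "AZ", "Region")


def build_policy(targets_minutes: dict) -> dict:
    """Convert {disruption: (rto_minutes, rpo_minutes)} to API-style targets."""
    policy = {}
    for disruption, (rto_minutes, rpo_minutes) in targets_minutes.items():
        if disruption not in DISRUPTION_TYPES:
            raise ValueError(f"unknown disruption type: {disruption}")
        if rto_minutes < 0 or rpo_minutes < 0:
            raise ValueError("RTO and RPO targets must not be negative")
        policy[disruption] = {
            "rtoInSecs": int(rto_minutes * 60),
            "rpoInSecs": int(rpo_minutes * 60),
        }
    return policy


if __name__ == "__main__":
    # Illustrative targets only; choose values that match your business needs.
    print(build_policy({"Software": (60, 15), "AZ": (120, 30)}))
```

The resulting dictionary would be passed as the `policy` argument, alongside a policy name and tier, when creating the resiliency policy.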

Running a Resiliency Assessment

After you describe your application and attach a resiliency policy to it, run a resiliency assessment. This evaluates your application configuration against the resiliency policy that’s attached to the application and generates a report. The report shows how your application measures against the objectives in your resiliency policy, and you’ll receive recommendations to improve resiliency.
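
An assessment takes time to complete, so a script usually has to wait for the report. The sketch below polls until the assessment finishes. It assumes `describe` behaves like the boto3 `resiliencehub` client's `describe_app_assessment` method and that the status values are `Pending`, `InProgress`, `Success`, and `Failed`. The function name `wait_for_assessment` and the default timings are hypothetical.

```python
import time


def wait_for_assessment(describe, assessment_arn, poll_seconds=15.0,
                        max_polls=40, sleep=time.sleep):
    """Poll until the assessment succeeds, then return its description.

    Assumption: `describe` is `client.describe_app_assessment` from a boto3
    `resiliencehub` client, or a callable with the same response shape.
    """
    for _ in range(max_polls):
        assessment = describe(assessmentArn=assessment_arn)["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            return assessment
        if status == "Failed":
            raise RuntimeError("assessment failed")
        sleep(poll_seconds)  # still pending or in progress
    raise TimeoutError("assessment did not finish in time")
```

Bounding the number of polls keeps a stalled assessment from blocking a pipeline indefinitely.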

Recommendations include configurations of components, alarms, tests, and recovery SOPs. Then, run another assessment and compare the results with the previous report to see how much resiliency improves. Repeat this process until your RTO and RPO workload estimates meet your goals.

Testing and Measuring Resiliency

Run tests to measure the resiliency of your AWS resources and the amount of time it takes to recover from application, infrastructure, Availability Zone, and region outages. To measure resiliency, these tests simulate outages of your AWS resources. Examples of outages include network unavailable errors, failovers, stopped processes, Amazon RDS boot recovery, and problems with your AZ.

When the test concludes, you can determine whether the application recovers from each outage type within the RTO defined in the resiliency policy.
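
The comparison at the end of a test is simple arithmetic. The sketch below assumes you have recorded two timestamps yourself, for example from the experiment log and from alarm history: when the fault was injected and when the application was healthy again. The function names and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def observed_recovery(fault_injected_at: datetime,
                      recovered_at: datetime) -> timedelta:
    """Return how long the application took to recover during the test."""
    if recovered_at < fault_injected_at:
        raise ValueError("recovery cannot precede the fault injection")
    return recovered_at - fault_injected_at


def meets_rto(observed: timedelta, rto_in_secs: int) -> bool:
    """Check the observed recovery time against the policy's RTO target."""
    return observed.total_seconds() <= rto_in_secs


if __name__ == "__main__":
    injected = datetime(2024, 1, 1, 10, 0, 0)
    recovered = datetime(2024, 1, 1, 10, 12, 30)
    elapsed = observed_recovery(injected, recovered)
    print(elapsed.total_seconds())  # prints 750.0
    print(meets_rto(elapsed, 900))  # prints True
```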

Tracking Resiliency Over Time

View and track your application resiliency over time. After you deploy an AWS application into production, you can use Resilience Hub to continue tracking the resiliency posture of the application.

Conclusion

Viewing how your applications are protected from disruptions in a central place that can define, validate, and track resilience is critical to reducing outages. With AWS Resilience Hub, users can evaluate resilience targets (RTO/RPO), identify and resolve issues before they occur in production, and optimize business continuity while reducing recovery costs.

Learn more about AWS Resilience Hub, and feel free to try it out in the AWS Lab. You can also learn more about Eviden in AWS Marketplace.

Eviden, an Atos Business – AWS Partner Spotlight

Eviden is an AWS Premier Tier Services Partner and leading independent multi-cloud services company specializing in AWS cloud architecture, security, and resiliency.

Contact Eviden | Partner Overview | AWS Marketplace

AWS Partner Network (APN) Blog