AWS Cloud Operations Blog

Implementing recommended experiments using the AWS Resilience Hub console

Amazon Web Services (AWS) is excited to introduce an enhanced integration between AWS Resilience Hub and AWS Fault Injection Service that simplifies creating and running chaos experiments. In this post, we focus on how to use this integration through the AWS Management Console, which offers a user-friendly, point-and-click approach. The console interface is ideal for those who prefer visual workflows or are new to chaos engineering with AWS services.

AWS Resilience Hub provides comprehensive resilience assessments and resilience scores for applications, supporting a wide range of AWS services across compute, storage, database, and networking categories. It offers custom recommendations to fortify applications against potential failures and optimizes recovery processes. AWS Fault Injection Service complements this by enabling the execution of various real-world failure scenarios in a controlled environment. Together, these services form a powerful duo for enhancing application resilience.

This integration streamlines the creation and execution of fault injection experiments tailored to address specific resilience challenges based on your application architecture. Through the AWS Management Console, you can access experiment recommendations, initiate tests, and view results with just a few clicks. This guided, low-friction approach to chaos engineering is beneficial for both newcomers and experienced users who prefer a graphical interface. The console provides an intuitive way to track your application’s resilience score over time and seamlessly create and initiate fault injection experiments.

Let’s explore how to leverage this new integration using the AWS Management Console.

Prerequisites

For this example, we use an architecture with multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. We assume you’re already familiar with onboarding your application to AWS Resilience Hub and running an initial resilience assessment. If you need a refresher on these steps, refer to the “Add an application” topic in the AWS Resilience Hub documentation.

As a quick recap, we have already:

  • Onboarded the application to AWS Resilience Hub
  • Run an initial resilience assessment
  • Received a resilience score and recommendations

AWS Resilience Hub presents the application’s resiliency score (Figure 1), which in the example is 40 out of 100. While this might seem low at first glance, it’s important to understand what this score represents. The resiliency score reflects the preparedness for potential disruptions and highlights specific actions we can take to improve.

A higher score doesn’t guarantee immunity from issues, but it indicates we’ve implemented more protective measures.

The current score may not be perfect, but that’s okay—we’re on a path of continuous improvement. By focusing on increasing this score, we’re actively fortifying the systems against potential disruptions.

The AWS Resilience Hub dashboard displays a comprehensive view of application resilience. The current Resiliency Score of 40 out of 100 is displayed. To the right, a line graph illustrates the Resiliency score trend over time, allowing users to track improvements or regressions. On the left side, an Action Items panel lists recommendations. This layout provides a clear, at-a-glance summary of the application's current resilience status and areas for improvement.

Figure 1: AWS Resilience Hub Resiliency score dashboard

As shown in Figure 1, AWS Resilience Hub has identified several opportunities to enhance the application’s resilience through a breakdown of the following Action items:

  • Alarms: These will help to monitor the application more effectively and respond quickly to potential issues.
  • Standard Operating Procedures (SOPs): These will guide the team in handling specific scenarios, confirming consistent and efficient responses.
  • AWS Fault Injection Service Experiments: These will allow proactive testing of the application’s resilience under various failure scenarios.

For those interested in implementing the recommended alarms and SOPs, we encourage you to check out the comprehensive Using AWS Resilience Hub documentation. These additional measures, combined with fault injection testing, will provide a holistic approach to enhancing your application’s resilience.

Exploration of the new integration

For this example, we’ll concentrate on the AWS Fault Injection Service experiments to demonstrate the new integration between AWS Resilience Hub and AWS Fault Injection Service. These experiments are particularly valuable because they allow us to:

  • Test the application’s response against real-world failure scenarios in a controlled environment
  • Identify potential weaknesses in the application before they become real problems
  • Validate the application’s ability to withstand and recover from various types of disruptions
  • Gain insights to inform further improvements in the application’s architecture, configuration, monitoring capabilities and alarm mechanisms

Let’s begin by reviewing the Operational recommendations section, focusing on the fault injection experiments. As you navigate to this area, you’ll notice new types of experiments recommended. AWS Resilience Hub has streamlined the process of creating and executing experiments, eliminating the need for AWS CloudFormation templates.

Now it suggests native AWS Fault Injection Service actions that can be implemented directly through the AWS Fault Injection Service console. To demonstrate this new capability, we’ll walk through the implementation of the first experiment recommendation: aws:ec2:stop-instances (Figure 2).

The Fault injection experiments dashboard is located within the Operational recommendations tab of the Assessment Report. The dashboard displays a list of recommended experiments, each with an Action name, state and description. Each experiment has a status of Not implemented.

Figure 2: Fault injection experiments dashboard before running fault injection experiment

The aws:ec2:stop-instances fault injection experiment evaluates how the application responds to the sudden loss of Amazon EC2 instances. This scenario mimics real-world disruptions, such as unexpected instance failures or scheduled maintenance events, that can impact the availability and performance of the application.

By running this experiment, we’ll be able to assess the application’s ability to gracefully handle the loss of compute resources. We’ll also be able to validate the auto-scaling and failover mechanisms, and identify any potential bottlenecks or single points of failure.

Upon selecting the aws:ec2:stop-instances Action name, we are directed to a detailed dashboard (Figure 3). Here, we select which of the available AppComponents to include in the experiment. In this example, all three Amazon EC2 instances are part of the same AppComponent. Once we’ve made the selection, we can click the Initiate experiment button directly from the dashboard.

The action experiment dashboard displays a list of AppComponents available for the fault injection experiment. Each instance is represented by a row with details such as name, state, resources and target selection. A prominent Initiate experiment button is visible, allowing users to launch the creation of the selected experiment directly from the dashboard.

Figure 3: Action experiment dashboard

Clicking the Initiate experiment button opens a new tab with an AWS Fault Injection Service experiment template input form. In Step 1, we specify the template details (Figure 4).

The AWS Fault Injection Service experiment template input form displays various configuration fields. Step 1 of the form includes description, name and experiment type. There is a Next button to continue to the next step.

Figure 4: Step 1 – AWS Fault Injection Service experiment template

Notice how in Step 2 the Actions and Targets have been automatically pre-populated based on the information from AWS Resilience Hub (Figure 5). This integration streamlines the workflow, saving time and reducing the potential for errors.

The AWS Fault Injection Service experiment configuration screen showcasing pre-filled actions and targets. The Actions section displays the aws:ec2:stop-instances action, while the Targets section lists the specific Amazon EC2 instances selected from AWS Resilience Hub. Each pre-populated field is clearly marked, demonstrating the seamless data transfer between AWS Resilience Hub and AWS Fault Injection Service. Options to edit or add additional Actions and Targets are visible, allowing for further customization if needed.

Figure 5: Pre-populated AWS Fault Injection Service Experiment Actions and Targets Configuration

Let’s examine the pre-filled Action information (Figure 6) to gain a clear understanding of the experiment’s parameters. An Action defines the specific disruption or fault to be introduced during an experiment. The Action section comes pre-populated with the following details:

  • Name: Identifying the specific action
  • Action type: Specifying what AWS Fault Injection Service will do (in this case, stopping Amazon EC2 instances)
  • Target: Indicating which resources will be affected

Additionally, the Start instances after duration is preset to five minutes. This setting determines how long the instances will remain stopped before automatically restarting. While this default duration is often suitable, you can adjust it based on your specific testing needs or application characteristics.

The AWS Fault Injection Service Action configuration dialog displays pre-filled information for an Amazon EC2 instance stop experiment. The dialog shows three main pre-filled sections: Name (identifying the specific action), Action type (set to aws:ec2:stop-instances), and Target (indicating the affected Amazon EC2 instances). Below these, a Start instances after duration field is pre-set to five minutes. All values are adjustable. The pre-population of these fields demonstrates the intelligent integration between AWS Resilience Hub and AWS Fault Injection Service, streamlining the experiment setup process while allowing for customization.

Figure 6: Pre-populated AWS Fault Injection Service Experiment Action Configuration
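For readers who later want to keep this configuration under version control, the pre-filled Action can be approximated as a plain data structure. The following is a minimal sketch, not an export from the console. The field names (actionId, parameters, targets) follow the AWS Fault Injection Service experiment template format, and PT5M is the ISO 8601 form of the five-minute duration. The function name and the target name are hypothetical.

```python
import json


def build_stop_instances_action(target_name, restart_after="PT5M"):
    """Return one experiment-template action that stops EC2 instances.

    restart_after is an ISO 8601 duration; PT5M mirrors the five-minute
    "Start instances after duration" value shown in the console.
    """
    return {
        "actionId": "aws:ec2:stop-instances",
        "description": "Stop the targeted Amazon EC2 instances",
        "parameters": {"startInstancesAfterDuration": restart_after},
        # "Instances" is the target key this action expects; the value is the
        # name of a target defined elsewhere in the same template.
        "targets": {"Instances": target_name},
    }


# "app-component-instances" is a hypothetical target name.
action = build_stop_instances_action("app-component-instances")
print(json.dumps(action, indent=2))
```

Changing restart_after to a value such as PT10M corresponds to editing the Start instances after duration field in the dialog.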

Next, let’s examine the pre-filled Target information (Figure 7). The Target section identifies which specific resources will be affected by the experiment. The Target section comes pre-populated with the following details:

  • Name: A unique identifier for this target
  • Resource type: Specifies the type of AWS resource (in this case, Amazon EC2 instances)
  • Target method: Indicates how the resources are selected
  • Resource IDs: Lists the specific resources to be targeted

Notably, the Resource IDs field is pre-selected with one of the resources defined in the AppComponents earlier. However, for a more comprehensive test, you have the flexibility to add all relevant resources to this experiment. This allows you to tailor the scope of your test to best suit an application’s architecture and your resilience testing goals.

It’s important to note that while AWS Resilience Hub automatically handles Target selection for many resources, manual selection is required for certain architectures. If the architecture is using Amazon Elastic Container Service or Amazon Elastic Kubernetes Service for compute, you will need to specify tags or parameters in the Target template dialog. This manual approach is necessary when AWS Fault Injection Service requires a tag instead of an ARN as a Target. This flexibility makes sure that your AWS Fault Injection Service experiments can accurately reflect your application’s structure, regardless of the underlying architecture.

The AWS Fault Injection Service Target configuration dialog displays pre-populated information for an Amazon EC2 instance experiment. The dialog shows four main sections: Name (a unique identifier for the target), Resource type (set to aws:ec2:instance), Target method (indicating how resources are selected), and Resource IDs (listing specific Amazon EC2 instances). The Resource IDs field shows one pre-selected Amazon EC2 instance from the previously defined AppComponents, with an option to add more resources. This pre-filled Target configuration demonstrates the seamless integration between AWS Resilience Hub and AWS Fault Injection Service, while offering flexibility to expand the experiment's scope as needed.

Figure 7: Pre-populated AWS Fault Injection Service Experiment Target Configuration
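The two ways of selecting Targets described above (explicit resources versus tags) can be illustrated with a short sketch. It assumes the experiment template fields resourceType, resourceArns, resourceTags, and selectionMode. The account ID, instance IDs, tag key, and function names are placeholders for illustration only, and the Amazon ECS example uses the aws:ecs:task resource type as one possible tag-based target.

```python
def target_by_arns(instance_arns):
    """Target explicit EC2 instances, like the pre-filled Resource IDs field."""
    arns = list(instance_arns)
    if not arns:
        raise ValueError("at least one instance ARN is required")
    return {
        "resourceType": "aws:ec2:instance",
        "resourceArns": arns,
        "selectionMode": "ALL",
    }


def target_by_tags(resource_type, tags, selection_mode="ALL"):
    """Target resources by tag, for cases where an ARN cannot be used."""
    if not tags:
        raise ValueError("at least one tag is required")
    return {
        "resourceType": resource_type,
        "resourceTags": dict(tags),
        "selectionMode": selection_mode,
    }


# Placeholder ARNs standing in for the three instances of the AppComponent.
instance_arns = [
    f"arn:aws:ec2:us-east-1:111122223333:instance/i-000000000000000{n}"
    for n in ("a", "b", "c")
]
ec2_target = target_by_arns(instance_arns)

# Hypothetical tag key and value for a container-based workload.
ecs_target = target_by_tags("aws:ecs:task", {"app": "example-app"})

print(len(ec2_target["resourceArns"]), ecs_target["resourceTags"])
```

Listing all three ARNs mirrors adding the remaining instances to the Resource IDs field for a more comprehensive test.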

With the Action and Target information pre-populated, the subsequent steps in the experiment creation process through AWS Fault Injection Service remain unchanged. You still need to specify stop conditions, configure reporting and logging options, and set up notifications as needed, aligning these settings with your specific testing objectives and organizational policies.
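To show how the remaining settings fit around the pre-populated Action and Target, here is a minimal sketch that assembles a complete template document with a stop condition. It assumes the template fields description, roleArn, actions, targets, and stopConditions, with an Amazon CloudWatch alarm as the stop condition source. All ARNs and names below are placeholders, and the consistency check is our own addition rather than something the console performs in this form.

```python
def build_experiment_template(description, role_arn, actions, targets, alarm_arn=None):
    """Combine actions, targets and a stop condition into one template dict."""
    # Each action must reference a target that is defined in this template.
    for action_name, action in actions.items():
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                raise ValueError(
                    f"action {action_name!r} references unknown target {target_name!r}"
                )

    if alarm_arn:
        # Stop the experiment when this CloudWatch alarm goes into ALARM state.
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        # "none" means the experiment has no automatic stop condition.
        stop_conditions = [{"source": "none"}]

    return {
        "description": description,
        "roleArn": role_arn,
        "actions": actions,
        "targets": targets,
        "stopConditions": stop_conditions,
    }


# Placeholder values for illustration only.
targets = {
    "app-instances": {
        "resourceType": "aws:ec2:instance",
        "resourceArns": ["arn:aws:ec2:us-east-1:111122223333:instance/i-000000000000000a"],
        "selectionMode": "ALL",
    }
}
actions = {
    "stop-instances": {
        "actionId": "aws:ec2:stop-instances",
        "parameters": {"startInstancesAfterDuration": "PT5M"},
        "targets": {"Instances": "app-instances"},
    }
}
template = build_experiment_template(
    "Stop EC2 instances of the example application",
    "arn:aws:iam::111122223333:role/example-fis-role",
    actions,
    targets,
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:example-alarm",
)
print(template["stopConditions"])
```

Passing an alarm ARN is generally the safer choice for anything beyond a test environment, because it gives the experiment an automatic brake.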

Once the experiment template is completed, the next steps in the resilience testing workflow also follow the established process. You would initiate the AWS Fault Injection Service experiment, setting in motion the controlled Amazon EC2 instance stoppage. As the experiment runs, you’d observe and evaluate the application’s resilience in real-time.
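If you prefer to start the experiment from a script instead of the console, the start-and-observe loop might look like the sketch below. It assumes the boto3 AWS Fault Injection Service client operations start_experiment and get_experiment and the experiment status values listed in the code. To keep the example runnable without AWS credentials, it uses a small stand-in client, and the template and experiment IDs are hypothetical.

```python
import time
import uuid

# Statuses after which the experiment will not change state again.
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id, poll_seconds=15.0, sleep=time.sleep):
    """Start an experiment from a template and wait for a terminal status."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for the request
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATUSES:
            return experiment_id, state["status"]
        sleep(poll_seconds)


class FakeFisClient:
    """Stand-in for boto3.client("fis") so the sketch runs offline."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def start_experiment(self, clientToken, experimentTemplateId):
        return {"experiment": {"id": "EXP-hypothetical", "state": {"status": "pending"}}}

    def get_experiment(self, id):
        return {"experiment": {"id": id, "state": {"status": next(self._statuses)}}}


fake = FakeFisClient(["initiating", "running", "running", "completed"])
experiment_id, final_status = run_experiment(fake, "EXT-hypothetical", sleep=lambda _: None)
print(experiment_id, final_status)
```

Against a real account, the same function would be called with boto3.client("fis") and the ID of the template created in the console.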

After the experiment has successfully completed, you’d return to AWS Resilience Hub and reassess the application. This reassessment helps to quantify improvements in the application’s resilience and identifies any remaining areas for enhancement. The ability to seamlessly modify and rerun these experiments, coupled with regular reassessments, enables ongoing refinement of your application’s resilience strategy.

This cycle of testing, observation, and reassessment continues to be the cornerstone of continuous resilience improvement.
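The reassessment step can also be scripted. The following is a sketch under stated assumptions: it relies on the boto3 AWS Resilience Hub client operations start_app_assessment and describe_app_assessment, on the assessmentStatus values Success and Failed marking the end of an assessment, and on "release" being the published application version. The stand-in client, the ARN, the assessment name, and the score value of 42.0 are placeholders, so the printed number says nothing about how the real API scales its score.

```python
import time


def reassess(hub_client, app_arn, assessment_name, poll_seconds=10.0, sleep=time.sleep):
    """Run a new assessment and return its final description."""
    started = hub_client.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # assumption: assess the published version
        assessmentName=assessment_name,
    )
    assessment_arn = started["assessment"]["assessmentArn"]
    while True:
        assessment = hub_client.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]
        if assessment["assessmentStatus"] in ("Success", "Failed"):
            return assessment
        sleep(poll_seconds)


class FakeHubClient:
    """Stand-in for boto3.client("resiliencehub") so the sketch runs offline."""

    def __init__(self, statuses, score):
        self._statuses = iter(statuses)
        self._score = score

    def start_app_assessment(self, appArn, appVersion, assessmentName):
        arn = appArn + "/assessment/hypothetical"
        return {"assessment": {"assessmentArn": arn, "assessmentStatus": "Pending"}}

    def describe_app_assessment(self, assessmentArn):
        return {
            "assessment": {
                "assessmentArn": assessmentArn,
                "assessmentStatus": next(self._statuses),
                "resiliencyScore": {"score": self._score},
            }
        }


fake_hub = FakeHubClient(["InProgress", "InProgress", "Success"], score=42.0)
result = reassess(
    fake_hub,
    "arn:aws:resiliencehub:us-east-1:111122223333:app/hypothetical",
    "post-experiment-assessment",
    sleep=lambda _: None,
)
print(result["assessmentStatus"], result["resiliencyScore"]["score"])
```

Running this after each experiment gives a scripted equivalent of returning to the console and choosing to reassess the application.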

Once we’ve reassessed the application after running the experiment, we can examine the updated resiliency score (Figure 8). This score quantifies the application’s resilience following the fault injection experiment. Comparing it to the previous score confirms that the application underwent the prescribed chaos engineering test.

An increase suggests improved resilience against the simulated failure scenario, boosting confidence in the application’s ability to handle similar real-world disruptions. Even an unchanged score provides valuable insights into the application’s current resilience level and may highlight areas for further improvement.

The goal is to gain a deeper understanding of the application’s behavior under failure modes and identify opportunities to enhance its overall resilience posture.

The AWS Resilience Hub dashboard displays an updated Resiliency score of 42 out of 100. To the right, a line graph illustrates the Resiliency score trend over time, allowing users to track improvements or regressions. On the left side, an Action Items panel lists recommendations and shows a decrease in the number of recommended fault injection experiments. This layout provides a clear, at-a-glance summary of the application's current resilience status and areas for improvement.

Figure 8: Updated AWS Resilience Hub Resiliency Score

The Operational recommendations tab within the newly completed assessment reveals a change: the aws:ec2:stop-instances fault injection experiment is now marked as Implemented (Figure 9). This visual confirmation serves as a tangible record of the progress and correlates with the increase in the resiliency score. The status change from Not implemented to Implemented provides clear evidence that the experiment has been successfully executed and incorporated into the resilience testing strategy.

The Operational Recommendations tab in AWS Resilience Hub displays a list of recommended Actions. The aws:ec2:stop-instances fault injection experiment item is highlighted, showing its status has changed to Implemented. A green checkmark icon accompanies the status, visually confirming the experiment's successful execution.

Figure 9: Fault injection experiments dashboard after running fault injection experiment

Conclusion

We’ve explored the enhanced integration between AWS Resilience Hub and AWS Fault Injection Service using the AWS Management Console. We demonstrated how this user-friendly, point-and-click approach facilitates proactive application resilience testing: assessing initial resilience, running targeted fault injection experiments, and validating improvements through reassessment.

By leveraging visual recommendations in AWS Resilience Hub and controlled testing environments through the console in AWS Fault Injection Service, you can identify and address potential weaknesses in your application architecture.

Remember, resilience is an ongoing process. Regular assessments and strategic experiments are important for maintaining resilient applications. As you apply these techniques using the intuitive console interface, you’ll be better equipped to build systems that can withstand unexpected challenges, improve your resilience score, and gain confidence in your application’s reliability, all without needing to write a single line of code.

About the authors

Jennifer Moran
Jennifer Moran is an AWS Senior Resilience Specialist Solutions Architect. She brings a wealth of experience from her diverse technical background, encompassing various roles across the software industry. Her expertise focuses on helping customers design resilient solutions to improve their overall resilience posture.

Hechmi Khelifi
Hechmi Khelifi is an Enterprise Solutions Architect at AWS, focusing on resilience and reliability. With 3+ years at AWS and a PhD from the University of Quebec, Hechmi leverages his extensive IT experience and strong academic background to help customers build robust and resilient solutions.