AWS Cloud Operations Blog
Implementing recommended experiments using the AWS Resilience Hub console
Amazon Web Services (AWS) is excited to introduce an enhanced integration between AWS Resilience Hub and AWS Fault Injection Service that simplifies creating and running chaos experiments. This post focuses on how to use this integration through the AWS Management Console, which offers a point-and-click approach. The console interface is ideal for those who prefer visual workflows or are new to chaos engineering with AWS services.
AWS Resilience Hub provides comprehensive resilience assessments and resilience scores for applications, supporting a wide range of AWS services across compute, storage, database, and networking categories. It offers custom recommendations to fortify applications against potential failures and optimize recovery processes. AWS Fault Injection Service complements this by enabling the execution of various real-world failure scenarios in a controlled environment. Together, these services form a powerful duo for enhancing application resilience.
This integration streamlines the creation and execution of fault injection experiments tailored to address specific resilience challenges based on your application architecture. Through the AWS Management Console, you can access experiment recommendations, initiate tests, and view results with just a few clicks. This guided, low-friction approach to chaos engineering is beneficial for both newcomers and experienced users who prefer a graphical interface. The console provides an intuitive way to track your application’s resilience score over time and seamlessly create and initiate fault injection experiments.
Let’s explore how to leverage this new integration using the AWS Management Console.
Prerequisites
In this example, we use an architecture with multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. We assume you’re already familiar with onboarding your application to AWS Resilience Hub and running an initial resilience assessment. If you need a refresher on these steps, refer to the AWS Resilience Hub documentation on adding an application.
As a quick recap, we have already:
- Onboarded the application to AWS Resilience Hub
- Run an initial resilience assessment
- Received a resilience score and recommendations
AWS Resilience Hub presents the application’s resiliency score (Figure 1), which in this example is 40 out of 100. While this might seem low at first glance, it’s important to understand what this score represents. The resiliency score reflects how prepared the application is for potential disruptions and highlights specific actions we can take to improve.
A higher score doesn’t guarantee immunity from issues, but it indicates we’ve implemented more protective measures.
The current score may not be perfect, but that’s okay: we’re on a path of continuous improvement. By focusing on increasing this score, we’re actively fortifying the systems against potential disruptions.
Figure 1: AWS Resilience Hub Resiliency score dashboard
As shown in Figure 1, AWS Resilience Hub has identified several opportunities to enhance the application’s resilience through a breakdown of the following Action items:
- Alarms: These will help to monitor the application more effectively and respond quickly to potential issues.
- Standard Operating Procedures (SOPs): These will guide the team in handling specific scenarios, helping to keep responses consistent and efficient.
- AWS Fault Injection Service Experiments: These will allow proactive testing of the application’s resilience under various failure scenarios.
For those interested in implementing the recommended alarms and SOPs, we encourage you to check out the comprehensive Using AWS Resilience Hub documentation. These additional measures, combined with fault injection testing, will provide a holistic approach to enhancing your application’s resilience.
Exploration of the new integration
For this example, we’ll concentrate on the AWS Fault Injection Service experiments to demonstrate the new integration between AWS Resilience Hub and AWS Fault Injection Service. These experiments are particularly valuable because they allow us to:
- Test the application’s response against real-world failure scenarios in a controlled environment
- Identify potential weaknesses in the application before they become real problems
- Validate the application’s ability to withstand and recover from various types of disruptions
- Gain insights to inform further improvements in the application’s architecture, configuration, monitoring capabilities and alarm mechanisms
Let’s begin by reviewing the Operational recommendations section, focusing on the fault injection experiments. As you navigate to this area, you’ll notice new types of experiments recommended. AWS Resilience Hub has streamlined the process of creating and executing experiments, eliminating the need for AWS CloudFormation templates.
Instead, AWS Resilience Hub now suggests native AWS Fault Injection Service actions that can be implemented directly through the AWS Fault Injection Service console. To demonstrate this new capability, we’ll walk through the implementation of the first experiment recommendation: aws:ec2:stop-instances (Figure 2).
Figure 2: Fault injection experiments dashboard before running fault injection experiment
The aws:ec2:stop-instances fault injection experiment evaluates how the application responds to the sudden loss of Amazon EC2 instances. This scenario mimics real-world disruptions, such as unexpected instance failures or scheduled maintenance events, that can impact the availability and performance of the application.
By running this experiment, we’ll be able to assess the application’s ability to gracefully handle the loss of compute resources. We’ll also be able to validate the auto-scaling and failover mechanisms, and identify any potential bottlenecks or single points of failure.
Upon selecting the aws:ec2:stop-instances Action Name, we are directed to a detailed dashboard (Figure 3). Here, we choose which of the available AppComponents to include in our experiment. In this example, all three Amazon EC2 instances are part of the same AppComponent. Once we’ve made the selection, we can click the Initiate experiment button directly from the dashboard.
Figure 3: Action experiment dashboard
Clicking the Initiate experiment button opens a new tab that presents an AWS Fault Injection Service experiment template input form. Step 1 is where we start specifying the template details (Figure 4).
Figure 4: Step 1 – AWS Fault Injection Service experiment template
Notice how in Step 2 the Actions and Targets have been automatically pre-populated based on the information from AWS Resilience Hub (Figure 5). This integration streamlines the workflow, saving time and reducing the potential for errors.
Figure 5: Pre-populated AWS Fault Injection Service Experiment Actions and Targets Configuration
Let’s examine the pre-filled Action information (Figure 6) to gain a clear understanding of the experiment’s parameters. An Action defines the specific disruption or fault to be introduced during an experiment. The Action section comes pre-populated with the following details:
- Name: Identifying the specific action
- Action type: Specifying what AWS Fault Injection Service will do (in this case, stopping Amazon EC2 instances)
- Target: Indicating which resources will be affected
Additionally, the Start instances after duration parameter is preset to five minutes. This setting determines how long the instances will remain stopped before automatically restarting. While this default duration is often suitable, you can adjust it based on your specific testing needs or application characteristics.
Figure 6: Pre-populated AWS Fault Injection Service Experiment Action Configuration
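For readers curious about what the console form represents, the pre-populated Action corresponds roughly to one entry in the actions map of an AWS Fault Injection Service experiment template. The following Python sketch builds such an entry as a plain dictionary, without calling any AWS API. The helper functions and the target name Instances-Target-1 are hypothetical and chosen only for illustration; the five-minute restart delay is written as the ISO 8601 duration PT5M.

```python
import json


def minutes_to_iso8601(minutes: int) -> str:
    """Format a whole number of minutes as an ISO 8601 duration (5 -> 'PT5M')."""
    if minutes <= 0:
        raise ValueError("duration must be a positive number of minutes")
    return f"PT{minutes}M"


def stop_instances_action(target_name: str, restart_after_minutes: int = 5) -> dict:
    """Build one entry for the 'actions' map of an experiment template.

    target_name is a hypothetical key that must match an entry in the
    template's 'targets' map.
    """
    return {
        # Action type: what AWS Fault Injection Service will do.
        "actionId": "aws:ec2:stop-instances",
        # How long the instances stay stopped before they are restarted.
        "parameters": {
            "startInstancesAfterDuration": minutes_to_iso8601(restart_after_minutes)
        },
        # Target: which named target this action applies to.
        "targets": {"Instances": target_name},
    }


action = stop_instances_action("Instances-Target-1")
print(json.dumps(action, indent=2))
```

Passing a different value for restart_after_minutes mirrors adjusting the Start instances after duration field in the console.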
Next, let’s examine the pre-filled Target information (Figure 7). The Target section identifies which specific resources will be affected by the experiment. The Target section comes pre-populated with the following details:
- Name: A unique identifier for this target
- Resource type: Specifies the type of AWS resource (in this case, Amazon EC2 instances)
- Target method: Indicates how the resources are selected
- Resource IDs: Lists the specific resources to be targeted
Notably, the Resource IDs field is pre-selected with one of the resources defined in the AppComponents earlier. However, for a more comprehensive test, you have the flexibility to add all relevant resources to this experiment. This allows you to tailor the scope of your test to best suit your application’s architecture and your resilience testing goals.
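To illustrate what widening the scope looks like outside the console, the Target corresponds roughly to one entry in the targets map of an experiment template. The sketch below builds such an entry for three Amazon EC2 instances. The Region, account ID, and instance IDs are placeholders, and the helper names are hypothetical; substitute the values of your own application.

```python
def ec2_instance_arn(region: str, account_id: str, instance_id: str) -> str:
    """Compose the ARN of an Amazon EC2 instance from its parts."""
    return f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}"


def ec2_target(instance_arns: list[str]) -> dict:
    """Build one entry for the 'targets' map, selecting instances by ARN."""
    if not instance_arns:
        raise ValueError("at least one instance ARN is required")
    return {
        "resourceType": "aws:ec2:instance",
        "resourceArns": list(instance_arns),
        # ALL runs the action on every listed resource.
        "selectionMode": "ALL",
    }


# Placeholder identifiers, for illustration only.
instance_ids = ["i-0123456789abcdef0", "i-0123456789abcdef1", "i-0123456789abcdef2"]
arns = [ec2_instance_arn("us-east-1", "111122223333", i) for i in instance_ids]
target = ec2_target(arns)
```

Listing only one ARN reproduces the console default described above, where a single resource is pre-selected.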
It’s important to note that while AWS Resilience Hub automatically handles Target selection for many resources, manual selection is required for certain architectures. If the architecture uses Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS) for compute, you will need to specify tags or parameters in the Target template dialog. This manual approach is necessary when AWS Fault Injection Service requires a tag instead of an ARN as a Target. This flexibility helps your AWS Fault Injection Service experiments accurately reflect your application’s structure, regardless of the underlying architecture.
Figure 7: Pre-populated AWS Fault Injection Service Experiment Target Configuration
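For the tag-based case, a target entry carries resource tags in place of a list of ARNs. The following sketch shows the general shape, assuming an Amazon ECS task target selected by a single tag. The tag key and value (app, my-app) and the helper name are hypothetical, and the exact fields required depend on the resource type, so treat this as an outline and not a complete configuration.

```python
def tagged_target(resource_type: str, tags: dict[str, str],
                  selection_mode: str = "ALL") -> dict:
    """Build a 'targets' entry that selects resources by tag, not by ARN."""
    if not tags:
        raise ValueError("at least one tag is required for tag-based selection")
    return {
        "resourceType": resource_type,
        # Resources carrying these tags are selected when the experiment starts.
        "resourceTags": dict(tags),
        "selectionMode": selection_mode,
    }


# Hypothetical tag identifying the tasks of one application.
ecs_target = tagged_target("aws:ecs:task", {"app": "my-app"})
```

Because tags are resolved when the experiment starts, this form can follow resources that are replaced between runs, which is a common reason to prefer it for containerized workloads.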
With the Action and Target information pre-populated, the subsequent steps in the experiment creation process through AWS Fault Injection Service remain unchanged. You still need to specify stop conditions, configure reporting and logging options, and set up notifications as needed, aligning these settings with your specific testing objectives and organizational policies.
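As a rough picture of how these pieces fit together, the sketch below assembles actions, targets, and stop conditions into a single template dictionary and checks that every action refers to a defined target. The function name, role ARN, alarm ARN, and instance ARN are placeholders invented for this example; logging, reporting, and notification settings are omitted.

```python
import json


def build_template(description: str, role_arn: str, targets: dict,
                   actions: dict, alarm_arns: tuple = ()) -> dict:
    """Assemble an experiment template as a plain dictionary."""
    # Every target named by an action must exist in the targets map.
    for action_name, action in actions.items():
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                raise ValueError(
                    f"action {action_name!r} references unknown target {target_name!r}"
                )
    # A stop condition halts the experiment when a CloudWatch alarm fires.
    stop_conditions = [
        {"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns
    ]
    if not stop_conditions:
        # Explicitly record that no stop condition is configured.
        stop_conditions = [{"source": "none"}]
    return {
        "description": description,
        "roleArn": role_arn,
        "targets": targets,
        "actions": actions,
        "stopConditions": stop_conditions,
    }


# Placeholder values, for illustration only.
targets = {
    "Instances-Target-1": {
        "resourceType": "aws:ec2:instance",
        "resourceArns": ["arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0"],
        "selectionMode": "ALL",
    }
}
actions = {
    "StopInstances": {
        "actionId": "aws:ec2:stop-instances",
        "parameters": {"startInstancesAfterDuration": "PT5M"},
        "targets": {"Instances": "Instances-Target-1"},
    }
}
template = build_template(
    description="Stop EC2 instances recommended by AWS Resilience Hub",
    role_arn="arn:aws:iam::111122223333:role/ExampleFisRole",
    targets=targets,
    actions=actions,
    alarm_arns=("arn:aws:cloudwatch:us-east-1:111122223333:alarm:ExampleAlarm",),
)
print(json.dumps(template, indent=2))
```

Tying the stop condition to an alarm that tracks customer-facing health gives the experiment an automatic brake, which is generally preferable to running with no stop condition.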
Once the experiment template is completed, the next steps in the resilience testing workflow also follow the established process. You would initiate the AWS Fault Injection Service experiment, setting in motion the controlled Amazon EC2 instance stoppage. As the experiment runs, you’d observe and evaluate the application’s resilience in real time.
After the experiment has successfully completed, you’d return to AWS Resilience Hub and reassess the application. This reassessment helps to quantify improvements in the application’s resilience and identifies any remaining areas for enhancement. The ability to seamlessly modify and rerun these experiments, coupled with regular reassessments, enables ongoing refinement of your application’s resilience strategy.
This cycle of testing, observation, and reassessment continues to be the cornerstone of continuous resilience improvement.
Once we’ve reassessed the application after running the experiment, we can examine the updated resiliency score (Figure 8). This score quantifies the application’s resilience following the fault injection experiment. Comparing it to the previous score confirms that the application underwent the prescribed chaos engineering test.
An increase suggests improved resilience against the simulated failure scenario, boosting confidence in the application’s ability to handle similar real-world disruptions. Even an unchanged score provides valuable insights into the application’s current resilience level and may highlight areas for further improvement.
The goal is to gain a deeper understanding of the application’s behavior under failure modes and identify opportunities to enhance its overall resilience posture.
Figure 8: Updated AWS Resilience Hub Resiliency Score
The Operational recommendations tab within the newly completed assessment reveals a change: the aws:ec2:stop-instances fault injection experiment is now marked as Implemented (Figure 9). This visual confirmation serves as a tangible record of the progress and correlates with the increase in the resiliency score. The status change from Not implemented to Implemented provides clear evidence that the experiment has been successfully executed and incorporated into the resilience testing strategy.
Figure 9: Fault injection experiments dashboard after running fault injection experiment
Conclusion
We’ve explored the enhanced integration between AWS Resilience Hub and AWS Fault Injection Service using the AWS Management Console. We demonstrated how this point-and-click approach facilitates proactive application resilience testing: assessing initial resilience, implementing targeted fault injection experiments, and validating improvements.
By leveraging visual recommendations in AWS Resilience Hub and controlled testing environments through the console in AWS Fault Injection Service, you can identify and address potential weaknesses in your application architecture.
Remember, resilience is an ongoing process. Regular assessments and strategic experiments are important for maintaining resilient applications. As you apply these techniques using the intuitive console interface, you’ll be better equipped to build systems that can withstand unexpected challenges, improve your resilience score, and gain confidence in your application’s reliability, all without needing to write a single line of code.
Further Reading
- Implementing recommended experiments using AWS Resilience Hub APIs
- Resilience Lifecycle Framework
- AWS best practices
- Leverage AWS Resilience Lifecycle Framework to assess and improve the resilience of application using AWS Resilience Hub