AWS Cloud Operations Blog
Implementing recommended experiments using AWS Resilience Hub APIs
Amazon Web Services (AWS) is excited to introduce an enhanced integration between AWS Resilience Hub and AWS Fault Injection Service that streamlines the process of creating and running chaos experiments.
This post focuses on how to use this integration through the AWS Command Line Interface (AWS CLI), catering to users who prefer command-line tools for automation and scripting. The AWS CLI approach is particularly useful for DevOps teams looking to incorporate chaos engineering into their continuous integration and continuous delivery (CI/CD) pipelines, or for those who want to automate resilience testing at scale.
AWS Resilience Hub offers in-depth resilience assessments and scores for applications, covering a broad spectrum of AWS services in compute, storage, database, and networking domains. It provides tailored recommendations to strengthen applications against potential failures and enhances recovery strategies. AWS Fault Injection Service augments this capability by facilitating the execution of diverse, real-world failure scenarios in a controlled setting. The synergy between these services creates a framework for improving application resilience.
This integration streamlines the creation and execution of fault injection experiments tailored to address specific resilience challenges based on your application architecture. Through the AWS CLI, you can programmatically access experiment recommendations, initiate tests, and retrieve results. This approach to chaos engineering is beneficial for both scripting enthusiasts and experienced users who want to automate their resilience testing processes. The AWS CLI provides powerful commands to track your application’s resilience score over time and efficiently create and initiate fault injection experiments, all from the command line.
Overview
The API integration between AWS Resilience Hub and AWS Fault Injection Service has been enhanced to streamline the process of creating recommended fault injection experiments. Previously, the workflow did not utilize the AWS Fault Injection Service API directly. This indirect approach added extra steps to the process of implementing fault injection experiment recommendations.
The following flow outlines the standard steps for assessing and creating fault injection experiments, with Steps 2 and 3 highlighting the new, enhanced API integration:
- Initial Resilience Assessment: Utilize AWS Resilience Hub to evaluate your application’s resilience. This can be done through the AWS Command Line Interface or the AWS Management Console. The assessment generates recommendations across alarms, Standard Operating Procedures (SOPs), and AWS Fault Injection Service experiments. We will focus on the Fault Injection experiments. For those interested in implementing the recommended alarms and SOPs, we encourage you to check out the comprehensive Using AWS Resilience Hub documentation.
- NEW Enhanced API Integration – Retrieve Experiment Recommendations: Leverage the new API integration to fetch AWS Fault Injection Service experiment recommendations produced by AWS Resilience Hub using the ListTestRecommendations API. (This step eliminates the need for the two-step process that previously existed.)
- NEW Enhanced API Integration – Experiment Template Creation: With the retrieved recommendations, the API integration facilitates the creation of tailored AWS Fault Injection Service experiment templates.
- Execute and Monitor Experiments: Execute the AWS Fault Injection Service experiment using the created templates. Closely monitor the experiments to gather data on your application’s behavior under the experiment’s failure mode.
- Reassessment and Score Update: After implementing and executing the fault injection experiment, reassess your application using AWS Resilience Hub. This step confirms the effectiveness of the implemented resilience measures and provides an updated resilience score for your application.
This flow emphasizes how the new enhanced API integration streamlines the process. It translates AWS Resilience Hub fault injection experiment recommendations into actionable AWS Fault Injection Service experiments, while the overall process of assessment, testing, and reassessment remains consistent with standard practices.
AWS CLI commands
Let’s walk through the flow, illustrating each of these steps using AWS CLI commands. The examples use an architecture with multiple Amazon Elastic Compute Cloud (Amazon EC2) instances.
Step 1 – Initial Resilience Assessment
For the purposes of this example, we are going to assume you’re already familiar with the process of onboarding your application to AWS Resilience Hub and running an initial resilience assessment. If you need a refresher on these commands, please refer to the Using AWS Resilience Hub documentation.
Step 2 – NEW Enhanced API Integration – Retrieve Experiment Recommendations
After running the resilience assessment, we use the following command to retrieve the experiment recommendations generated by AWS Resilience Hub:
aws resiliencehub list-test-recommendations --assessment-arn <assessment_arn>
Note: <assessment_arn> is the ARN of the initial assessment generated in Step 1.
Following is a JSON example of the response you can expect from AWS Resilience Hub:
{
"testRecommendations": [
{
"appComponentName": "ComputeAppComponent-EC2Instance",
"description": "Runs the Amazon EC2 API action StopInstances on the target EC2 instances.",
"items": [
{
"alreadyImplemented": false,
"excluded": false,
"resourceId": "arn:aws:ec2:us-east-2:123456789012:instance/i-0704a3bd911e7139d",
"targetAccountId": "123456789012",
"targetRegion": "us-east-2"
}
],
"name": "aws:ec2:stop-instances",
"recommendationId": "f290b4b9-fed5-4786-8e08-e7c42aae2ef8",
"recommendationStatus": "NotImplemented",
"referenceId": "aws:ec2:stop-instances",
"risk": "Medium",
"type": "Hardware"
}
]
}
We will focus on two key elements, which we’ll use to construct the experiment template:
- resourceId: The ARN of the resource that AWS Resilience Hub recommends for testing.
- referenceId: This attribute indicates the AWS Fault Injection Service Action to be performed:
- For standard AWS Fault Injection Service actions (for example, aws:ec2:stop-instances), it directly corresponds to the action.
- For AWS Systems Manager SSM Document actions, it follows the format aws:ssm:send-command/<ssm_document_name>, where the document name is a public Amazon-owned document.
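Picking these two fields out of the response can be sketched in Python. This is a minimal sketch, assuming the CLI output has already been parsed into a dictionary (for example with json.loads); the function name pending_experiments is illustrative and not part of any AWS SDK.

```python
def pending_experiments(response):
    """Return (referenceId, resourceId) pairs for items that still need testing.

    `response` is the parsed JSON output of list-test-recommendations.
    Items flagged as excluded or alreadyImplemented are skipped.
    """
    pairs = []
    for recommendation in response.get("testRecommendations", []):
        reference_id = recommendation["referenceId"]
        for item in recommendation.get("items", []):
            if item.get("excluded") or item.get("alreadyImplemented"):
                continue
            pairs.append((reference_id, item["resourceId"]))
    return pairs


# Trimmed copy of the sample response shown above.
sample = {
    "testRecommendations": [
        {
            "items": [
                {
                    "alreadyImplemented": False,
                    "excluded": False,
                    "resourceId": "arn:aws:ec2:us-east-2:123456789012:instance/i-0704a3bd911e7139d",
                }
            ],
            "referenceId": "aws:ec2:stop-instances",
        }
    ]
}

for reference_id, resource_id in pending_experiments(sample):
    print(reference_id, resource_id)
```

Each pair carries what the next step needs: the referenceId for the action and the resourceId for the target.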
It’s important to note that AWS Resilience Hub uses the alreadyImplemented and latestDiscoveredExperiment attributes to track executed experiments and update the application’s resilience score accordingly. If you have already implemented AWS Fault Injection Service experiments before running an AWS Resilience Hub assessment, they are recognized and reported in the assessment results as already implemented. This tracking prevents redundant recommendations and gives a more accurate picture of the application’s current resilience status, because it accounts for pre-existing fault injection testing.
In Step 5, after we have run the experiment and reassessed the application, we will show the JSON structure with alreadyImplemented set to true and latestDiscoveredExperiment populated.
Step 3 – NEW Enhanced API Integration – Experiment Template Creation
When creating an AWS Fault Injection Service experiment template, you’ll use the recommendations from AWS Resilience Hub to format it in JSON. This can be accomplished using the create-experiment-template AWS CLI command:
aws fis create-experiment-template --cli-input-json file://experiment-template.json
If you prefer, you can also use the CreateExperimentTemplate API. Whichever option you choose, it’s advisable to keep more complex templates in a JSON file for better organization and readability.
The AWS Fault Injection Service experiment template JSON file (experiment-template.json in our case) must have the following structure:
{
"description": "string",
"targets": {},
"actions": {},
"stopConditions": [],
"roleArn": "arn:aws:iam::123456789012:role/AllowFISActions",
"experimentReportConfiguration":{},
"experimentOptions":{},
"tags": {}
}
Note: Verify that the role you are using (roleArn) has all the permissions required by the AWS Fault Injection Service action.
Let’s explore how to construct the actions and targets elements in the preceding experiment template based on the list-test-recommendations output.
Constructing Actions element
We begin by constructing the actions element. Each template must include at least one action. An action defines the specific disruption or fault to be introduced during an experiment. To define an action within the JSON structure, use the following format:
{
"actions": {
"<action_name>" : {
"actionId": "<action_id>",
"parameters": {
"<parameter_name>": "<parameter_value>"
},
"targets": {
"<action_resource_type>" : "<target_name>"
}
}
}
}
To define an action, we need to provide the following values:
- action_name: A name we specify for the action
- action_id: The action identifier, which is derived from the referenceId of the testRecommendations object. For all AWS Fault Injection Service actions except AWS Systems Manager SSM Document actions, use the referenceId directly as the action_id. For AWS Systems Manager SSM Document actions, the referenceId has the format aws:ssm:send-command/<ssm_document_name>, and the action_id is only the first part: aws:ssm:send-command.
- parameters: Depending on the action, this field may be optional or required. Consult the AWS Fault Injection Service Actions reference documentation for specific parameter requirements associated with different actions. For AWS Systems Manager SSM Document actions, the documentArn parameter is mandatory and can be constructed using the referenceId as shown in the following example:
"parameters": {
"documentArn": "arn:aws:ssm:<region>::document/<ssm_document_name>",
"documentParameters": "<document_parameters>",
"duration": "<duration>"
}
Note that some AWS Systems Manager SSM Document actions will require additional documentParameters, which you will need to provide.
- action_resource_type: This specifies the target resource for the AWS Fault Injection Service action and is predefined by AWS Fault Injection Service. Each action_id is associated with a specific action_resource_type. To find the corresponding resource type for an action_id, use the following command:
aws fis get-action --id "<action_id>" --query "keys(action.targets)" --output text
For example, the following command will return Instances:
aws fis get-action --id "aws:ec2:stop-instances" --query "keys(action.targets)" --output text
- target_name: A user-defined value that must correspond to the target’s definition (outlined in the next section).
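The referenceId handling described above can be sketched as a small helper. This is an illustrative sketch: the function name is hypothetical, and the documentArn follows the arn:aws:ssm:<region>::document/<ssm_document_name> pattern shown earlier. Values such as duration and documentParameters still have to be supplied by you.

```python
SSM_PREFIX = "aws:ssm:send-command/"


def action_from_reference_id(reference_id, region):
    """Split a referenceId into an actionId and the parameters it implies.

    Returns (action_id, parameters). For SSM document actions the
    parameters contain only documentArn; the caller adds the rest.
    """
    if reference_id.startswith(SSM_PREFIX):
        document_name = reference_id[len(SSM_PREFIX):]
        # The account field is left empty, matching the ARN pattern above.
        document_arn = f"arn:aws:ssm:{region}::document/{document_name}"
        return "aws:ssm:send-command", {"documentArn": document_arn}
    # Standard actions use the referenceId unchanged and imply no parameters.
    return reference_id, {}


print(action_from_reference_id("aws:ec2:stop-instances", "us-east-2"))
print(action_from_reference_id("aws:ssm:send-command/AWSFIS-Run-CPU-Stress", "us-east-2"))
```

The second call uses AWSFIS-Run-CPU-Stress purely as an example document name; substitute whatever name appears in your own recommendation.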
Constructing Targets Element
Now, we can focus on constructing the targets element. A target represents the AWS resources that AWS Fault Injection Service will act upon during the experiment. To define a target within the JSON structure, use the following format:
"<target_name>" : {
"resourceArns": [ "<resource_arn>" ],
"resourceType": "<target_resource_type>",
"parameters": { "<parameter-name>": "<parameter-value>" },
"resourceTags": {"<tag-key>": "<tag-value>"},
"selectionMode": "ALL"
}
To define a target, we need to provide the following values:
- target_name: This must match the target_name you specified in the preceding actions section.
- target_resource_type: This is the resource type you should use in your target definition. You can obtain the target_resource_type by using the following command, where you provide the action_id (derived from the testRecommendations referenceId):
aws fis get-action --id "<action_id>" --query "action.targets.*.resourceType" --output text
For example, the following command will return aws:ec2:instance:
aws fis get-action --id "aws:ec2:reboot-instances" --query "action.targets.*.resourceType" --output text
- A list of resources to target in our experiment: You can specify the resources you want to target in three ways, depending on the action type:
- resourceArns attribute: For all action types (except Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon ElastiCache, and Amazon Simple Storage Service (Amazon S3) actions), provide the resource ARNs in the resourceArns field.
- parameters attribute: For Amazon ECS and Amazon EKS actions, provide the required parameters in the parameters field.
- resourceTags attribute: For all action types (except Amazon ECS and Amazon EKS actions), you can provide tags in the resourceTags field. Note that for Amazon ElastiCache and Amazon S3 actions, using resourceTags is mandatory.
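If you run aws fis get-action without the --query option, both the action_resource_type and the target_resource_type can be read from the same response. The sketch below assumes the response has the shape implied by the two JMESPath queries above (action.targets.<key>.resourceType) and that the action defines a single target key. The function name and the trimmed sample response are illustrative.

```python
def target_info(get_action_response):
    """Return (action_resource_type, target_resource_type) for an action.

    Assumes the action defines one target key, as in the EC2 example.
    """
    targets = get_action_response["action"]["targets"]
    if len(targets) != 1:
        raise ValueError(f"expected one target key, found {sorted(targets)}")
    key, definition = next(iter(targets.items()))
    return key, definition["resourceType"]


# Trimmed, hand-written stand-in for a get-action response.
sample_action = {
    "action": {
        "id": "aws:ec2:stop-instances",
        "targets": {"Instances": {"resourceType": "aws:ec2:instance"}},
    }
}

print(target_info(sample_action))
```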
One of the recommended fault injection experiments from our list-test-recommendations command was the aws:ec2:stop-instances experiment. Let’s see how to construct the resources for this experiment type using the response we received.
For the aws:ec2:stop-instances action, we can directly use the resourceId returned in the testRecommendations response as the resourceArns value in our JSON. For this action, the parameters and resourceTags properties are not required.
Following is an example of a JSON file for an experiment template designed to run the aws:ec2:stop-instances action:
{
"description": "Experiment based on ARH recommendation",
"stopConditions": [{"source": "none"}],
"targets": {
"target_name": {
"resourceType": "aws:ec2:instance",
"resourceArns": ["arn:aws:ec2:us-east-2:123456789012:instance/i-0704a3bd911e7139d"],
"selectionMode": "ALL"
}
},
"actions": {
"StopEC2": {
"actionId": "aws:ec2:stop-instances",
"parameters": {},
"targets": {
"Instances": "target_name"
}
}
},
"roleArn": "arn:aws:iam::123456789012:role/AllowFISActions"
}
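For actions that are targeted through resourceArns, the same template can be generated from a recommendation rather than written by hand. The following is a minimal sketch under stated assumptions: the function name, the "recommended-target" and "recommended-action" labels, and the choice of a "none" stop condition mirror the example above and are not prescribed by either service. It does not cover SSM document, Amazon ECS, Amazon EKS, Amazon ElastiCache, or Amazon S3 actions.

```python
import json


def build_template(reference_id, resource_arns, role_arn, target_key, resource_type):
    """Build an experiment template dict for a resourceArns-based action.

    target_key and resource_type are the values reported by
    `aws fis get-action` for the action (for example "Instances"
    and "aws:ec2:instance").
    """
    target_name = "recommended-target"  # user-defined label
    return {
        "description": "Experiment based on ARH recommendation",
        "stopConditions": [{"source": "none"}],
        "targets": {
            target_name: {
                "resourceType": resource_type,
                "resourceArns": list(resource_arns),
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "recommended-action": {
                "actionId": reference_id,
                "parameters": {},
                # The value must match the key used under "targets".
                "targets": {target_key: target_name},
            }
        },
        "roleArn": role_arn,
    }


template = build_template(
    "aws:ec2:stop-instances",
    ["arn:aws:ec2:us-east-2:123456789012:instance/i-0704a3bd911e7139d"],
    "arn:aws:iam::123456789012:role/AllowFISActions",
    "Instances",
    "aws:ec2:instance",
)
print(json.dumps(template, indent=2))
```

Saving the printed JSON as experiment-template.json gives a file that can be passed to the create-experiment-template command shown earlier.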
Although our example doesn’t use parameters or resourceTags, it’s important to understand how to construct these elements for actions that don’t rely on resourceArns. This knowledge is valuable when designing experiments for architectures that require different target specifications.
Construct the parameters attribute
To target Amazon ECS task actions, you’ll need to provide the cluster and service parameters. Both can be derived from the testRecommendations resourceId, which has the following format for this type of action: arn:aws:ecs:<region>:<account_id>:service/<cluster_name>/<service_name>.
By parsing this resourceId you can extract the necessary information and construct your parameters value as follows:
"parameters" : {
"cluster": "<cluster_name>",
"service": "<service_name>"
}
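Parsing the Amazon ECS resourceId can be sketched as follows. The function name is hypothetical, and the sample ARN uses made-up cluster and service names in the format given above.

```python
def ecs_parameters(resource_id):
    """Extract the cluster and service parameters from an ECS service ARN.

    Expected format:
    arn:aws:ecs:<region>:<account_id>:service/<cluster_name>/<service_name>
    """
    parts = resource_id.split(":", 5)
    if len(parts) != 6:
        raise ValueError(f"not an ARN: {resource_id}")
    segments = parts[5].split("/")
    if len(segments) != 3 or segments[0] != "service":
        raise ValueError(f"unexpected ECS resource: {parts[5]}")
    _, cluster_name, service_name = segments
    return {"cluster": cluster_name, "service": service_name}


print(ecs_parameters("arn:aws:ecs:us-east-2:123456789012:service/demo-cluster/demo-service"))
```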
To target Amazon EKS actions, you’ll need to provide the clusterIdentifier, namespace, and deploymentName parameters. The testRecommendations resourceId is formatted as follows for these actions: <cluster_arn>/<namespace>/<deployment_name>
Parse this resourceId to extract the necessary information and construct your parameters value accordingly with the following:
"parameters" : {
"clusterIdentifier" : "<cluster_arn>",
"namespace" : "<namespace>",
"selectorType" : "deploymentName",
"selectorValue" : "<deployment_name>"
}
}
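Because the cluster ARN itself contains a slash, the Amazon EKS resourceId has to be split from the right. A minimal sketch, assuming the namespace and deployment name contain no slashes; the function name and sample values are illustrative.

```python
def eks_parameters(resource_id):
    """Build the parameters element from <cluster_arn>/<namespace>/<deployment_name>."""
    pieces = resource_id.rsplit("/", 2)
    if len(pieces) != 3:
        raise ValueError(f"unexpected EKS resourceId: {resource_id}")
    cluster_arn, namespace, deployment_name = pieces
    return {
        "clusterIdentifier": cluster_arn,
        "namespace": namespace,
        "selectorType": "deploymentName",
        "selectorValue": deployment_name,
    }


print(eks_parameters("arn:aws:eks:us-east-2:123456789012:cluster/demo-cluster/default/web"))
```

Splitting from the left on every slash would cut the cluster ARN in two, which is why rsplit with a limit of 2 is used here.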
Construct the resourceTags attribute
To target Amazon ElastiCache and Amazon S3 actions, you’ll need to specify tags in the resourceTags element. You can also use resourceTags instead of resourceArns to target all other types of actions (except Amazon ECS and Amazon EKS actions).
The resourceTags value in your JSON structure should then have the following format:
"resourceTags" : {
"<tag_key>": "<tag_value>"
}
We recommend one of the following two approaches for tagging resources in AWS Fault Injection Service experiments:
- Implement a Targeted Tagging Strategy: Develop a tagging strategy that applies only to the resources you will subject to AWS Fault Injection Service tests. This method ensures that only the designated resources are included in the experiment.
- Apply a Specific Tag before experiment: First, apply a specific tag to the resources you intend to test. Then, specify this tag in your resourceTags parameter. For example, you can tag resources intended for testing with a specific identifier like this:
aws resourcegroupstaggingapi tag-resources --resource-arn-list <resource-arns> --tags <tag_key>=<tag_value>
Use a <tag_key> and <tag_value> such as aws_resilience_hub_recommendationId and the recommendationId value, respectively, where recommendationId is provided in the testRecommendations response. If you want, you can remove the tag after the experiment with the untag-resources subcommand.
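The second tagging approach can be sketched as a helper that produces both the tag-resources command and the matching resourceTags element from one recommendation. The function name is hypothetical, and the sketch assumes every resourceId in the recommendation is a plain ARN that the tagging API accepts.

```python
TAG_KEY = "aws_resilience_hub_recommendationId"


def tag_plan(recommendation):
    """Return (command, resource_tags) for one testRecommendations entry.

    command is an argument list for the AWS CLI; resource_tags is the
    value to place in the target's resourceTags element.
    """
    tag_value = recommendation["recommendationId"]
    arns = [item["resourceId"] for item in recommendation["items"]]
    command = [
        "aws", "resourcegroupstaggingapi", "tag-resources",
        "--resource-arn-list", *arns,
        "--tags", f"{TAG_KEY}={tag_value}",
    ]
    return command, {TAG_KEY: tag_value}


recommendation = {
    "recommendationId": "f290b4b9-fed5-4786-8e08-e7c42aae2ef8",
    "items": [
        {"resourceId": "arn:aws:ec2:us-east-2:123456789012:instance/i-0704a3bd911e7139d"}
    ],
}
command, resource_tags = tag_plan(recommendation)
print(" ".join(command))
print(resource_tags)
```

Keeping the command as a list makes it straightforward to pass to subprocess.run without shell quoting concerns.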
Now that we’ve constructed our input JSON file, we are ready to create the AWS Fault Injection Service experiment template. The successful creation of the template using the create-experiment-template command returns a unique identifier, referred to as the id. This id is a required input when executing the actual experiment in the next step.
Step 4 – Execute and Monitor Experiments
With the AWS Fault Injection Service experiment template now created, we can execute and monitor the experiment using the established flow. This process uses existing AWS Fault Injection Service AWS CLI commands or API calls to start the experiment and track its progress. If you need a refresher on the specific commands or steps, refer to the AWS Fault Injection Service documentation on starting an experiment and on monitoring experiments.
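For automation, starting the experiment and waiting for it to finish can be sketched as a polling loop. This sketch assumes a boto3 FIS client (boto3 is not used elsewhere in this post) and relies on its start_experiment and get_experiment calls. The function name, the polling interval, and the set of terminal states are choices made for this illustration.

```python
import time

# States after which the experiment no longer changes.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id, poll_seconds=15, sleep=time.sleep):
    """Start an experiment from a template and wait for a terminal state.

    fis_client is expected to behave like boto3.client("fis").
    Returns (experiment_id, final_status).
    """
    experiment = fis_client.start_experiment(
        experimentTemplateId=template_id
    )["experiment"]
    while experiment["state"]["status"] not in TERMINAL_STATES:
        sleep(poll_seconds)
        experiment = fis_client.get_experiment(id=experiment["id"])["experiment"]
    return experiment["id"], experiment["state"]["status"]


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   experiment_id, status = run_experiment(boto3.client("fis"), "<template_id>")
```

Passing the client and the sleep function as arguments keeps the loop easy to exercise with a stub before pointing it at a real account.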
Step 5 – Reassessment and Score Update
Upon completion of the experiment, initiate a reassessment of the application using the same AWS Resilience Hub commands as in Step 1. Once the reassessment completes, you can verify improvements in its resilience score using the AWS Resilience Hub describe-app-assessment command.
An increased score confirms that the application has successfully undergone the prescribed chaos engineering test. The output will display the updated resilience score as part of the assessment object, as shown in the following:
"resiliencyScore": {
"componentScore": { "..." },
"Compliance": { "..." },
"Sop": { "..." },
"Test": { "..." },
"disruptionScore": { "..." },
"score": 0.42
}
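In a pipeline, the two assessments can be compared programmatically. The following is a minimal sketch, assuming both describe-app-assessment outputs have been parsed into dictionaries with the score at assessment.resiliencyScore.score. The function name and the "before" value are made up for illustration.

```python
def score_delta(before, after):
    """Return the change in resilience score between two assessments."""
    old = before["assessment"]["resiliencyScore"]["score"]
    new = after["assessment"]["resiliencyScore"]["score"]
    # Round to avoid floating-point noise in the difference.
    return round(new - old, 4)


before = {"assessment": {"resiliencyScore": {"score": 0.30}}}  # hypothetical
after = {"assessment": {"resiliencyScore": {"score": 0.42}}}

delta = score_delta(before, after)
print(f"Resilience score changed by {delta:+.2f}")
```

A CI job could fail the build when the delta is negative, which flags changes that reduced the score.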
We execute the list-test-recommendations command again to retrieve the updated test recommendations, focusing on the alreadyImplemented and latestDiscoveredExperiment fields. Notably, you’ll see that the alreadyImplemented property has changed to true. The latestDiscoveredExperiment field now displays the experimentArn and experimentTemplateId, providing details about the most recent experiment execution. This status update serves as concrete evidence that the experiment has been successfully executed and integrated into the resilience testing strategy.
The following is the example list-test-recommendations response JSON, which shows the updated results:
{
"alreadyImplemented": true,
"excluded": false,
"latestDiscoveredExperiment": {
"experimentArn": "arn:aws:fis:us-east-1:123456789012:experiment/EXPXsqxydxakRwCVHU",
"experimentTemplateId": "EXTASiNmAcTW6TEMe"
},
"resourceId": "arn:aws:ec2:us-east-1:123456789012:instance/i-01acafdaee4784ae4",
"targetAccountId": "123456789012",
"targetRegion": "us-east-1"
}
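Checking these fields can also be automated. The sketch below assumes the item shown above sits inside testRecommendations[].items[], as in the Step 2 response. The function name is illustrative.

```python
def implemented_items(response):
    """Map each implemented resourceId to its latest experiment template ID.

    Items without alreadyImplemented set to true are left out.
    """
    found = {}
    for recommendation in response.get("testRecommendations", []):
        for item in recommendation.get("items", []):
            if item.get("alreadyImplemented"):
                experiment = item.get("latestDiscoveredExperiment", {})
                found[item["resourceId"]] = experiment.get("experimentTemplateId")
    return found


updated = {
    "testRecommendations": [
        {
            "items": [
                {
                    "alreadyImplemented": True,
                    "excluded": False,
                    "latestDiscoveredExperiment": {
                        "experimentArn": "arn:aws:fis:us-east-1:123456789012:experiment/EXPXsqxydxakRwCVHU",
                        "experimentTemplateId": "EXTASiNmAcTW6TEMe",
                    },
                    "resourceId": "arn:aws:ec2:us-east-1:123456789012:instance/i-01acafdaee4784ae4",
                }
            ]
        }
    ]
}
print(implemented_items(updated))
```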
Summary
We’ve explored the enhanced API integration between AWS Resilience Hub and AWS Fault Injection Service using the AWS CLI. By incorporating these regular post-experiment application reassessments into your CI/CD pipelines or automation workflows, you’re enabling continuous tracking of improvements in your application’s resilience.
This integration allows you to automatically identify areas requiring attention after each experiment or deployment cycle. Remember that resilience is an ongoing process.
As you iterate on your applications, these automated assessments provide valuable insights into the impact of your changes on overall resilience. This helps your team maintain and enhance your application’s ability to withstand and recover from disruptions over time.
Further Reading
- Implementing recommended experiments using the AWS Resilience Hub console
- Resilience Lifecycle Framework
- AWS best practices for resilience testing
- Leverage AWS Resilience Lifecycle Framework to assess and improve the resilience of application using AWS Resilience Hub