AWS Cloud Operations Blog
Enhance Amazon EKS Containerized Application Resilience with AWS Resilience Hub
Building and managing resilient, microservice-based containerized applications in a distributed environment is hard; maintaining and operating them is even harder. Even though containerized applications running on Amazon Elastic Kubernetes Service (Amazon EKS) take advantage of the performance, scale, reliability, and availability of AWS infrastructure, we need to understand that failures will occur and we should always be prepared.
The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.” It is typically measured by two metrics: Recovery Time Objective (RTO), the time it takes to recover from a failure, and Recovery Point Objective (RPO), the maximum window of time in which data might be lost after an incident. Depending on your business and application, these can be measured in seconds, minutes, hours, or days.
It’s important to ensure your containerized applications have been developed using good resiliency design principles. In November 2021 we launched AWS Resilience Hub, a service that provides a central place to help organizations define, validate, and track the resilience of AWS native applications by analyzing the services that make up an application. We are excited to now announce Amazon EKS as our newest supported service under the AWS Resilience Hub umbrella.
In this post, we will show you how to proactively improve the resilience of modern containerized workloads running on Amazon EKS with AWS Resilience Hub. As part of the post, we will deploy an Amazon EKS cluster that will host a sample microservice-based application named sock-shop. After we deploy the sock-shop application, we will discover the resources in the application by adding the Amazon EKS cluster and cluster namespace to AWS Resilience Hub. We will then run the resiliency assessment to indicate whether the Amazon EKS hosted application is resilient. The estimated resiliency will be benchmarked against the target RPO and RTO metrics, which will be defined in a resiliency policy. Lastly, we will dive into the resiliency assessment report and recommendations generated by AWS Resilience Hub.
Solution overview
The following diagram depicts the architecture of the solution deployed as part of this blog.
Figure 1: Architecture Diagram
The solution in the blog includes the following services:
- AWS Resilience Hub
- Amazon EKS
- AWS IAM
- AWS Cloud9 (optional)
Prerequisites
Complete the following prerequisites before deploying the solution. You can use either AWS Cloud9 or an IDE of your choice.
- An AWS account with admin privileges: For this blog, we assume you already have an AWS account with admin privileges.
- Command line tools: Install the latest versions of the AWS CLI, aws-iam-authenticator, kubectl, and eksctl on your IDE workstation. Alternatively, create a Cloud9 environment in AWS and install these CLIs there.
Deploy EKS Cluster and Sample Application
Step 1: Create an EKS Cluster
To set up your workspace and get started with this post, open your favorite browser on your Mac/Linux/Windows workstation.
- Follow this tutorial to deploy an EKS cluster to use with this blog.
After you create your Amazon EKS cluster, you must configure your kubeconfig file using the AWS CLI. This configuration allows you to connect to your cluster using the kubectl command line. The following update-kubeconfig command creates a kubeconfig file for your cluster. To verify that the cluster is up and reachable, run any kubectl get command.
aws eks update-kubeconfig --region us-east-2 --name eks-resilience-cluster
kubectl get nodes
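As a rough sketch of the verification step above, the following helper counts nodes whose status is not exactly Ready. The function names count_not_ready and check_nodes are illustrative and not part of any AWS or Kubernetes tooling, and the sketch assumes the default column layout of kubectl get nodes.

```shell
#!/bin/sh
# Sketch: report whether every node in the cluster is Ready.
# count_not_ready reads `kubectl get nodes --no-headers` output on stdin
# (assumed columns: NAME STATUS ROLES AGE VERSION) and prints how many
# nodes have a STATUS other than exactly "Ready". A cordoned node
# ("Ready,SchedulingDisabled") is therefore counted as not Ready.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# check_nodes (hypothetical helper name) returns 0 only when at least one
# node exists and none is reported as anything other than Ready.
check_nodes() {
  nodes=$(kubectl get nodes --no-headers) || return 1
  if [ -z "$nodes" ]; then
    echo "No nodes found"
    return 1
  fi
  not_ready=$(printf '%s\n' "$nodes" | count_not_ready)
  if [ "$not_ready" -eq 0 ]; then
    echo "All nodes are Ready"
  else
    echo "$not_ready node(s) not Ready"
    return 1
  fi
}

# Usage: check_nodes
```

A non-zero return value is a signal to wait for the node group to finish provisioning before deploying the sample application.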
Step 2: Deploy sample application on Amazon EKS Cluster
Next, we deploy our sample application on the Amazon EKS cluster.
- Clone the sock-shop application repository into the working directory of your IDE, then change to the directory that contains the application deployment manifest. Open “complete-demo.yaml” in your favorite editor, change the service type of the front-end microservice from NodePort to LoadBalancer, and then deploy the application by running the kubectl apply command.
git clone https://github.com/microservices-demo/microservices-demo.git
cd ./microservices-demo/deploy/kubernetes
kubectl apply -f complete-demo.yaml
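If you prefer to script the manifest edit instead of using an editor, the following sketch rewrites the Service type with sed. It assumes that the front-end Service is the only resource in complete-demo.yaml that declares type: NodePort, which you may want to confirm with grep first. The function name nodeport_to_loadbalancer is illustrative. Run it before kubectl apply.

```shell
#!/bin/sh
# Sketch: switch Service type from NodePort to LoadBalancer in a manifest.
# Reads YAML on stdin and writes the edited YAML to stdout.
# Assumption: the front-end Service is the only resource in the manifest
# that declares "type: NodePort"; every such line is rewritten.
nodeport_to_loadbalancer() {
  sed 's/^\([[:space:]]*\)type:[[:space:]]*NodePort[[:space:]]*$/\1type: LoadBalancer/'
}

# Example usage (check the matches, then replace the file):
#   grep -n 'type: NodePort' complete-demo.yaml
#   nodeport_to_loadbalancer < complete-demo.yaml > complete-demo.tmp \
#     && mv complete-demo.tmp complete-demo.yaml
```

The pattern only matches lines whose entire content is the type field, so a nodePort port entry in the same Service is left untouched.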
- Test and verify that the sock-shop application is up and running, for example by listing the pods in the sock-shop namespace. You should see output similar to the following.
Figure 2: sock-shop application status
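One possible way to perform this check from the command line is sketched below. It assumes the manifest creates its resources in the sock-shop namespace and that the front-end Service is named front-end; the helper name show_sock_shop_status is illustrative.

```shell
#!/bin/sh
# Sketch: list the sock-shop pods and print the front-end load balancer host.
# Assumptions: resources live in the "sock-shop" namespace and the
# front-end Service is named "front-end".
NAMESPACE="${NAMESPACE:-sock-shop}"

show_sock_shop_status() {
  # Pod list; every pod should eventually show STATUS Running.
  kubectl get pods --namespace "$NAMESPACE"
  # DNS name of the load balancer created for the front-end Service.
  kubectl get service front-end --namespace "$NAMESPACE" \
    --output jsonpath='{.status.loadBalancer.ingress[0].hostname}'
  echo
}

# Usage: show_sock_shop_status
```

The hostname can stay empty for a short time while the load balancer is being provisioned.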
Step 3: Allow AWS Resilience Hub access to the EKS cluster
Amazon EKS cluster access using AWS Identity and Access Management (IAM) entities is enabled by the AWS IAM Authenticator for Kubernetes, which runs on the Amazon EKS control plane. The IAM authenticator gets its configuration information from the aws-auth ConfigMap. For more information see Enabling IAM user and role access to your cluster – Amazon EKS.
AWS Resilience Hub queries resources inside your Amazon EKS cluster by assuming an IAM role in your account. This IAM role is mapped to a Kubernetes group, which grants the permissions required to assess the Amazon EKS cluster.
Figure 3: IAM Process Flow
The following steps grant AWS Resilience Hub the required permissions to discover resources inside your Amazon EKS cluster.
- Create an IAM role named AwsResilienceHubAssessmentEKSAccessRole.
This role will be assumed by AWS Resilience Hub when importing and assessing your application. It will be mapped to a Kubernetes group that enables AWS Resilience Hub to assess our Amazon EKS cluster.
In AWS, we manage access by creating policies and attaching them to IAM identities (users, groups of users, or roles) or AWS resources. To define the trust policy and create the role, run the following commands.
# Look up the account ID of the current caller.
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
# Trust policy: allows principals in this account to assume the role.
export POLICY=$(printf '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::%s:root"},"Action":"sts:AssumeRole","Condition":{}}]}' "$ACCOUNT_ID")
# Create the role with that trust policy.
aws iam create-role \
  --role-name AwsResilienceHubAssessmentEKSAccessRole \
  --description="AWS Resilience Hub read only role (for AWS IAM Authenticator for Kubernetes)." \
  --assume-role-policy-document "$POLICY"
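To confirm that the role exists and carries the expected trust policy, a check along the following lines may help. The helper name show_trust_policy is illustrative; the command itself is the standard aws iam get-role call.

```shell
#!/bin/sh
# Sketch: print the trust policy of the role created above.
ROLE_NAME="${ROLE_NAME:-AwsResilienceHubAssessmentEKSAccessRole}"

show_trust_policy() {
  # Fails with a NoSuchEntity error if the role was not created.
  aws iam get-role --role-name "$ROLE_NAME" \
    --query 'Role.AssumeRolePolicyDocument' --output json
}

# Usage: show_trust_policy
```

The printed document should show sts:AssumeRole allowed for the root principal of your own account.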
- Create a Resilience Hub ClusterRole and RoleBinding/ClusterRoleBinding
To grant AWS Resilience Hub read access across all namespaces, create the required ClusterRole and ClusterRoleBinding by running the following command.
Note: In your production environment, scope this access to a particular namespace and follow the principle of least privilege by creating a Role and RoleBinding instead.
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: resilience-hub-eks-access-cluster-role
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - replicationcontrollers
      - nodes
    verbs:
      - get
      - list
  - apiGroups:
      - apps
    resources:
      - deployments
      - replicasets
    verbs:
      - get
      - list
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - get
      - list
  - apiGroups:
      - autoscaling.k8s.io
    resources:
      - verticalpodautoscalers
    verbs:
      - get
      - list
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - get
      - list
  - apiGroups:
      - karpenter.sh
    resources:
      - provisioners
    verbs:
      - get
      - list
  - apiGroups:
      - karpenter.k8s.aws
    resources:
      - awsnodetemplates
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: resilience-hub-eks-access-cluster-role-binding
subjects:
  - kind: Group
    name: resilience-hub-eks-access-group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: resilience-hub-eks-access-cluster-role
  apiGroup: rbac.authorization.k8s.io
---
EOF
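One way to sanity-check the binding is to impersonate a member of the group with kubectl auth can-i, as in the sketch below. The user name rh-check is hypothetical and exists only for the impersonation request, and the helper names are illustrative. Your own identity needs permission to impersonate users and groups, which cluster administrators normally have.

```shell
#!/bin/sh
# Sketch: check the ClusterRole by impersonating a member of the group.
# "rh-check" is a hypothetical user name used only for impersonation.
GROUP="resilience-hub-eks-access-group"

can_group() {
  # $1 = verb, $2 = resource (for example: list deployments.apps)
  kubectl auth can-i "$1" "$2" --all-namespaces \
    --as rh-check --as-group "$GROUP"
}

check_rh_rbac() {
  # Prints one "resource: yes|no" line per checked resource.
  for resource in pods deployments.apps replicasets.apps; do
    printf '%s: ' "$resource"
    can_group list "$resource"
  done
}

# Usage: check_rh_rbac
```

Each line should end in yes once the ClusterRole and ClusterRoleBinding have been applied.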
Then create a mapping between the IAM role AwsResilienceHubAssessmentEKSAccessRole and the Kubernetes group resilience-hub-eks-access-group, which grants the IAM role permission to access resources inside the Amazon EKS cluster.
eksctl create iamidentitymapping \
--cluster eks-resilience-cluster \
--region=us-east-2 \
--arn arn:aws:iam::"$ACCOUNT_ID":role/AwsResilienceHubAssessmentEKSAccessRole \
--group resilience-hub-eks-access-group \
--username AwsResilienceHubAssessmentEKSAccessRole
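To confirm that the mapping was recorded, you could list the identity mappings and look for the role, roughly as follows. The helper name mapping_exists is illustrative; the cluster name and Region are the ones used earlier in this post.

```shell
#!/bin/sh
# Sketch: confirm the IAM role is mapped in the cluster's identity mappings.
CLUSTER="${CLUSTER:-eks-resilience-cluster}"
REGION="${REGION:-us-east-2}"
ROLE_NAME="AwsResilienceHubAssessmentEKSAccessRole"

mapping_exists() {
  # Prints the matching row and returns non-zero when the role is absent.
  eksctl get iamidentitymapping --cluster "$CLUSTER" --region "$REGION" \
    | grep "role/${ROLE_NAME}"
}

# Usage: mapping_exists || echo "mapping not found"
```

The matching row should list resilience-hub-eks-access-group in its groups column.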
- Create an IAM role named AwsResilienceHubPeriodicAssessmentRole.
To allow AWS Resilience Hub to perform scheduled assessments, we must set up the IAM role and permissions that activate the daily assessment. With scheduled assessments, AWS Resilience Hub assesses your application daily. Each assessment uses the latest imported application to assess existing resources and their configuration changes.
For more information see AWS Resilience Hub Scheduled Assessment Role.
Running your first Resiliency assessment
The AWS Resilience Hub assessment uses best practices from the AWS Well-Architected Framework to analyze the components of an application and uncover potential resilience weaknesses. These weaknesses may be caused by incomplete infrastructure setup, misconfiguration, or architecture drift.
Follow the steps below to run a resiliency assessment of the sock-shop application running on Amazon EKS as deployed above.
Step 1: Add Amazon EKS Cluster and Sample Application to AWS Resilience Hub
- Launch AWS Resilience Hub → Click Add Application
Enter the following (screenshot below):
- Application Name: sock-shop
- Description: Sock-Shop Application Hosted on EKS
- How is this application managed? Select EKS Only
- Add EKS clusters
- Select EKS Clusters: select the eks-resilience-cluster
- Cross account or region: Leave blank. This optional field lets you specify the EKS cluster ARN if your EKS cluster is in a different account, a different Region, or both.
Figure 4: Add EKS to AWS Resilience Hub
- Add namespaces to each EKS cluster: select the eks-resilience-cluster and click Update Namespaces
Figure 5: Update EKS Namespace to AWS Resilience Hub
- Under Add namespace, enter sock-shop, check the box to use the namespaces and click Save
- Under Scheduled assessment, check the option that enables the required IAM roles and permissions, and then click Next
NOTE: AWS Resilience Hub can run a daily assessment of your application. You can turn off this setting and manually run the assessment on your own schedule. When enabled, the daily assessment schedule begins only after the application has been manually assessed successfully for the first time and only if the AwsResilienceHubPeriodicAssessmentRole IAM role exists. This setting is optional. For this blog, the required role was added in the steps above.
- After a few minutes, the supported resources from the sock-shop EKS cluster will be listed. You can select specific resource types to include or exclude in your assessment. For this blog, we will keep the defaults. Click Next
NOTE: At the time of this writing, AWS Resilience Hub supports discovering only Deployment, ReplicaSet, and Pod resources. Other Kubernetes resources will be supported in a future release.
- Under Select policy, click Create resiliency policy
- Under Create resiliency policy, select the following options and values for the purpose of this blog post
- Choose a creation method: Select a policy based on a suggested policy
- Policy name: sock-shop-foundational-core
- Suggested resiliency policies: Foundational Core Service
- Choose Create
- Select the policy and choose Next
- Review the configuration on the next page and choose Publish
Step 2: Run Resiliency assessment
- Under Applications on the AWS Resilience Hub, click your application sock-shop
- You can create and run a resiliency assessment in one of two ways:
- Click the Assessments tab and then Run new resiliency assessment
- OR click Assess resiliency
Figure 6: AWS Resilience Hub Workflow
- Name the report, for example sock-shop-res-assess, and then click Run
- The new assessment is listed with the status “Pending”. Refresh the list and the status changes to “In Progress”. The assessment takes a few minutes to finish, after which the status changes to “Success”
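Instead of refreshing the console, the status can also be polled with the AWS CLI. The sketch below is a rough illustration and rests on assumptions: the assessment ARN is copied from the console, and the status strings returned by the API (taken here to be Pending, InProgress, Success, and Failed) mirror the console labels. The helper name wait_for_assessment and the POLL_SECONDS variable are illustrative.

```shell
#!/bin/sh
# Sketch: poll an assessment until it reaches a final state.
# Assumption: the API reports the states Pending, InProgress, Success,
# and Failed, matching the labels shown in the console.
wait_for_assessment() {
  # $1 = assessment ARN (for example, copied from the console)
  while :; do
    status=$(aws resiliencehub describe-app-assessment \
      --assessment-arn "$1" \
      --query 'assessment.assessmentStatus' --output text) || return 1
    echo "status: $status"
    case "$status" in
      Success) return 0 ;;
      Failed)  return 1 ;;
    esac
    # Wait before asking again; 30 seconds is an arbitrary default.
    sleep "${POLL_SECONDS:-30}"
  done
}

# Usage: wait_for_assessment "<assessment-arn>"
```

The loop stops as soon as the assessment succeeds or fails, and returns non-zero on failure so it can gate later steps in a script.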
Reviewing your first Resiliency assessment and recommendations
The Resiliency assessment provides an overview of the assessment report. AWS Resilience Hub lists each disruption type and the associated application component. It also lists your actual RTO and RPO policies and determines whether the application component can achieve the policy goals.
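To make the comparison concrete, here is a simplified illustration of the rule of thumb behind “Policy met” and “Policy breached”: a disruption type meets the policy when both the estimated RTO and the estimated RPO are at or below their targets. This is not how AWS Resilience Hub is implemented, the function names are illustrative, and the numbers in the usage comment are made-up values in seconds.

```shell
#!/bin/sh
# Sketch: compare estimated RTO/RPO with policy targets (all in seconds).
# Simplified illustration only; not the service's actual evaluation logic.

# meets_target ESTIMATED_SECS TARGET_SECS
meets_target() {
  [ "$1" -le "$2" ]
}

# evaluate_policy EST_RTO TARGET_RTO EST_RPO TARGET_RPO
evaluate_policy() {
  if meets_target "$1" "$2" && meets_target "$3" "$4"; then
    echo "Policy met"
  else
    echo "Policy breached"
  fi
}

# Usage with hypothetical values: a 2 hour estimated RTO against a
# 1 hour target breaches the policy even though the RPO is within target.
#   evaluate_policy 7200 3600 300 900
```

Because both conditions must hold, a single disruption type whose estimated RTO or RPO exceeds its target is enough to mark the policy as breached.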
To review your assessment, follow the steps below.
- After the assessment status changes to “Success”, locate the sock-shop-res-assess report in the list.
- Next to the assessment name, you will see either “Policy met” or “Policy breached”. If you followed the instructions above, it will be “Policy breached”. Click the sock-shop-res-assess report to open it.
Figure 7: AWS Resilience Hub assessment
- The report is organized into three main tabs: Results, Resiliency recommendations, and Operational recommendations. The Results tab summarizes the estimated RTO and RPO against the targets. It also provides detailed descriptions for each disruption type (application, infrastructure, Availability Zone, and Region).
Figure 8: AWS Resilience Hub Recommendations
As you see above, AWS Resilience Hub has identified 14 breaches each across Infrastructure, Availability Zone and Region.
- Let's expand the Infrastructure breaches. Toggle the Infrastructure tab
- Click on the Estimated RTO for the top AppComponent. A pop-up explains in detail the reason for the AppComponent breach. Feel free to explore the other AppComponents in this list
Figure 9: AWS Resilience Hub AppComponent Recommendations
- Now that we have looked at the breaches, let's look at the Resiliency recommendations to fix the policy breaches. Resiliency recommendations evaluate application components and recommend changes optimized for RTO and RPO, cost, or minimal changes.
- Click on Resiliency recommendations tab
Figure 10: AWS Resilience Hub Resiliency Recommendations
- Under AppComponents, select the top component. You will see the benefits for fixing the AppComponent. For this selection you will see “Optimize for Cost, minimal changes and Best Region RTO/RPO” as the benefits. The Recommendation also suggests Changes to fix the policy compliance. In this example, there are 8 changes recommended to address the application readiness.
Figure 11: AWS Resilience Hub Resiliency Recommendations for EKS resources
Cleanup
When you’re done testing, delete the resources you created so that you’re no longer billed for them. To clean everything, follow these steps:
- Remove the application from AWS Resilience Hub :
- Go to AWS Resilience Hub Console → Click Applications → select “sock-shop” → Click “Actions” → Delete
Figure 12: AWS Resilience Hub Resiliency Recommendations for EKS resources
- Remove the sample application and the AWS Resilience Hub ClusterRole and ClusterRoleBinding from the Amazon EKS cluster by running the following commands in your terminal
cd ~/microservices-demo/deploy/kubernetes
kubectl delete -f complete-demo.yaml
kubectl delete clusterrolebinding resilience-hub-eks-access-cluster-role-binding
kubectl delete clusterrole resilience-hub-eks-access-cluster-role
- Delete the Amazon EKS Cluster and AWS IAM role from AWS Management console.
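If you would rather clean up from the command line, a sketch along these lines may work. It assumes the cluster was created with eksctl (if it was created another way, delete it from the console as described above) and that no policies were attached to the access role after it was created. It does not remove the AwsResilienceHubPeriodicAssessmentRole role. The helper name cleanup_aws_resources is illustrative.

```shell
#!/bin/sh
# Sketch: remove the identity mapping, the IAM role, and the cluster.
# Assumption: the cluster was created with eksctl.
CLUSTER="${CLUSTER:-eks-resilience-cluster}"
REGION="${REGION:-us-east-2}"
ROLE_NAME="AwsResilienceHubAssessmentEKSAccessRole"

cleanup_aws_resources() {
  account_id=$(aws sts get-caller-identity --query Account --output text) || return 1
  # Remove the role from the cluster's identity mappings.
  eksctl delete iamidentitymapping --cluster "$CLUSTER" --region "$REGION" \
    --arn "arn:aws:iam::${account_id}:role/${ROLE_NAME}"
  # Delete the IAM role (it must have no attached or inline policies).
  aws iam delete-role --role-name "$ROLE_NAME"
  # Delete the cluster and the resources eksctl created for it.
  eksctl delete cluster --name "$CLUSTER" --region "$REGION"
}

# Usage: cleanup_aws_resources
```

Run this only after the kubectl delete commands above, because deleting the cluster first removes the API endpoint those commands need.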
Summary
In this post, we looked at how to enhance the resilience of containerized applications on Amazon EKS with AWS Resilience Hub. We deployed a sample application on Amazon EKS and created a resiliency assessment for this application using AWS Resilience Hub. We reviewed the results of the assessment against the target RPO and RTO metrics defined in the resiliency policy. In the next post, we will demonstrate how you can run assessments of other Kubernetes resources such as StatefulSets, DaemonSets, Jobs, Services, Ingress, and ClusterAutoscaler using AWS Resilience Hub to uncover potential resilience weaknesses. Stay tuned!