Containers
Automating Amazon EKS cluster testing with custom machine images
AWS recently launched a new service, EC2 Image Builder, which automates and simplifies the creation, maintenance, and validation of Amazon Machine Images (AMIs). Many of our customers are using this service to generate their own customized, hardened images.
In this post, we will demonstrate how you can automatically test your Amazon Elastic Kubernetes Service (Amazon EKS) cluster with a newly baked image. Currently, EKS managed node groups do not support custom AMIs. This post will be updated once that feature is launched.
Overview of solution
In this solution, we will use AWS CDK to provision an EKS cluster with self-managed worker nodes. We will generate an EKS cluster with a blue worker node group. Then, generate a test node group (with taints) to do preliminary validation of the new AMI. If all the tests pass, we will deploy a green worker node group with the new AMI and gracefully move workloads over. This solution also provides an AWS CodePipeline pipeline, which automates this process.
Please note that in this solution, we are not considering PodDisruptionBudget objects (PDB). If a PDB is configured for a deployment, Kubernetes will prevent nodes from being drained.
The steps executed by the pipeline are as follows:
- Generate a new EKS cluster with a blue worker node group. The autoscaling group’s launch configuration has the AMI ID used by the blue worker node group.
- When a new AMI is released, deploy a test worker node group, “Test”, to use the new AMI and also launch a single node. This is a temporary node group used just to validate the new AMI.
- Execute all necessary tests on “Test”. If successful, deploy the green worker node group with the full capacity and migrate applications over from blue to green.
- Scale blue worker nodes down to zero once all the pods have migrated to the green worker node group. Note the following:
- The blue and green worker node groups are provisioned with the same number of workers. The number of workers in the node group should not be decreased during this migration.
- If a deployment only has 1 replica, the node on which it is running will need to be terminated manually. We recommend increasing the number of replicas to force Kubernetes to balance out the load between the worker node groups.
- If PDB is being used, the number of replicas and the disruption budget needs to be configured to allow deployments to be migrated from one worker node group to another without interruptions.
Prerequisities
Before starting you need the following:
- An AWS account
- VPC and Subnets for EKS
- An IAM user, which will be granted administrator privileges to the EKS cluster.
- A custom EKS Optimized AMI
- AWS CLI
- AWS CDK
- kubectl
Getting Started
- Install AWS CDK if you haven’t done so already. Instructions here: https://docs.thinkwithwp.com/cdk/latest/guide/getting_started.html
- Clone the git repo
git clone https://github.com/aws-samples/eks-ami-tester cd eks-ami-tester
- Update the CDK project configuration (config/project-config.json) according to your environment:
{ "account_id": "XXXXXXXXXXXX", "region": " XXXXXXXXXXXX ", "vpc_id": "vpc-XXXXXXXXXXXX", "cluster_name": "AMI-Test", "eksadmin_user_name": "XXXXXXXXXXXX", "office_eks_api_cidr": "XXXXXXXXXXXX", "default_worker_count": 1, "default_ami_name": "amazon-eks-node-1.15-v20200406", "test_ami_name": "amazon-eks-node-1.15-v20200406", "pipeline_approval_email": "XXXXXXXXXXXX", "next_version": "Green" }
- Build CDK application and deploy the EKS cluster and the blue worker node group. Finally, deploy the CodePipeline stack which automates this entire process.
npm install npm run build cdk deploy AmiTestEksClusterIam cdk deploy AmiTestEksCluster cdk deploy AmiTestEksClusterWorkersBlue -c version=Blue cdk deploy AmiTestEksWorkerPipeline
The CodePipeline stack automates the worker node group migration. The manual steps below deep dive into the automation and walk you through it step by step.
Manual Steps
Deploying a new node group
Deploy the test AMI defined in “test_ami_name
” to the green test worker node group.
- Authenticate the EKS cluster using the credentials of “eksadmin_user_name” and deploy sample pods.
aws eks update-kubeconfig --name EKSCLUSTER_NAME --region AWS_REGION kubectl apply -f scripts/nginx-deployment.yaml kubectl get pods -l app=nginx NAME READY STATUS RESTARTS AGE nginx-deployment-7fd6966748-cchsz 1/1 Running 0 10s nginx-deployment-7fd6966748-hrvdf 1/1 Running 0 10s nginx-deployment-7fd6966748-zsfdx 1/1 Running 0 10s
- Provision the “Test” worker node group by running the Test AMI and deploy a test application.
- When test is set to true the nodes are tainted with DeployGroup=Test:NoSchedule
- The test application is being deployed with tolerations. The toleration is set in this file:
scripts/nginx-deployment-testapp.yaml
cdk deploy AmiTestEksClusterWorkersTest -c version=Test -c test=true kubectl get nodes -l DeployGroup=Test NAME STATUS ROLES AGE VERSION ip-X-X-X-X.ca-central-1.compute.internal Ready <none> 15s v1.15.10-eks-bac369 kubectl apply -f scripts/nginx-deployment-testapp.yaml kubectl get pods -l app=nginx-testapp NAME READY STATUS RESTARTS AGE nginx-deployment-testapp-79cc9cc8b-8cpm4 1/1 Running 0 51s nginx-deployment-testapp-79cc9cc8b-hgl8d 1/1 Running 0 51s nginx-deployment-testapp-79cc9cc8b-krqrv 1/1 Running 0 51s kubectl get deployment -l app=nginx-testapp -o jsonpath={.items[].status.conditions[].status} True
Transition pods to new worker node group
- Delete the test application and disable the “Test” worker node group by setting the desired count to zero.
kubectl delete -f scripts/nginx-deployment-testapp.yaml cdk deploy AmiTestEksClusterWorkersTest -c version=Test -c test=true -c desiredCount=0
- Deploy the green worker node group and verify that it joined the EKS cluster.
cdk deploy AmiTestEksClusterWorkersGreen -c version=Green -c amiName=$(cat config/project-config.json | jq -r '.test_ami_name') kubectl get nodes -l DeployGroup=Green NAME STATUS ROLES AGE VERSION ip-X-X-X-X.ca-central-1.compute.internal Ready <none> 15s v1.15.10-eks-bac369
- Cordon the blue worker node group to stop the scheduling of the new pods.
kubectl cordon -l DeployGroup=Blue node/ip-X-X-X-X.ca-central-1.compute.internal cordoned
- Gracefully move the active pods over to the new green worker node group. Keep monitoring the Kubernetes cluster and the deployments to ensure their stability during this process.
kubectl drain -l DeployGroup=Blue --ignore-daemonsets --delete-local-data --grace-period=10 WARNING: ignoring DaemonSet-managed Pods: kube-system/aws-node-jw4mx, kube-system/kube-proxy-9cs4r evicting pod "nginx-deployment-7fd6966748-zsfdx" evicting pod "nginx-deployment-7fd6966748-hrvdf" evicting pod "nginx-deployment-7fd6966748-cchsz" evicting pod "coredns-64d7c8b445-9l2v7" pod/nginx-deployment-7fd6966748-hrvdf evicted pod/nginx-deployment-7fd6966748-zsfdx evicted pod/coredns-64d7c8b445-wbg9n evicted pod/nginx-deployment-7fd6966748-cchsz evicted node/ip-X-X-X-X.ca-central-1.compute.internal evicted kubectl get pods -l app=nginx NAME READY STATUS RESTARTS AGE nginx-deployment-7fd6966748-bl6gx 1/1 Running 0 10s nginx-deployment-7fd6966748-dcxmn 1/1 Running 0 10s nginx-deployment-7fd6966748-w5tnt 1/1 Running 0 10s
- Once the green worker node group is stable and running all the applications, we can disable the blue worker node group.
cdk deploy AmiTestEksClusterWorkersBlue -c version=Blue -c desiredCount=0
Automated Solution
As part of the prerequisites, you also deployed a stack called AmiTestEksWorkerPipeline
. This stack generates a new AWS CodeCommit repository along with pipeline which automates the manual steps described above. Once you navigate to CodeCommit you will see a new repository called EksUpdatePipelineRepo
. Push the contents of your eks-ami-tester
directory to this repository. This will trigger the EksUpdatePipeline
in CodePipeline.
The pipeline has the following stages:
Source: EksUpdatePipelineRepo acts as a source trigger for the the pipeline. The pipeline is triggered on every push to the master branch.
Test: Test stage executes steps from “Deploying a new node group” section.
Approve: Approve stage triggers the approval process by sending email to “pipeline_approval_email” specified in CDK project configuration. Manual approval is required before proceeding to the update stage. The default timeout for manual approval is 7 days.
Update: Update stage executes the steps from the “transition pods to new worker node group” section. This is an AWS CodeBuild stage which is leveraging kubectl. The CodeBuild default timeout is 8 hours. To request an increase, use the support center console.
Codepipeline assumes that either a blue or green node group has already being deployed to perform the updates. If the EKS cluster has no active node groups, pipeline will deploy the blue node group and exit by default. The pipeline also compares the value of “next_version” from CDK project configuration with the Kubernetes label (DeployGroup) of the deployed node group. These values have to be different in order to trigger the update. For example, if the deployed node group label is version blue, then the “next_version” defined in the CDK project configuration should be green.
Cleaning up
To avoid incurring future charges, delete the resources.
cdk destroy AmiTestEksClusterWorkersBlue -c version=Blue
cdk destroy AmiTestEksClusterWorkersTest -c version=Test -c test=true
cdk destroy AmiTestEksClusterWorkersGreen -c version=Green
cdk destroy AmiTestEksCluster
cdk destroy AmiTestEksClusterIam
cdk destroy AmiTestEksWorkerPipeline
Conclusion
Many of our customers have strict regulations which put constraints on the usage of AMIs. This article provides a starting point towards building a fully automated EKS AMI deployment solution. These steps can be executed by your CI/CD orchestration tool such as AWS CodePipeline.
Get involved
This project is currently under development. AWS welcomes issues and pull requests, and would love to hear your feedback.