AWS Cloud Operations Blog
Enhancing observability with a managed monitoring solution for Amazon EKS
Introduction
Keeping a watchful eye on your Kubernetes infrastructure is crucial for ensuring optimal performance, identifying bottlenecks, and troubleshooting issues promptly. In the ever-evolving world of cloud-native applications, Amazon Elastic Kubernetes Service (Amazon EKS) has emerged as a popular choice for deploying and managing containerized workloads. However, monitoring Kubernetes clusters can be challenging due to their complexity. AWS recently launched Amazon CloudWatch Container Insights to simplify the process. Now imagine a monitoring solution tailored specifically for your EKS clusters and built on open source tools, delivering real-time insights into the health and performance of your Kubernetes environment. With it, you can monitor a cluster’s real-time state to quickly identify issues or bottlenecks, spot problems such as memory leaks in individual containers through container-level metrics, and analyze them visually across the different cluster layers. With the combined power of Amazon Managed Grafana and Amazon Managed Service for Prometheus, you can now deploy an AWS-supported solution for monitoring EKS infrastructure.
With this solution, you can deploy a fully managed Prometheus backend to collect and store metrics from your EKS cluster, and use Amazon Managed Grafana to visualize them. A set of preconfigured dashboards gives you a holistic view of the health, performance, and resource utilization of your cluster, whether you’re managing a small development cluster or a large-scale production environment. From assessing overall cluster health to monitoring the control and data planes, you’ll have a comprehensive understanding of your Kubernetes ecosystem. Additionally, you can dive deeper into workload performance across namespaces, track resource usage (CPU, memory, disk, and network), and identify potential bottlenecks before they escalate. In the following sections, we’ll guide you through deploying the pre-built CloudFormation template and using the solution. Get ready to unlock a new level of visibility and control over your Amazon EKS infrastructure, so you can make informed decisions and tune your Kubernetes environment for optimal performance.
Prerequisites
You will need the following resources and tools to deploy the solution:
- AWS Command Line Interface (AWS CLI) version 2
- eksctl
- kubectl
- Helm
- jq
- git
- Amazon Managed Service for Prometheus
- Amazon Managed Grafana
Solution Overview
This AWS-managed solution offers a comprehensive monitoring framework for your Amazon Elastic Kubernetes Service (Amazon EKS) clusters. On the anticipatory front, you can drive intelligent scheduling decisions based on historical usage tracking, plan for future resource demands by analyzing current utilization data, and identify potential issues early by monitoring resource consumption trends at the namespace level. On the corrective front, you can quickly troubleshoot and reduce mean time to detection (MTTD) of issues across infrastructure and workloads using the pre-configured troubleshooting dashboard. Together, these capabilities help you stay ahead of performance bottlenecks, optimize resource utilization, and maintain a healthy and efficient Kubernetes environment through deep insights into your cluster’s health, performance, and resource usage.
To use this solution, we need an EKS cluster, an Amazon Managed Service for Prometheus workspace, and an Amazon Managed Grafana workspace. The first four steps below cover setting up these prerequisites. We then deploy the CloudFormation stack to provision the solution and visualize the results. Finally, we cover cleanup and the costs involved.
Fig 1. Data Flow diagram
Step 1: Set up the environment variables and artifacts
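A minimal sketch of this step might look like the following. The variable names and default values (AWS_REGION, EKS_CLUSTER_NAME) and the helper get_account_id are assumptions for illustration, not names prescribed by the solution.

```shell
# Hypothetical names and defaults; replace them with your own values.
export AWS_REGION="${AWS_REGION:-us-east-1}"
export EKS_CLUSTER_NAME="${EKS_CLUSTER_NAME:-eks-monitoring-demo}"

# Print the account ID of the credentials currently in use.
get_account_id() {
  aws sts get-caller-identity --query 'Account' --output text
}
```

With valid credentials configured, you could then run: export AWS_ACCOUNT_ID="$(get_account_id)"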
Step 2: Create an Amazon EKS Cluster
An Amazon EKS cluster can be created using the eksctl command line tool, which provides a simple way to get started with basic cluster creation and sensible defaults, as below.
Let’s create an IAM role with access to the cluster and store the results in environment variables.
Let’s create an access entry for the above IAM role and give it EKS cluster admin access.
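As a rough sketch, cluster creation with eksctl defaults can be wrapped in a small helper. The function name create_eks_cluster is hypothetical, and you choose the cluster name and Region.

```shell
# Create a cluster with eksctl defaults.
# $1 = cluster name, $2 = AWS Region (both chosen by you).
create_eks_cluster() {
  eksctl create cluster --name "$1" --region "$2"
}
```

For example: create_eks_cluster eks-monitoring-demo us-east-1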
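One possible way to create such a role is sketched below. The trust policy is an assumption: it lets principals in your own account assume the role, which may be broader than you want. The role name and the helper names trust_policy and create_cluster_role are hypothetical.

```shell
# Build a trust policy that lets principals in the given account assume the role.
# $1 = 12-digit AWS account ID. (Assumed trust relationship; adjust as needed.)
trust_policy() {
  printf '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::%s:root"},"Action":"sts:AssumeRole"}]}' "$1"
}

# Create the role and print its ARN.
# $1 = role name (hypothetical), $2 = AWS account ID.
create_cluster_role() {
  aws iam create-role \
    --role-name "$1" \
    --assume-role-policy-document "$(trust_policy "$2")" \
    --query 'Role.Arn' --output text
}
```

You could capture the result in a variable, for example: export ROLE_ARN="$(create_cluster_role my-eks-admin-role 111122223333)"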
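The access entry can be sketched with two AWS CLI calls: one creates the entry, the other associates the AmazonEKSClusterAdminPolicy access policy at cluster scope. We assume here that this is the policy meant by cluster admin access and that the cluster's authentication mode allows access entries. The function name grant_cluster_admin is hypothetical.

```shell
# $1 = cluster name, $2 = ARN of the IAM role that should get admin access.
grant_cluster_admin() {
  # Register the role as a principal on the cluster, then attach the policy.
  aws eks create-access-entry --cluster-name "$1" --principal-arn "$2" &&
  aws eks associate-access-policy \
    --cluster-name "$1" \
    --principal-arn "$2" \
    --policy-arn "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" \
    --access-scope type=cluster
}
```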
Step 3: Create Amazon Managed Service for Prometheus Workspace
The ‘aws amp create-workspace’ command creates an Amazon Managed Service for Prometheus workspace with the alias ‘AMP-EKS’ in the specified AWS Region. A workspace is an isolated environment for storing and querying Prometheus metrics. The workspace is created with default settings, which can be customized later if needed. The call returns the ID of the newly created workspace. You need this ID to send metrics to the workspace from applications and to allow other services to access the data.
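A sketch of the command described above, wrapped in a hypothetical helper named create_amp_workspace that prints only the workspace ID:

```shell
# $1 = AWS Region. Prints the ID of the newly created workspace.
create_amp_workspace() {
  aws amp create-workspace --alias "AMP-EKS" --region "$1" \
    --query 'workspaceId' --output text
}
```

For example, to keep the ID for later steps: export AMP_WORKSPACE_ID="$(create_amp_workspace us-east-1)" (the variable name is an assumption).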
Step 4: Create Amazon Managed Grafana workspace
Create an Amazon Managed Grafana workspace compatible with Grafana version 9 by following the instructions here. You can also choose to assign users as “admin” to the workspace. Let’s get the Grafana workspace ID using the command below.
Create an API key with ADMIN access for calling the Grafana HTTP APIs using these instructions, and store it in the AMG_API_KEY variable. Then store the key in the Systems Manager Parameter Store as below.
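One way to look up the workspace ID by name is sketched here. The helper name get_grafana_workspace_id is hypothetical, and the workspace name is whatever you chose when creating it.

```shell
# $1 = workspace name as shown in the console. Prints the matching workspace ID.
get_grafana_workspace_id() {
  aws grafana list-workspaces \
    --query "workspaces[?name=='$1'].id" --output text
}
```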
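Storing the key could look like the sketch below. The parameter name is yours to choose, and using the SecureString type is our assumption, a reasonable default for a secret. The helper name store_grafana_api_key is hypothetical.

```shell
# $1 = parameter name (hypothetical, chosen by you), $2 = Grafana API key.
store_grafana_api_key() {
  # SecureString encrypts the value at rest; --overwrite allows re-running.
  aws ssm put-parameter --name "$1" --type "SecureString" --value "$2" --overwrite
}
```

For example: store_grafana_api_key "/eks-monitoring/grafana-api-key" "$AMG_API_KEY"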
Step 5: Deploy the solution using CloudFormation
Create an S3 bucket, get the solution files from the GitHub repo, and upload them to S3 using the commands below:
The uploaded files in S3 look like the figure below. Note the URL of eks-monitoring-cfn-template.json, as we will need it in the next steps.
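These three actions can be sketched as one helper. The function name upload_solution_files is hypothetical, and the repository URL is passed in as an argument rather than assumed here.

```shell
# $1 = bucket name, $2 = AWS Region,
# $3 = Git URL of the solution repository, $4 = local checkout directory.
upload_solution_files() {
  aws s3 mb "s3://$1" --region "$2" &&
  git clone "$3" "$4" &&
  # Copy the working tree but skip Git metadata.
  aws s3 cp "$4" "s3://$1/" --recursive --exclude ".git/*"
}
```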
Fig 2. S3 bucket showing the Solution files
You can provision the solution using CloudFormation via the CLI like so:
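A hedged sketch of the CLI route follows. The helper name deploy_monitoring_stack is hypothetical, the parameter keys must come from the template itself, and CAPABILITY_NAMED_IAM is an assumption that the template creates named IAM resources.

```shell
# $1 = stack name, $2 = S3 URL of eks-monitoring-cfn-template.json,
# remaining arguments = one or more ParameterKey=<key>,ParameterValue=<value> pairs.
deploy_monitoring_stack() {
  stack_name="$1"
  template_url="$2"
  shift 2
  aws cloudformation create-stack \
    --stack-name "$stack_name" \
    --template-url "$template_url" \
    --parameters "$@" \
    --capabilities CAPABILITY_NAMED_IAM &&
  # Block until CloudFormation reports that creation finished.
  aws cloudformation wait stack-create-complete --stack-name "$stack_name"
}
```

Pass at least one parameter pair, since --parameters expects a value.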
The other option is to use the AWS Console: go to CloudFormation → Create Stack and enter the values needed to create the resources, as in the example below:
Fig 3. CloudFormation screen showing sample values
Creating the stack takes around 20 minutes to complete. After the stack creation is complete, you must configure the Amazon EKS cluster to allow access from the newly created scraper. You can get the Scraper ID from your EKS cluster’s Observability tab. Use this ID, and follow these instructions to configure your Amazon EKS cluster for managed scraping.
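If you prefer the CLI to the console for finding the Scraper ID, a sketch such as the following lists the scrapers in a Region with their status and source cluster. The helper name list_scrapers is hypothetical.

```shell
# $1 = AWS Region. Prints scraper ID, status, and source cluster ARN per scraper.
list_scrapers() {
  aws amp list-scrapers --region "$1" \
    --query 'scrapers[].[scraperId,status.statusCode,source.eksConfiguration.clusterArn]' \
    --output table
}
```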
Step 6: Solution overview
Once the steps are completed, log in to your Amazon Managed Grafana workspace. Under Dashboards, you should see various dashboards under “EKS Infrastructure Monitoring”, as below. These include both infrastructure and workload dashboards.
Fig 4. Amazon Managed Grafana dashboards
The Cluster dashboard under Compute Resources shows the various metrics related to the cluster, as below. As you can see, CPU utilization is low because few workloads are running.
Fig 5. Amazon Managed Grafana Dashboard showing Cluster view
The Namespace (workload) dashboard provides similar information. You can think of this as parallel to the Namespace view in CloudWatch Container Insights.
Fig 6. Amazon Managed Grafana Dashboard showing Namespace view
The same is true of the workload view.
Fig 7. Amazon Managed Grafana Dashboard showing workloads view
You also get control plane views, such as the kube-apiserver view below, which shows the advanced kube-apiserver metrics.
Fig 8. Amazon Managed Grafana Dashboard showing advanced kube-apiserver view
You also get the kube-apiserver troubleshooting view below, which is helpful when troubleshooting your cluster.
Fig 9. Amazon Managed Grafana Dashboard showing troubleshooting kube-apiserver view
There is also a kubelet dashboard view.
Fig 10. Amazon Managed Grafana Dashboard showing Kubelet view
And last but not least, the Node dashboard view below shows CPU usage and load average. Again, since few workloads are running now, the charts do not show much variation. Together, these dashboards track a total of 88 metrics, and the full list of metrics is documented here.
Fig 11. Amazon Managed Grafana Dashboard showing Nodes view
Using the solution for performance monitoring
Let us deploy a workload and load test it to see the anticipatory capabilities. For this, we launch a Java application consisting of a Kubernetes Deployment and Service, using the Amazon Corretto JDK:
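As an illustration only, a Deployment and Service of this shape could be generated as follows. The names (java-app), replica count, ports, and helper name write_java_app_manifest are all assumptions, and you supply your own Corretto-based application image.

```shell
# $1 = container image built on Amazon Corretto (supply your own),
# $2 = path of the manifest file to write.
write_java_app_manifest() {
  cat > "$2" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
        - name: java-app
          image: $1
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: java-app
spec:
  selector:
    app: java-app
  ports:
    - port: 80
      targetPort: 8080
EOF
}
```

After writing the file, for example to java-app.yaml, apply it with: kubectl apply -f java-app.yaml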
Now let us stress-test this deployment using the wrk tool, as below. This spins up 64 threads that create 2,048 connections for a period of 15 minutes, targeting the service we created in the previous step.
After this, you should see CPU usage and load average spike in the Nodes dashboard, as below.
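The load test parameters above map directly onto wrk flags, as in this sketch. The helper name run_load_test is hypothetical, and the target URL depends on how your service is exposed.

```shell
# $1 = URL of the service under test.
# -t = threads, -c = total open connections, -d = test duration.
run_load_test() {
  wrk -t64 -c2048 -d15m "$1"
}
```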
Fig 12. Amazon Managed Grafana Dashboard Node view with CPU utilization
In the same way, the Cluster dashboard should also show the high CPU utilization, as below.
Fig 13. Amazon Managed Grafana Dashboard Cluster view with CPU utilization
Cleanup
Use the following commands to delete resources created during this post:
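A cleanup sequence along these lines would remove the resources created in this walkthrough. It is a sketch: the helper name cleanup_monitoring_solution is hypothetical, all identifiers are passed as arguments, and any IAM role or parameter you created by hand still needs to be removed separately.

```shell
# $1 = stack name, $2 = Prometheus workspace ID, $3 = Grafana workspace ID,
# $4 = S3 bucket name, $5 = cluster name, $6 = AWS Region.
cleanup_monitoring_solution() {
  # Delete the stack first, then the resources created by hand in earlier steps.
  aws cloudformation delete-stack --stack-name "$1" --region "$6"
  aws cloudformation wait stack-delete-complete --stack-name "$1" --region "$6"
  aws amp delete-workspace --workspace-id "$2" --region "$6"
  aws grafana delete-workspace --workspace-id "$3" --region "$6"
  # --force empties the bucket before removing it.
  aws s3 rb "s3://$4" --force
  eksctl delete cluster --name "$5" --region "$6"
}
```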
Costs
This solution leverages AWS managed services, including Amazon Managed Grafana and Amazon Managed Service for Prometheus, to provide comprehensive monitoring and observability for your Amazon EKS clusters. While these services offer convenience and ease of use, it’s important to note that you will incur standard usage charges. These charges include costs associated with Amazon Managed Grafana workspace access by users, as well as metric ingestion and storage within Amazon Managed Service for Prometheus. The number of metrics ingested, and consequently the associated costs, will depend on the configuration and usage of your Amazon EKS cluster. You can monitor the ingestion and storage metrics through CloudWatch, as detailed in the Amazon Managed Service for Prometheus User Guide. Additionally, AWS provides a pricing calculator to help estimate the costs based on the number of nodes in your EKS cluster, which directly impacts the metric ingestion volume.
Conclusion
The AWS-managed solution for monitoring Amazon EKS clusters with Amazon Managed Grafana and Amazon Managed Service for Prometheus offers a comprehensive and streamlined approach to gaining deep insights into your Kubernetes infrastructure. By leveraging pre-configured dashboards and automated metric collection, you can effortlessly monitor the health and performance of your control and data planes, workloads, and resource utilization across namespaces. This solution empowers you with both anticipatory and corrective capabilities, enabling you to stay ahead of potential issues, optimize resource allocation, and troubleshoot problems quickly and effectively.
Throughout this walkthrough, you’ve learned how to set up the necessary components, including an EKS cluster, Managed Prometheus workspace, and Managed Grafana workspace. You’ve also deployed the CloudFormation template, which orchestrates the integration of these services, providing you with a unified monitoring solution tailored for your Amazon EKS environment. With the ability to visualize and analyze a wide range of metrics, from cluster-level metrics to workload-specific insights, you can make informed decisions, ensure optimal performance, and maintain a healthy and efficient Kubernetes ecosystem.
We’re looking forward to hearing from you about how we can improve this solution: for example, by adding support for logs, alerts, and traces, monitoring a fleet of EKS clusters, correlating telemetry, offering additional ways to provision the solution (such as Terraform), or really anything else that comes to mind.
To learn more about AWS Observability, see the following references:
• AWS Observability Best Practices Guide
• One Observability Workshop
• Terraform AWS Observability Accelerator
• CDK AWS Observability Accelerator