AWS Cloud Operations Blog
Automating Amazon EC2 Instances Monitoring with Prometheus EC2 Service Discovery and AWS Distro for OpenTelemetry
Traditionally, scraping Prometheus metrics from applications required manual updates to a configuration file, which is challenging in dynamic AWS environments where Amazon EC2 instances are frequently created or terminated. Maintaining these targets by hand is not only time consuming, it also introduces the risk of configuration errors and lacks the agility such environments demand.
In this blog post, we will demonstrate how Prometheus service discovery, particularly EC2 service discovery, can help overcome these challenges by providing the following benefits:
- Automatic target discovery
- Reduced manual effort and enhanced agility
- Minimized configuration errors
We will showcase how to configure the AWS Distro for OpenTelemetry (ADOT) collector to perform EC2 service discovery in order to dynamically identify the EC2 targets for scraping Prometheus metrics. Subsequently, we will simulate a dynamic environment to showcase how EC2 service discovery automatically updates the list of targets to be scraped. We will collect the Prometheus metrics in an Amazon Managed Service for Prometheus workspace and visualize them using Amazon Managed Grafana.
Solution Overview
To showcase the dynamic discovery of EC2 instance targets using EC2 service discovery, we are going to provision the following resources through AWS CloudFormation:
- An AWS Distro for OpenTelemetry (ADOT) collector running on an EC2 instance named ADOT_COLLECTOR to scrape Prometheus metrics.
- Two Amazon EC2 instances named APP_SERVER, launched by an AWS Auto Scaling group (ASG) named ApplicationASG. They will be configured to run node_exporter to expose OS-level Prometheus metrics.
- The ADOT collector is configured to dynamically identify these targets using EC2 service discovery and filter them based on tag-key=service_name and tag-value=node_exporter (see the configuration sketch below).
- An Amazon Managed Service for Prometheus workspace and an Amazon Managed Grafana workspace.
Figure 1: Solution Architecture
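For reference, the scrape configuration inside the ADOT collector's Prometheus receiver looks roughly like the sketch below. The Region, port, and label mapping shown here are assumptions for illustration; the authoritative configuration is provisioned by the CloudFormation template in the sample repository.

# Sketch of the collector's scrape configuration (illustrative values)
scrape_configs:
  - job_name: 'node_exporter'
    ec2_sd_configs:
      - region: eu-west-1                # assumed Region; use your own
        port: 9100                       # node_exporter default port
        filters:
          - name: tag:service_name       # discover only instances tagged service_name=node_exporter
            values:
              - node_exporter
    relabel_configs:
      - source_labels:
          - __meta_ec2_instance_id
        target_label: instance_id        # surface the EC2 instance ID as a metric label

With a configuration like this, any instance carrying the service_name=node_exporter tag is picked up automatically, without editing the configuration file.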
Prerequisites
- Before starting, make sure you have AWS CloudShell, a browser-based shell, set up in your AWS account and Region to run the commands described in this blog post.
- (Optional) We will be configuring user access through AWS IAM Identity Center for the Amazon Managed Grafana workspace. Make sure you have enabled IAM Identity Center in your AWS account.
Solution Walkthrough
To deploy the architecture shown in Figure 1, follow the steps below:
- From the AWS CloudShell command line interface, enter the following commands to clone the sample project from the aws-samples GitHub repository.

git clone https://github.com/aws-samples/amazon-ec2-dynamic-monitoring-with-prometheus-service-discovery.git
cd amazon-ec2-dynamic-monitoring-with-prometheus-service-discovery/templates
- Next, to provision the resources, enter the following commands. Replace <aws-region> with your AWS Region name.

AWS_REGION=<aws-region>
aws cloudformation create-stack --stack-name adot-ec2-service-discovery-demo --template-body file://adot_ec2_service_discovery_cfn.yml --capabilities CAPABILITY_IAM --region $AWS_REGION
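Stack creation takes a few minutes. If you would like to wait for it to finish and then review the stack outputs from the same shell, you can optionally run the following commands (an extra convenience step, not part of the sample repository):

# Optionally block until the stack finishes creating, then list its outputs
aws cloudformation wait stack-create-complete --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION
aws cloudformation describe-stacks --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION --query 'Stacks[0].Outputs' --output table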
Setting up Amazon Managed Grafana Workspace
A managed Grafana workspace has already been created using AWS CloudFormation. Next, you need to set up the following two configurations on this workspace:
- Amazon Managed Grafana lets you configure user access through AWS IAM Identity Center or other SAML-based identity providers (IdPs). In this post, we’re using the AWS IAM Identity Center option with Amazon Managed Grafana. To set up authentication and authorization, follow the instructions in the Amazon Managed Grafana User Guide for enabling AWS IAM Identity Center.
Figure 2: Example of Amazon Managed Grafana user access using AWS SSO.
- Further, follow these steps to configure Amazon Managed Service for Prometheus as a data source for this Amazon Managed Grafana workspace.
Figure 3: Configuring Amazon Managed Prometheus as data source for Amazon Managed Grafana
Visualizing Prometheus Metrics with Amazon Managed Grafana
Now, let’s visualize the Prometheus metrics that have been pushed by the ADOT collector to the Amazon Managed Service for Prometheus workspace.
Navigate to the Amazon Managed Grafana workspace from your AWS Management Console and choose the Workspace URL to sign in to your Grafana dashboard. As demonstrated in Figure 4, we are visualizing the Prometheus metric node_cpu_seconds_total for all the EC2 target instances that were dynamically discovered by the ADOT collector using EC2 service discovery.
Figure 4: Visualizing Prometheus metrics of dynamically scraped targets
Additionally, you can visualize Prometheus metrics for individual EC2 instance targets by utilizing the instance_id label, as shown in Figure 5.
Figure 5: Visualizing Prometheus metrics of a specific scraped target
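For example, in the Grafana Explore view a query along these lines restricts the metric to a single discovered target; the instance ID shown is a placeholder, so substitute one of the values from your own instance_id label:

# Prometheus metric filtered to one discovered instance (placeholder instance ID)
node_cpu_seconds_total{instance_id="i-0123456789abcdef0"}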
Simulating Dynamic EC2 Environment
To simulate a dynamic environment, we will increase the “Desired capacity” of the ApplicationASG Auto Scaling group. Currently, this ASG is configured with a minimum size of 2, a maximum size of 4, and a desired capacity of 2. We will adjust the Desired capacity value from 2 to 4. Follow the steps below to change this parameter:
Steps:
- Navigate to AWS CloudShell console.
- Run the following AWS CLI command in the terminal:
ASG_NAME=$(aws cloudformation describe-stacks --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION --query 'Stacks[0].Outputs[?OutputKey==`ASG`].OutputValue' --output text)
echo $ASG_NAME
aws autoscaling set-desired-capacity --auto-scaling-group-name $ASG_NAME --desired-capacity 4 --honor-cooldown --region $AWS_REGION
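If you would like to confirm that the new desired capacity has taken effect, you can optionally query the Auto Scaling group (a verification step added here for convenience, not part of the original walkthrough):

# Optionally confirm the desired capacity and the instances currently in the group
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names $ASG_NAME --region $AWS_REGION --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,Instances:Instances[].InstanceId}'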
Wait 2-5 minutes for the ADOT collector to identify the new EC2 targets launched by the ASG service. Then, navigate to your Amazon Managed Grafana console to visualize the associated Prometheus metrics for these targets (see Figure 6).
Figure 6: Visualizing Prometheus metrics of newly launched targets
This showcases how the ADOT collector leverages EC2 service discovery to identify newly added EC2 instances during scale-out activities in the Auto Scaling Group (ASG) and seamlessly collects Prometheus metrics from the newly identified targets, facilitating real-time monitoring and scalability within dynamic environments.
Let’s delve into how the ADOT collector manages to automatically identify these newly launched targets:
- The ADOT collector initiates a DescribeInstances API call, specifying filter parameters to search for instances tagged with service_name as the key and node_exporter as the value.
- The EC2 API responds with a filtered list of instances that meet the specified criteria. This updated list now includes the two recently launched instances from the ASG. The list is automatically refreshed based on the refresh_interval parameter.
- The filtered targets are then scraped by the ADOT collector to collect Prometheus metrics.
- The retrieved Prometheus metrics are subsequently pushed to the desired destination, in this scenario Amazon Managed Service for Prometheus.
- Amazon Managed Grafana then queries the Prometheus metrics from Amazon Managed Service for Prometheus.
Figure 7: Flow diagram of how ADOT collector performs EC2 service discovery
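If you are curious what this discovery call corresponds to outside the collector, a roughly equivalent AWS CLI query is shown below. It is purely illustrative, since the collector issues the API call for you; the running-state filter is added here only for readability.

# Roughly equivalent DescribeInstances call using the same tag filter
aws ec2 describe-instances --region $AWS_REGION --filters "Name=tag:service_name,Values=node_exporter" "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[].{Id:InstanceId,PrivateIp:PrivateIpAddress}' --output table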
Design Considerations
Here are some key design aspects you should consider while configuring EC2 service discovery with the ADOT collector on Amazon EC2.
1. IAM Role Permissions
When deploying the ADOT collector in conjunction with EC2 service discovery, make sure the EC2 instance IAM role is equipped with the ec2:DescribeInstances and ec2:DescribeAvailabilityZones permissions.
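As a reference, the corresponding policy statement might look like the following sketch inside a CloudFormation template; the logical IDs and role reference are illustrative, and the template used in this walkthrough already grants these permissions.

# Illustrative CloudFormation snippet: EC2 read permissions needed for service discovery
AdotCollectorDiscoveryPolicy:            # hypothetical logical ID
  Type: AWS::IAM::Policy
  Properties:
    PolicyName: adot-ec2-service-discovery
    Roles:
      - !Ref AdotCollectorInstanceRole   # hypothetical role defined elsewhere in the template
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action:
            - ec2:DescribeInstances
            - ec2:DescribeAvailabilityZones
          Resource: "*"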
2. DescribeInstances API Requests Limit
By default, the ADOT collector refreshes the list of EC2 instances every 60 seconds by making DescribeInstances API calls. You can configure the refresh_interval option to control how frequently the ADOT collector makes these API requests to update the list. An example of such a configuration is shown in the snippet below:
# EC2 service discovery with a refresh interval of 5 minutes
- job_name: 'node_exporter'
  ec2_sd_configs:
    - region: eu-west-1
      refresh_interval: 5m
Refer to Request throttling for the Amazon EC2 API for more information.
3. Configuring EC2 Security Groups
EC2 service discovery uses the EC2 instance's private IP address by default to scrape Prometheus metrics. For the ADOT collector to successfully scrape EC2 instances in a VPC, make sure the security group associated with your instances allows ingress traffic on the port that exposes the metrics. For instance, if your application exposes Prometheus metrics via TCP port 9100, make sure to allow ingress traffic specifically on this port within the security group settings.
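For example, assuming hypothetical security group IDs for the application instances and the collector, the following command opens TCP port 9100 from the collector's security group to the application instances' security group:

# Allow the collector's security group (placeholder sg-0col...) to reach port 9100
# on the application instances' security group (placeholder sg-0app...)
aws ec2 authorize-security-group-ingress --group-id sg-0app1111111111111 --protocol tcp --port 9100 --source-group sg-0col2222222222222 --region $AWS_REGION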
4. Tagging Strategies to Discover EC2 Instances
Tagging is a crucial aspect of effectively utilizing Prometheus EC2 service discovery. Employ essential metadata tags like Application or Service Name, Environment Name, and Role or Function to streamline grouping and identification of instances. Additionally, implement hierarchical tags, such as tier or cluster, to represent relationships and dependencies, facilitating organized monitoring.
These best practices empower selective and targeted discovery, ensuring efficient monitoring of EC2 instances in dynamic AWS environments. Further insights can be found in the Tagging Best Practices whitepaper.
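As an illustration of such a tagging strategy, the sketch below narrows discovery to instances that carry both the service_name tag used in this post and a hypothetical environment tag:

# Discover only production instances running node_exporter (the environment tag is illustrative)
ec2_sd_configs:
  - region: eu-west-1
    port: 9100
    filters:
      - name: tag:service_name
        values:
          - node_exporter
      - name: tag:environment
        values:
          - production

Because the EC2 API combines multiple filters with AND semantics, only instances matching every tag are discovered.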
5. Scaling ADOT Collector
Below are some strategies for scaling the ADOT collector running on EC2 instances when scraping a large number of targets:
- Vertical Scaling: Initiate the scaling process by vertically expanding your ADOT Collector instance. This involves allocating more CPU and memory resources. You can accomplish this by modifying the EC2 instance type on which the ADOT collector runs.
- Sharding by Availability Zones (AZ): In cases where you are scraping metrics from a vast array of EC2 instances spread across multiple Availability Zones (AZs) within a VPC, consider sharding the ADOT collector instance per AZ. This approach evenly distributes the workload across multiple ADOT collector instances. The snippet below is an example ADOT configuration to achieve this:
# ADOT Collector configuration to scrape targets from a specific Availability Zone "ap-south-1a"
---
ec2_sd_configs:
  - region: ap-south-1
    port: 9100
    filters:
      - name: availability-zone        # EC2 DescribeInstances filter for the Availability Zone
        values:
          - ap-south-1a
relabel_configs:
  - source_labels:
      - __meta_ec2_instance_id
    target_label: instance_id
- Sharding by Metrics Type: Another sharding approach is based on the type of metrics you want to collect. For example, if you are running node_exporter to gather infrastructure-level metrics and jmx_exporter to collect application-level metrics, you can distribute the collection of these metrics across two ADOT collector instances. Likewise, you can shard them based on the environment or application. Here’s a snippet of ADOT configuration to achieve this:

# Scraping targets running jmx_exporter by filtering on tag key "application" and value "JMX"
---
ec2_sd_configs:
  - region: ap-south-1
    port: 9999
    filters:
      - name: tag:application
        values:
          - JMX
relabel_configs:
  - source_labels:
      - __meta_ec2_instance_id
    target_label: instance_id
Cleaning up
To decommission all the resources deployed during the walkthrough, navigate to the AWS CloudShell command line interface and run the following command.
aws cloudformation delete-stack --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION
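Stack deletion can take a few minutes. If you want to confirm that all resources are gone before closing your session, you can optionally wait on the deletion:

# Optionally block until the stack and its resources are fully deleted
aws cloudformation wait stack-delete-complete --stack-name adot-ec2-service-discovery-demo --region $AWS_REGION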
Conclusion
In this blog post, we demonstrated how you can use EC2 service discovery with the AWS Distro for OpenTelemetry (ADOT) collector to automatically identify targets for scraping Prometheus metrics in dynamic EC2 environments. This significantly reduces the time spent manually maintaining the list of targets and also mitigates the risk of configuration errors.
We also highlighted key design considerations aimed at enhancing operational efficiency and ensuring a more reliable monitoring process while using EC2 service discovery with ADOT collector. As a next step, we encourage you to try and customize this solution for your specific use cases in managing Prometheus metric scraping with ADOT collector in dynamic EC2 environments.
To learn more about AWS Observability services, check out the following resources:
- Hands-on experience with AWS Observability Workshop
- AWS Observability Best Practices Guide
- AWS Observability Accelerator for CDK
- AWS Observability Accelerator for Terraform
- Free course on AWS Skill Builder – Observability