Adding metrics and traces to your application on Amazon EKS with AWS Distro for OpenTelemetry, AWS X-Ray and Amazon CloudWatch
To make a system observable, it must be instrumented: code that emits traces, metrics, and logs must be added to the application, either manually, with libraries, or with automatic instrumentation agents. Once the application is deployed, its telemetry data is sent to the respective backend. There are a number of observability backends available, and the way code is instrumented varies from solution to solution.
In the past, there was no standardized data format for sending data to an observability backend. Additionally, if you chose to switch observability backends, you had to re-instrument your code and configure new agents to emit telemetry data to the new destination of your choice.
The OpenTelemetry (OTEL) project’s goal is to provide a set of standardized SDKs, APIs, and tools for ingesting, transforming, and sending data to an observability backend. AWS Distro for OpenTelemetry (ADOT) is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. With AWS Distro for OpenTelemetry, you can instrument your applications just once to send correlated metrics and traces to multiple monitoring solutions. AWS Distro for OpenTelemetry consists of SDKs, auto-instrumentation agents, collectors and exporters to send data to backend services.
In this blog post, we will introduce a sample application written in Python, the PetAdoptionsHistory microservice, to demonstrate how to add distributed tracing and metrics to your applications using the OpenTelemetry Python client SDKs. We will explain how you can use AWS Distro for OpenTelemetry (ADOT) to send the traces to AWS X-Ray and the metrics to Amazon CloudWatch. Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization. Amazon CloudWatch collects monitoring and operational data in the form of logs, metrics, and events.
In this blog, we will use Amazon CloudWatch ServiceLens, one of Amazon CloudWatch's capabilities, to provide us with a visual representation of the components that make up our application and how they are connected. Additionally, we can quickly drill down from the map nodes into the related metrics, logs, and traces. In particular, we will leverage CloudWatch ServiceLens to view the application architecture before and after adding instrumentation code to the PetAdoptionsHistory microservice and to review relevant metrics about the success and error rates of our service.
We will also highlight how to leverage CloudWatch Logs Insights to interactively search and analyze log data generated by our application, and we will create an Amazon CloudWatch dashboard to group relevant application metrics into a single visual dashboard.
The full code and step-by-step instructions are available directly in the One Observability Workshop. This blog post will highlight the most important parts and explain the relevant concepts on how to instrument the PetAdoptionsHistory microservice using OpenTelemetry Python client SDKs.
Architecture overview
The PetAdoptionsHistory microservice is written in Python and runs in a container on Amazon Elastic Kubernetes Service (EKS). The AWS Distro for OpenTelemetry (ADOT) collector is deployed in the same cluster and receives traces from the application. The collector is also configured to periodically scrape metrics from the application’s /metrics endpoint using HTTP.
The AWS Distro for OpenTelemetry (ADOT) collector is configured to publish traces to AWS X-Ray and sends metrics to Amazon CloudWatch.
Later in this blog, we will elaborate on the OpenTelemetry collector configuration to explain how the collector obtains its metrics and traces from the application and which services it publishes the collected data to.
Solution Walkthrough
Application overview
We will focus on a microservice called PetAdoptionsHistory. This microservice is part of a larger application, the PetAdoptions application, a web application that can be used to adopt pets. Each time a pet is adopted, a transaction is recorded in an Amazon Aurora PostgreSQL database.
The PetAdoptionsHistory microservice exposes APIs to query the transaction details and to clean up the historic data. Calls to this new service are made from the PetSite front-end service. Another service, the traffic generator, simulates human interactions with the PetSite website by periodically making calls to the front-end. These calls in turn result in calls to the PetAdoptionsHistory service, which either returns the recorded list of adoptions or clears the list of transactions from the database.
The PetAdoptionsHistory application uses:
- Flask to handle incoming requests from an AWS Application Load Balancer to the application
- psycopg2 to handle connectivity to the associated Amazon Aurora PostgreSQL database
The PetAdoptionsHistory application is deployed on an Amazon Elastic Kubernetes Service (EKS) cluster. Here's an overview of the Amazon CloudWatch service map before adding instrumentation to the service. The PetAdoptionsHistory service does not yet appear on this diagram.
Adding distributed tracing to the PetAdoptionsHistory microservice
Distributed tracing gives you deep visibility into requests as they travel across services and their backends (databases, external services) using correlation IDs. To get started with the OpenTelemetry SDK in Python, we have to import the OpenTelemetry tracing libraries and follow a few initialization steps. Let's break them down.
The OpenTelemetry SDK needs a pipeline to define how traces and metrics flow through the application. Currently, AWS X-Ray requires a specific format for trace IDs. For the Python SDK, this is handled by the AWSXRayIdGenerator. The tracing pipeline looks like the following.
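Here is a minimal sketch of that setup, assuming the opentelemetry-sdk-extension-aws and opentelemetry-propagator-aws-xray packages and an OTLP/gRPC exporter pointing at the ADOT collector (the exact snippet lives in the workshop repository):

```python
from opentelemetry import propagate, trace
from opentelemetry.propagators.aws import AwsXRayPropagator
from opentelemetry.sdk.extension.aws.trace import AwsXRayIdGenerator
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Propagate trace context in the X-Ray header format
propagate.set_global_textmap(AwsXRayPropagator())

# AwsXRayIdGenerator creates trace IDs in the format AWS X-Ray expects
provider = TracerProvider(id_generator=AwsXRayIdGenerator())

# Batch spans and ship them to the ADOT collector over OTLP/gRPC
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
```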
To actually generate trace data, we need to instrument portions of the application. The following snippet captures incoming HTTP requests from the Application Load Balancer.
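A sketch using the Flask auto-instrumentation package (this assumes opentelemetry-instrumentation-flask is installed):

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Automatically create a server span for every incoming HTTP request
FlaskInstrumentor().instrument_app(app)
```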
To capture database transactions, you can instrument the psycopg2 library.
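With the opentelemetry-instrumentation-psycopg2 package (an assumption; other supported driver instrumentations work the same way), this is a one-liner:

```python
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

# Wrap psycopg2 so every query produces a client span with SQL attributes
Psycopg2Instrumentor().instrument()
```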
To attach a service name to your captured traces, you can use OpenTelemetry resource attributes. On X-Ray, your service will appear under the name you set below.
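A sketch of wiring a resource with a service.name attribute into the tracer provider; the PetAdoptionsHistory name matches the node that appears later on the service map:

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.extension.aws.trace import AwsXRayIdGenerator

# The service.name resource attribute becomes the node name in X-Ray
resource = Resource.create({"service.name": "PetAdoptionsHistory"})
provider = TracerProvider(resource=resource, id_generator=AwsXRayIdGenerator())
```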
With the instrumentation above, you automatically get visibility into the database and HTTP transactions. Additionally, with custom spans, you can instrument any portion of your application. For example, the following snippet creates a span called transactions_delete for the DELETE HTTP calls that clean up the adoptions history database table.
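A sketch of such a handler, reusing the app and tracer objects from the snippets above; the route path and the database helper are hypothetical stand-ins for the workshop code:

```python
@app.route("/api/home/transactions", methods=["DELETE"])
def transactions_delete():
    # Wrap the clean-up in a custom span so it shows up as its own
    # segment in the X-Ray trace
    with tracer.start_as_current_span("transactions_delete"):
        delete_all_transactions()  # hypothetical database helper
    return "", 204
```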
You can find the full code snippet in the GitHub repository of the Observability Workshop.
Adding custom metrics to the PetAdoptionsHistory microservice
Instrumentation makes troubleshooting easier, and metrics give you a reliable way to see how your service is operating. With metrics, you can create alarms and get notified of anomalies based on predefined thresholds. More than 100 AWS services publish metrics to Amazon CloudWatch automatically, at no additional cost, to give you insights about their usage. For example, when using an AWS Application Load Balancer, you get Amazon CloudWatch metrics like HTTPCode_Target_2XX_Count, which gives you the number of 2XX HTTP response codes generated by the ALB targets.
To better understand your application, you can emit custom metrics that are based on the application’s business logic and create alerts based on relevant business criteria. One popular and effortless way to achieve that is through Prometheus. Prometheus is an open-source, metrics-based monitoring system. It has a simple yet powerful data model and a query language that lets you analyze how your application and infrastructure is performing.
OpenTelemetry and Prometheus provide libraries to generate custom metrics with minimal effort, such as the number of HTTP response codes dynamically broken down by endpoint, and to add your own business metrics. The OpenTelemetry SDK setup for metrics looks like the following.
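A minimal sketch, assuming the opentelemetry-exporter-prometheus package and a scrape endpoint on port 8080 (the port is an assumption; the collector configuration below must match it):

```python
from prometheus_client import start_http_server
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

# Expose the Prometheus /metrics endpoint the collector will scrape
start_http_server(port=8080)

# Route all recorded metrics through the Prometheus reader
reader = PrometheusMetricReader()
resource = Resource.create({"service.name": "PetAdoptionsHistory"})
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

meter = metrics.get_meter(__name__)
```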
In addition to the default Prometheus metrics, in our application we chose to track the total number of business transactions using the transactions_get_counter variable.
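A sketch of that counter; the metric name transactions_get_count is chosen so that, with the _total suffix Prometheus appends to counters, it matches the transactions_get_count_total metric referenced later:

```python
# Counter tracking how many times the transactions GET endpoint is called
transactions_get_counter = meter.create_counter(
    name="transactions_get_count",
    description="Number of GET calls on the transactions endpoint",
    unit="1",
)

# Inside the GET handler, increment the counter for every request
transactions_get_counter.add(1)
```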
Metrics can be of different types in Prometheus. Counter metrics are used for measurements that only increase, which means their value can only go up; the only exception is when a counter is restarted, in which case it is reset to zero. Gauges, on the other hand, are used for measurements that can arbitrarily increase or decrease. Examples of gauges are temperature, CPU utilization, memory usage, the size of a queue, and so on. In Python, gauges can be defined with a callback function, which returns the value of the gauge at the time it is invoked.
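A sketch of an observable gauge with such a callback, using a hypothetical helper that reads the current queue size:

```python
from opentelemetry.metrics import CallbackOptions, Observation

def observe_queue_size(options: CallbackOptions):
    # current_queue_size() is a hypothetical helper for this example
    yield Observation(value=current_queue_size())

# The callback is invoked each time the gauge is collected
queue_size_gauge = meter.create_observable_gauge(
    name="queue_size",
    callbacks=[observe_queue_size],
    description="Current number of items in the work queue",
)
```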
AWS Distro for OpenTelemetry
In this example, we used the AWS Distro for OpenTelemetry Collector to collect traces and metrics and send them to AWS X-Ray and Amazon CloudWatch. This is achieved using the OpenTelemetry configuration components. Once configured, these components must be enabled via pipelines, which define the data flow within the OpenTelemetry Collector. In the sections below, we explain the pipeline for our application, which uses three components:
- Receivers: a receiver, which can be push or pull based, is how data gets into the Collector.
- Processors: processors are run on data between being received and being exported.
- Exporters: an exporter, which can be push or pull based, is how you send data to one or more backends/destinations.
Configuration for AWS X-Ray
AWS X-Ray provides a complete view of requests as they travel through your application and visualizes data across payloads, functions, traces, services, and APIs. With AWS X-Ray, you can analyze your distributed traces and understand your overall system. Learn more about AWS X-Ray in the CloudWatch ServiceLens Map section of the workshop.
The receiver configuration below expects the application to send trace data to one of the listed endpoints using gRPC or HTTP.
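A sketch of an OTLP receiver entry using the standard OTLP ports (4317 for gRPC, 4318 for HTTP):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```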
Sending traces to AWS X-Ray is configured with the awsxray exporter defined below. Check out more advanced configuration options, such as AWS Region or proxy settings, in the Getting Started with X-Ray Receiver in AWS OpenTelemetry Collector section of the AWS Distro for OpenTelemetry (ADOT) website.
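A minimal sketch of the exporter entry; the Region shown is an example and can be omitted to fall back to the collector's default:

```yaml
exporters:
  awsxray:
    # Optional: override the AWS Region traces are sent to
    region: us-east-1
```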
Configuration for Amazon CloudWatch metrics
In our example, the AWS Distro for OpenTelemetry Collector collects the Prometheus metrics from the application and sends them to Amazon CloudWatch metrics.
With the receiver configuration below, the AWS Distro for OpenTelemetry collector scrapes the dedicated Prometheus metrics path via HTTP every 20 seconds (see the instrumentation section above). As AWS Distro for OpenTelemetry supports Prometheus configurations, we use the service discovery mechanisms to collect environment information such as the Kubernetes container and pod name.
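A sketch of such a receiver configuration, assuming Kubernetes pod-based service discovery and the port 8080 scrape endpoint from the metrics setup above; the job name is illustrative:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: petadoptionshistory
          scrape_interval: 20s
          metrics_path: /metrics
          # Discover application pods and attach pod/container metadata
          kubernetes_sd_configs:
            - role: pod
```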
To send these metrics to Amazon CloudWatch, we have configured the awsemf exporter, which uses the CloudWatch embedded metric format (EMF). EMF is a JSON specification used to instruct Amazon CloudWatch Logs to automatically extract metric values embedded in structured log events. This allows Prometheus-format metrics to be transformed into Amazon CloudWatch metrics.
In the snippet below, these metrics are created under the PetAdoptionsHistory namespace (a container for metrics) in Amazon CloudWatch metrics. The transactions_get_count_total metric is associated with two dimensions, pod_name and container_name.
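A sketch of that awsemf exporter entry, using the exporter's metric_declarations syntax:

```yaml
exporters:
  awsemf:
    # All extracted metrics land in this CloudWatch namespace
    namespace: PetAdoptionsHistory
    metric_declarations:
      - dimensions: [[pod_name, container_name]]
        metric_name_selectors:
          - transactions_get_count_total
```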
Pipeline definition
To tie everything together, the OpenTelemetry configuration needs a pipeline under the service definition. For our example, it looks as follows.
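A sketch wiring the receivers and exporters described above into a traces pipeline and a metrics pipeline:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [prometheus]
      exporters: [awsemf]
```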
The entire OpenTelemetry Collector configuration can be found in the One Observability Workshop repository.
Results
With instrumentation enabled within the application and deployed alongside the AWS Distro for OpenTelemetry collector, tracing data will flow to AWS X-Ray and metrics to Amazon CloudWatch.
CloudWatch Service Map
The CloudWatch Service Map displays your service endpoints and resources as “nodes” and highlights the traffic, latency, and errors for each node and its connections. You can choose a node to see detailed insights about the correlated metrics, logs, and traces associated with that part of the service. The end-to-end view of your application helps you to pinpoint performance bottlenecks and identify impacted users more efficiently. This enables you to investigate problems and their effect on the application.
Here is the updated service map with the PetAdoptionsHistory node and its connections.
Highlighting the PetAdoptionsHistory service in the Service Map reveals its connections to other entities.
Selecting the PetAdoptionsHistory node on the map lets us view relevant metrics such as latency, requests, and faults associated with this service, alongside a node map with the service and its connections.
Selecting a trace, we view not only the transactions_delete span set up above, but also the origin of the transaction all the way from the front-end website (PetSite).
Amazon CloudWatch Logs Insights
Amazon CloudWatch Logs Insights enables you to interactively search and analyze log data in Amazon CloudWatch Logs. This can be useful for troubleshooting; for example, node-level metric spikes can provide further insights into task-level errors. With the Amazon CloudWatch embedded metric format (EMF) being written to Amazon CloudWatch Logs, we can leverage Amazon CloudWatch Logs Insights to investigate when and how often a metric exceeded a given threshold. In our example, our query filters for transactions_history_count > 90.
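A sketch of such a Logs Insights query against the EMF log group (the selected fields are an assumption based on the metric names above):

```
fields @timestamp, transactions_history_count
| filter transactions_history_count > 90
| sort @timestamp desc
```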
Amazon CloudWatch Metrics
The Amazon CloudWatch metrics explorer organizes all metrics collected from the application inside the PetAdoptionsHistory namespace.
Now that we have both traces and metrics data, we can create Amazon CloudWatch dashboards for centralized visibility into how the application is performing.
Conclusion
AWS Distro for OpenTelemetry offers multiple possibilities to manage your observability data. In this post, we have shown you how to use OpenTelemetry client SDKs to instrument your applications. We have configured AWS Distro for OpenTelemetry collector on Amazon EKS to send the application traces to AWS X-Ray and the application metrics to Amazon CloudWatch. With this setup, you can correlate the metrics, logs and traces for your application using Amazon CloudWatch ServiceLens, an interactive map visualization service.
The One Observability Workshop lets you experiment with many AWS observability services. You can use the workshop in an AWS-led event with an account provisioned for you, or run it in your own account at your own pace. To run the PetAdoptionsHistory microservice yourself or explore other Amazon CloudWatch features such as Contributor Insights, Logs Insights, and more, review the respective sections of the One Observability Workshop.