How Vanguard uses AWS X-Ray and Amazon CloudWatch to improve observability for Amazon ECS cloud applications
This post was contributed by Jeffrey Emberger, Technical Lead, The Vanguard Group and John Formento, Solutions Architect, AWS.
Cloud applications are changing the speed at which companies can deliver new capabilities for their customers. With increased speed comes the need to observe cloud application health more quickly, reliably, and inexpensively. Observability is no longer an afterthought because more companies rely on the web for their survival. Downtime of even a few minutes can cost a company revenue, future customers, and reputation.
Observability must be accurate and timely. Development teams must be alerted to any latency, failure, or infrastructure issues in cloud applications before they lead to a critical business outage. Observability tools must provide a user-friendly experience to debug trace information when critical issues occur so the root cause can be identified as soon as possible.
In this blog post, we share how a development team at Vanguard went from using third-party observability tools to using AWS X-Ray and Amazon CloudWatch on one of their applications deployed on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate.
The following services are covered in this post.
Amazon ECS
Amazon ECS is a highly scalable, fast container management service that makes it easy to run, stop, and manage containers on a cluster. Your containers are defined in a task definition that you use to run individual tasks or tasks within a service. In this context, a service is a configuration that enables you to run and maintain a specified number of tasks simultaneously in a cluster. You can run your tasks and services on a serverless infrastructure that is managed by AWS Fargate. Or, for more control over your infrastructure, you can run your tasks and services on a cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances that you manage.
AWS Fargate
AWS Fargate is a serverless compute engine for containers that works with Amazon ECS and Amazon Elastic Kubernetes Service (Amazon EKS). Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation.
Fargate allocates the right amount of compute, eliminating the need to choose instances and scale cluster capacity. You only pay for the resources required to run your containers, so you avoid overprovisioning and paying for additional servers. Fargate runs each task or pod in its own kernel, which gives each task and pod its own isolated compute environment. This gives your application workload isolation and improved security by design. This is why customers such as Vanguard, Accenture, Foursquare, and Ancestry have chosen to run their mission-critical applications on Fargate.
AWS X-Ray
AWS X-Ray helps developers analyze and debug production, distributed applications, such as those built using a microservices architecture. With X-Ray, you can understand how your application and its underlying services are performing, to identify and troubleshoot the root cause of performance issues and errors. X-Ray provides an end-to-end view of requests as they travel through your application. It shows a map of your application’s underlying components. You can use X-Ray to analyze applications in development and in production, from simple three-tier applications to complex microservices applications that consist of thousands of services.
Amazon CloudWatch
Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. You can use CloudWatch to detect anomalous behavior in your environments, set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep your applications running smoothly.
Starting point
The Vanguard development team was responsible for supporting an application deployed on Amazon ECS and AWS Fargate that used third-party tools for monitoring and tracing. Alarms for the on-call developer were delayed because one of these tools had to receive logs from AWS to detect production issues. Another tool received trace information from AWS after executions occurred, but didn’t support custom logging or tracing. The development team wasn’t able to estimate how long it would take to resolve production issues.
Analysis
The development team looked for cloud-native observability tools that could increase reliability, accuracy, and timeliness. They reasoned that if they could find better tools, they could reduce business impact when their ECS cloud application experienced critical production issues.
The development team met with an AWS Solutions Architect who recommended AWS X-Ray and Amazon CloudWatch. The development team determined that those services would meet their observability needs and easily integrate with their ECS cloud application.
Implementation of AWS X-Ray
The development team decided to start by implementing AWS X-Ray in their ECS cloud application. In addition to the required AWS X-Ray dependencies, the team added a single custom annotation to their code that recorded one of the input fields of their ECS cloud application. They completed the coding changes without major issues, and before long their ECS cloud application was successfully working with AWS X-Ray.
The development team put AWS X-Ray to use a couple of days after their ECS cloud application was promoted to production. Some executions of their ECS cloud application resulted in no data being returned to consumer applications. Because the development team had added that custom annotation, they were able to filter by the input in AWS X-Ray to get a list of input values that weren’t found in Amazon DynamoDB. Without AWS X-Ray, the development team would have had to add logging statements to their ECS cloud application and go through the ECS logs manually to find the inputs whose rows were missing from DynamoDB. That manual search would have taken many hours. AWS X-Ray produced the list automatically.
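As a rough illustration of the annotation-and-filter approach described above, the following Python sketch records an input value as an X-Ray annotation and builds the filter expression used to search for it. The post does not show Vanguard's code or name the input field, so the annotation key `input_id`, the helper names, and the one-hour search window are all assumptions made for this example. The two functions that talk to AWS need the `aws-xray-sdk` and `boto3` packages plus credentials, so they import those libraries locally.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical annotation key; the post does not name the real input field.
ANNOTATION_KEY = "input_id"


def record_input_annotation(input_value: str) -> None:
    """Attach the input value to the active X-Ray segment or subsegment.

    Needs the aws-xray-sdk package and an active segment at call time.
    """
    from aws_xray_sdk.core import xray_recorder

    xray_recorder.put_annotation(ANNOTATION_KEY, input_value)


def annotation_filter(key: str, value: str) -> str:
    """Build an X-Ray filter expression that matches one annotation value."""
    if not re.fullmatch(r"[A-Za-z0-9_]+", key):
        raise ValueError("annotation keys may contain only letters, digits, and underscores")
    if '"' in value:
        # Simplification for this sketch: values containing quotes are rejected.
        raise ValueError("this sketch does not handle quotes inside values")
    return f'annotation.{key} = "{value}"'


def find_trace_ids(input_value: str, hours: int = 1) -> list[str]:
    """Return the IDs of recent traces annotated with the given input value.

    Needs boto3 and AWS credentials with permission to read X-Ray traces.
    """
    import boto3

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    paginator = boto3.client("xray").get_paginator("get_trace_summaries")
    pages = paginator.paginate(
        StartTime=start,
        EndTime=end,
        FilterExpression=annotation_filter(ANNOTATION_KEY, input_value),
    )
    return [summary["Id"] for page in pages for summary in page["TraceSummaries"]]
```

The same expression that `annotation_filter` returns can be pasted into the search bar of the X-Ray console, which is closer to how the team describes using it.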
Implementation of Amazon CloudWatch
The development team used Amazon CloudWatch to create a centralized dashboard for viewing metrics and to create alarms that notify the team of high response times in their ECS cloud application. Within a couple of days, they had a basic dashboard with response-time metrics for their ECS cloud application. The development team then created log metric filters for HTTP status codes (200, 3XX, 4XX, and 5XX) and execution counts so that this information could be displayed in graphs on their dashboard. Next, they added alarms based on response times and HTTP status codes. In just a few weeks, the development team had a fully functioning CloudWatch dashboard that displayed meaningful graphs and alarms.
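The metric filters and alarms described above might be set up along these lines. This is a minimal sketch, not Vanguard's configuration: the log group `/ecs/example-app`, the namespace `ExampleApp`, the metric and alarm names, the alarm threshold, and the assumption that the application writes JSON log lines with a numeric `status` field are all hypothetical. The parameter dictionaries are built by plain functions, and only `apply_monitoring` calls AWS through `boto3`.

```python
# Hypothetical names: the post does not give the log group, namespace, or log format.
LOG_GROUP = "/ecs/example-app"
NAMESPACE = "ExampleApp"

# Status-code groups named in the post, as inclusive ranges.
STATUS_CLASSES = {
    "200": (200, 200),
    "3XX": (300, 399),
    "4XX": (400, 499),
    "5XX": (500, 599),
}


def metric_filter_params(status_class: str) -> dict:
    """Arguments for the CloudWatch Logs put_metric_filter call.

    Assumes each log line is JSON with a numeric "status" field.
    """
    low, high = STATUS_CLASSES[status_class]
    return {
        "logGroupName": LOG_GROUP,
        "filterName": f"http-{status_class.lower()}",
        "filterPattern": f"{{ ($.status >= {low}) && ($.status <= {high}) }}",
        "metricTransformations": [
            {
                "metricName": f"Http{status_class}Count",
                "metricNamespace": NAMESPACE,
                "metricValue": "1",   # count one per matching log line
                "defaultValue": 0.0,  # report zero when nothing matches
            }
        ],
    }


def five_xx_alarm_params(threshold: int = 5) -> dict:
    """Arguments for the CloudWatch put_metric_alarm call (5XX count per minute)."""
    return {
        "AlarmName": "example-app-5xx",
        "Namespace": NAMESPACE,
        "MetricName": "Http5XXCount",
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


def apply_monitoring() -> None:
    """Create the filters and the alarm. Needs boto3 and AWS credentials."""
    import boto3

    logs = boto3.client("logs")
    cloudwatch = boto3.client("cloudwatch")
    for status_class in STATUS_CLASSES:
        logs.put_metric_filter(**metric_filter_params(status_class))
    cloudwatch.put_metric_alarm(**five_xx_alarm_params())
```

To notify an on-call developer, an `AlarmActions` list containing an Amazon SNS topic ARN would typically be added to the alarm parameters; it is left out here because the post does not describe the team's notification path.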
The development team recently decided to build a second CloudWatch dashboard to monitor their service level indicator (SLI) and service level objective (SLO) metrics. CloudWatch can display percentiles (such as P50 and P95) for ECS cloud applications. The development team used the built-in percentiles to quickly set up graphs where they can monitor their SLI/SLO metrics.
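A percentile graph like the ones described above can be expressed as a CloudWatch dashboard body. The sketch below assumes a hypothetical custom metric `ResponseTime` in a hypothetical namespace `ExampleApp`, and a placeholder Region and dashboard name; none of these come from the post. It builds the dashboard JSON with one widget that plots P50 and P95 of the same metric, and `publish_dashboard` shows how that body could be sent with `boto3`.

```python
import json

# Hypothetical values: the post does not name the metric, namespace, or Region.
NAMESPACE = "ExampleApp"
METRIC_NAME = "ResponseTime"
REGION = "us-east-1"


def sli_dashboard_body() -> str:
    """Return a dashboard body with one widget graphing P50 and P95 latency."""
    widget = {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
            "title": "Response time percentiles",
            "view": "timeSeries",
            "region": REGION,
            "period": 60,
            # Each entry is [namespace, metric name, rendering options].
            "metrics": [
                [NAMESPACE, METRIC_NAME, {"stat": "p50", "label": "P50"}],
                [NAMESPACE, METRIC_NAME, {"stat": "p95", "label": "P95"}],
            ],
        },
    }
    return json.dumps({"widgets": [widget]})


def publish_dashboard(name: str = "example-app-sli") -> None:
    """Create or update the dashboard. Needs boto3 and AWS credentials."""
    import boto3

    boto3.client("cloudwatch").put_dashboard(
        DashboardName=name,
        DashboardBody=sli_dashboard_body(),
    )
```

Plotting both percentiles in one widget makes it easy to compare typical latency (P50) with tail latency (P95) against an SLO threshold.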
Benefits
The development team is now notified by Amazon CloudWatch alarms in near real time, which makes it possible for them to respond to production issues more quickly. Compared to the third-party tool, Amazon CloudWatch offers more flexibility and makes it possible for the development team to fine-tune their alarms. Now that they are using AWS X-Ray, anyone on the team can investigate traces when, for example, the ECS cloud application isn’t returning data or a particular input is taking a long time to process.
Summary
The development team at Vanguard wanted to improve the observability of their ECS cloud application. Instead of continuing to use third-party tools that introduced latency and unreliability into their on-call process, the development team chose AWS X-Ray and Amazon CloudWatch. Within weeks, the development team was getting near real-time alerts when their ECS cloud application had an operational issue.
If you are interested in learning more about Vanguard’s implementation, watch their 2020 re:Invent session.