AWS Cloud Operations Blog

How Audible used Amazon CloudWatch cross-account observability to resolve severity tickets faster

This blog was co-written with Audible’s Apurva Jatakia, Kaushik S., and David Etler.

Audible’s consumption services platform serves thousands of requests every second, and each incoming request is served by a distributed set of microservices owned by different teams. An Audible team, in charge of a platform called Stagg, is responsible for five separate microservices. The Audible Stagg team supports the player and library projects and powers experiences on Audible’s mobile app and website. Each AWS service generates its own logs and metrics, and until last year, the Audible Stagg team did not have a holistic view of how requests were flowing through the service chain.

The absence of a unified observability platform meant that on-call engineers had no tool to trace a request, its response, and any exceptions across the services, which increased the time spent identifying and resolving customer-facing tickets. Customers who have hundreds of AWS accounts can benefit from learning how Audible correlated metrics, logs, and traces to create a unified observability solution that provides a single-pane-of-glass view across a set of microservices.

In this blog post, we show how a team at Audible implemented a unified observability solution using Amazon CloudWatch cross-account observability. The solution made the team more efficient, cutting debugging time by 60%, and improved developer satisfaction.

Challenges Audible Stagg was facing

  • Inability for on-call engineers to link metrics/alarms to any associated request logs
  • Inability to trace a request across services
  • Inability to correlate interactions in the app with service traces
  • Lack of holistic view of AWS services across Audible Stagg platform

When triaging customer-facing issues, Audible had to transfer tickets from one team to another until the root cause was found. Even within a set of microservices owned by a single team, the absence of a tracing solution meant engineers had to spend a significant amount of time analyzing the logs of each service in the chain to build a holistic view.

Additionally, it was difficult to correlate issues reported by customers with the associated service requests. This resulted in Customer Service personnel having to request additional information from the customer to gather enough details to be able to reproduce the issue, as well as a significant amount of developer time spent on reproducing the issue.

Solution requirements

  • Ability to trace a request across all Audible Stagg services, as well as the Audible app and other external clients
  • Ability to query and drill into a trace to see AWS services associated with an application that failed and the root cause exception
  • Provide an easy way for on-call engineers to link traces to logs, when further analysis is required
  • Provide a single-pane-of-glass view for AWS services associated with an application spanning multiple AWS accounts
  • Support a variety of service frameworks (Java, Node.JS) and compute platforms (EC2, ECS, and Lambda)

Figure 1: Audible Stagg service overview

Key Decisions

Amazon CloudWatch cross-account observability

Audible Stagg started using Amazon CloudWatch cross-account observability for cross-account tracing, logging, and metrics. In this model, multiple source accounts feed into one monitoring account. A monitoring account is a central AWS account that can view and interact with observability data generated across other accounts. A source account is an individual AWS account that generates observability data for the resources that reside in it. A single monitoring account can be linked to as many as 100,000 source accounts; the current service quotas are listed in the CloudWatch documentation.

Source accounts share their data with the monitoring account. Cross-account tracing aggregates traces from multiple source accounts into a single monitoring account. This enables a complete view of requests that travel across multiple accounts. You can view cross-account traces in the AWS X-Ray service map and traces pages within the CloudWatch console.


Figure 2: Cross-account observability diagram
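The link between source accounts and the monitoring account is governed by a policy on the monitoring account's sink. The following is a minimal sketch, not Audible's actual configuration, of how such a sink policy document could be assembled. The account IDs are placeholders and the helper name `build_sink_policy` is hypothetical; the policy shape follows the documented pattern of allowing `oam:CreateLink` and `oam:UpdateLink` for a limited set of resource types.

```python
import json

# Resource types a source account can share with a monitoring account.
SHAREABLE_TYPES = (
    "AWS::CloudWatch::Metric",
    "AWS::Logs::LogGroup",
    "AWS::XRay::Trace",
)


def build_sink_policy(source_account_ids, resource_types=SHAREABLE_TYPES):
    """Return a sink policy that lets the given source accounts link to the sink.

    Hypothetical helper for illustration; it only builds the JSON document.
    """
    if not source_account_ids:
        raise ValueError("at least one source account ID is required")
    for account_id in source_account_ids:
        if not (len(account_id) == 12 and account_id.isdigit()):
            raise ValueError(f"not a 12-digit AWS account ID: {account_id!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(source_account_ids)},
                "Action": ["oam:CreateLink", "oam:UpdateLink"],
                "Resource": "*",
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "oam:ResourceTypes": list(resource_types)
                    }
                },
            }
        ],
    }


# Placeholder account IDs, not real accounts.
policy = build_sink_policy(["111111111111", "222222222222"])
print(json.dumps(policy, indent=2))
```

A document like this would be attached to the sink in the monitoring account (for example through the Observability Access Manager `PutSinkPolicy` operation), after which each source account creates a link to the sink's ARN. For large fleets, a condition on the AWS Organizations ID is usually more practical than listing accounts one by one.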

AWS X-Ray

AWS X-Ray is a distributed tracing solution that provides an easy way to trace requests across the service chain. Audible Stagg onboarded their services to X-Ray, which allowed them to trace a request across services. Because X-Ray requires little infrastructure to maintain, services simply needed to onboard to the X-Ray agent to get started.

High level summary of how X-Ray works:

  • X-Ray links (or “correlates”) requests together by using a “trace ID”. The trace ID is generated by the first service to handle a request and is propagated to downstream services.
  • Once the X-Ray agent is set up and has the permissions it needs to publish X-Ray traces, you can use the X-Ray console to query and select specific traces to debug the issue.

Figure 3: AWS X-Ray overview diagram
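To make the trace ID propagation described above concrete, here is a small sketch using only the Python standard library. It generates an ID in the documented X-Ray format (`1-`, eight hex digits of epoch time, then 24 random hex digits) and builds and parses the `X-Amzn-Trace-Id` header that carries it downstream. The function names are illustrative assumptions; in a real service the X-Ray SDK does this work for you.

```python
import os
import re
import time

TRACE_ID_RE = re.compile(r"^1-[0-9a-f]{8}-[0-9a-f]{24}$")


def new_trace_id(now=None):
    """Create a trace ID: version 1, epoch seconds in hex, 96 random bits."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_segment_id():
    """Create a 64-bit segment ID as 16 hex digits."""
    return os.urandom(8).hex()


def build_trace_header(trace_id, parent_id=None, sampled=True):
    """Build the value of the X-Amzn-Trace-Id header for a downstream call."""
    parts = [f"Root={trace_id}"]
    if parent_id:
        parts.append(f"Parent={parent_id}")
    parts.append(f"Sampled={1 if sampled else 0}")
    return ";".join(parts)


def parse_trace_header(value):
    """Split a header value such as 'Root=...;Parent=...;Sampled=1' into a dict."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


# The first service in the chain creates the trace ID ...
trace_id = new_trace_id()
first_segment = new_segment_id()
header = build_trace_header(trace_id, parent_id=first_segment)

# ... and a downstream service reuses the same Root instead of minting a new one.
incoming = parse_trace_header(header)
print(header)
print(incoming["Root"] == trace_id)
```

Because every service records its work under the same `Root` value, X-Ray can stitch the segments from different services, and different accounts, into a single trace.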

Roadmap to solution

High-level solution architecture


Figure 4: High-level solution architecture

Features of the solution

With the solution, you can create a centralized AWS observability account for collecting the logs, traces, and metrics, and get a global view of the data. Here are the main features of the solution.

  • Using AWS X-Ray for tracing: The solution deploys AWS X-Ray agents alongside the services across AWS accounts. AWS X-Ray supports applications running on EC2, ECS, Lambda, Amazon SQS, Amazon SNS, and Elastic Beanstalk. In addition, the X-Ray SDK automatically captures metadata for API calls made to AWS services using the AWS SDK. X-Ray tracks requests flowing through applications or services across multiple Regions. X-Ray data is stored in the Region where it is processed, but with enough information to enable client applications to combine the data and provide a global view of traces. On EC2 and ECS, the X-Ray agent can assume a role to publish data into an account different from the one in which it is running. This enables publishing data from various components of the application into a central account.
  • Using CloudWatch for log collection: The CloudWatch Logs Agent will send log data to Amazon CloudWatch in each service’s AWS account. Cross-account logging enables us to view all these logs in our centralized monitoring account.
  • Using CloudWatch for metrics monitoring: Amazon CloudWatch allows you to monitor AWS cloud resources and the applications you run on AWS. Metrics are provided automatically for a number of AWS products and services, including Amazon EC2 instances, EBS volumes, Elastic Load Balancers, Auto Scaling groups, EMR job flows, RDS DB instances, DynamoDB tables, ElastiCache clusters, Redshift clusters, OpsWorks stacks, Route 53 health checks, SNS topics, SQS queues, SWF workflows, and Storage Gateways.
  • Amazon CloudWatch ServiceLens: You can get a unified view of X-Ray traces and CloudWatch metrics and logs using Amazon CloudWatch ServiceLens, which helps you visualize and analyze the health, performance, and availability of your applications in a single place. CloudWatch ServiceLens ties together CloudWatch metrics and logs as well as traces from AWS X-Ray to give you a complete view of your applications and their dependencies. This enables you to quickly pinpoint performance bottlenecks, isolate root causes of application issues, and determine which users are impacted. CloudWatch ServiceLens gives you visibility into your applications in three main areas: infrastructure monitoring (using metrics and logs to understand the resources supporting your applications), transaction monitoring (using traces to understand dependencies between your resources), and end user monitoring (using canaries to monitor your endpoints and notify you when your end user experience has degraded).
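As an illustration of what the X-Ray agent relays, the sketch below builds a minimal segment document and frames it the way the agent expects to receive it over UDP (by default on `127.0.0.1:2000`): a one-line JSON header followed by the segment JSON. The service name, annotation key, and helper names are hypothetical, and in practice the X-Ray SDKs build and send these documents for you.

```python
import json
import os
import socket
import time

# Header line the X-Ray agent expects before each JSON segment document.
DAEMON_HEADER = {"format": "json", "version": 1}


def build_segment(name, trace_id, start, end, annotations=None):
    """Build a minimal completed segment document for one unit of work."""
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),  # 64-bit segment ID as 16 hex digits
        "trace_id": trace_id,
        "start_time": start,  # epoch seconds, fractional part allowed
        "end_time": end,
    }
    if annotations:
        # Annotations are indexed key-value pairs that traces can be filtered on.
        segment["annotations"] = dict(annotations)
    return segment


def daemon_payload(segment):
    """Frame a segment as header line + newline + segment JSON."""
    return (json.dumps(DAEMON_HEADER) + "\n" + json.dumps(segment)).encode("utf-8")


def send_to_daemon(payload, host="127.0.0.1", port=2000):
    """Send one framed segment to a locally running X-Ray agent over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


started = time.time()
segment = build_segment(
    name="stagg-example-service",  # hypothetical service name
    trace_id="1-58406520-a006649127e371903a2de979",  # sample ID in X-Ray format
    start=started,
    end=started + 0.25,
    annotations={"ticket_id": "EXAMPLE-123"},  # hypothetical annotation
)
payload = daemon_payload(segment)
print(payload.decode("utf-8"))
```

`send_to_daemon(payload)` is defined but not called here, since it only makes sense on a host where the agent is running. The agent batches what it receives and publishes it to X-Ray using whatever credentials, or assumed role, it has been configured with.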

Outcome

Audible Stagg was able to leverage all X-Ray features including the service map and traces. They were able to access all these features from a single monitoring account, even though the underlying services are spread across many AWS accounts.

Service map showing holistic view of Audible Stagg’s services in the monitoring account

Figure 5: Dashboard that shows service map

Drilling into a node on the service map

Figure 6: Dashboard showing drilling into a node

Metrics and trace correlation

Figure 7: Dashboard showing metrics and traces correlation

Log insights

For any trace, Audible Stagg was able to easily pull up the corresponding logs for that trace, from any of the services. They were able to access these logs from the shared monitoring account.

Figure 8: Dashboard that shows log insights
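One way to pull up the logs for a trace by hand is a CloudWatch Logs Insights query that filters on the trace ID. The sketch below only builds the query string. It assumes, and this is an assumption rather than something stated about Audible's services, that each service writes the X-Ray trace ID into its log messages. The helper name `trace_log_query` is hypothetical.

```python
import re

TRACE_ID_RE = re.compile(r"^1-[0-9a-f]{8}-[0-9a-f]{24}$")


def trace_log_query(trace_id, limit=100):
    """Build a Logs Insights query returning log lines that mention a trace ID."""
    if not TRACE_ID_RE.match(trace_id):
        raise ValueError(f"not an X-Ray trace ID: {trace_id!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        "fields @timestamp, @log, @message\n"
        f'| filter @message like "{trace_id}"\n'
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )


print(trace_log_query("1-58406520-a006649127e371903a2de979"))
```

Run from the monitoring account against the shared log groups, a query like this returns matching lines from every service in time order, and the `@log` field shows which account and log group each line came from.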

Where Audible Stagg is today

With the implementation of cross-account observability, Audible’s Stagg team can now log into one centralized account to identify an issue. This has saved them 60% of the debugging time that was previously spent on triaging high-severity issues. They are now able to access logs, metrics, and traces for all their services in a centralized account.

Another benefit the team has seen is increased developer satisfaction. Using X-Ray in conjunction with CloudWatch has allowed developers to tackle issues more quickly and with higher confidence. X-Ray allows the Stagg developers to query their services under a single AWS account, which has cut down the time and effort spent keeping multiple log windows open or constantly signing in and out of different services and AWS accounts.

The observability solution met the Audible Stagg team’s needs, and they plan to help onboard other Audible teams to the solution. They also plan to adopt further X-Ray features, such as custom annotations, and to expand their use of X-Ray to other use cases, such as QA and customer care bug reporting.

Conclusion

In this post, we saw how Audible implemented their unified observability solution which helped provide a rich cross-account observability and discovery experience for their metrics, logs, and traces. Cross-account functionality is integrated with AWS Organizations to help you efficiently build your cross-account dashboards. The Audible team has been continuously working towards including more services and accounts to use with the cross-account observability solution.


About the Authors:

Tulip Gupta

Tulip is a Senior Solutions Architect at Amazon Web Services. She works with Amazon media and entertainment customers, including Audible, Prime Video, and Amazon Music to design, build, and deploy technology solutions on AWS. She assists customers in adopting best practices while deploying solutions in AWS.

Kaushik S

Kaushik is a software development engineer at Audible. He designs and architects highly scalable and maintainable software serving millions of customers. Currently, he is working on exposing a server-driven UI that Audible apps interact with to get view model data constructed by aggregating data from various data sources.

Apurva Jatakia

Software engineering leader experienced in managing distributed technical teams in the development and maintenance of complex software products and infrastructure. SAFe Certified Scaled Agilist experienced in standing up multiple agile teams and streamlining work across them. Capable of building software platforms with improved functionality and productivity, consistently meeting critical operational requirements.

David Etler

David is a Software Dev Engineer III at Audible. He works as a tech lead on Audible’s “Stagg” team. Stagg is a server-driven UI and headless CMS platform which powers many experiences on Audible’s mobile apps, and some experiences on Audible’s website.