AWS Cloud Operations Blog

Category: Learning Levels

Automate the sending of AWS Audit Manager assessment reports

Implementing compliance at scale is not an easy endeavor for customers as they move their workloads to the AWS cloud. Because of the challenges posed by cloud environments, such as the more ephemeral nature of resources and the dynamic landscape of the cloud, automation is paramount to success. At an enterprise scale the […]

Operationalizing CloudWatch Anomaly Detection

In this post, you’ll explore Amazon CloudWatch anomaly detection and set it up using the AWS Console, the AWS Command Line Interface (AWS CLI), and AWS CloudFormation. We also review some best practices when using CloudWatch anomaly detection. CloudWatch alarms allow you to watch CloudWatch metrics and receive notifications when the metrics fall outside of […]

Enhancing DevOps Practices with Amazon CloudWatch Application Performance Monitoring

Organizations seeking to deliver meaningful technology services at a higher velocity to their customers have incorporated application performance monitoring (APM) into their DevOps operating models. Software development and IT operations teams that have traditionally worked in their own silos now strive to work in concert to increase organizational agility. The transformation path is unique for […]

Amazon Managed Service for Prometheus adds support for 200M active metrics

Today, Amazon Web Services (AWS) is pleased to announce support for 200M active series per workspace for Amazon Managed Service for Prometheus. Amazon Managed Service for Prometheus is a fully managed Prometheus-compatible monitoring service that monitors and alarms on operational metrics at scale. It does this without you having to manage the underlying infrastructure required […]

Build Cloud Operations Skills Using The New Getting Started with AWS Systems Manager Training

Are you looking for a solution that would help you simplify your operational tasks? Do you want to automate regular operational activities? Would you like to simplify patching your running instances? Those topics and more are covered in our Getting Started with AWS Systems Manager training available today. AWS Systems Manager centralizes operational data from multiple […]

Visualizing Amazon CloudWatch Costs – Part 2 – Where does the data come from?

In part 1 of this series we explored an Amazon CloudWatch dashboard which provides a real-time view of some of the typical main contributors to CloudWatch costs. In this second post, we’ll look at how the CloudWatch dashboard widgets were created so that you can learn how to create something similar, or modify the widgets […]

Visualizing Amazon CloudWatch Costs – Part 1

Amazon CloudWatch monitors your AWS resources and the applications you run on AWS in real time. You can use CloudWatch to collect metrics, logs, and traces, set up alarms, create synthetic checks, and more. The information you collect lets you observe, validate, and alert on areas of interest to you. In this two-part post, we’ll explore a […]

Avoid patching failures due to low disk space with AWS Systems Manager Automation and CloudWatch alarms.

Every organization has to keep its fleet up to date with patches while ensuring that business operations and workloads are not affected by patching. One of the challenges for operations teams is to patch at scale without affecting production software. The most common reasons workload patching fails are insufficient disk space, a spike in […]

Improving Mergers & Acquisitions IT Integration with AWS Application Discovery Service

The purpose of this post is to provide high-level guidance for Mergers & Acquisitions (M&A) stakeholders on how to incorporate AWS Application Discovery Service as part of integration planning and integration data discovery. This post is part of a series of technical content on how M&A integration teams can utilize Amazon Web Services (AWS) to […]

The Importance of Key Performance Indicators (KPIs) for Large-Scale Cloud Migrations

Key performance indicators (KPIs) are quantifiable measurements that help you understand how well you’re performing in specific areas. For example, from an incident management perspective, you may measure the mean time to recovery to understand how long it takes to recover following an incident. Large-scale enterprise migration programs (such as vacating a data center or […]