AWS Cloud Operations Blog
Tag: Management Tools
Monitor EBS Detailed Performance Statistics with Amazon Managed Service for Prometheus
Today we are excited to announce that you can now easily ingest Amazon EBS detailed performance statistics from your Amazon Elastic Kubernetes Service (Amazon EKS) workloads into an Amazon Managed Service for Prometheus workspace. We recently announced the availability of EBS detailed performance statistics, which give you real-time visibility into the performance of your EBS […]
Sign-in to AWS Console Mobile Application with an AWS Access Portal or third-party IdP URL
AWS customers rely on the AWS Console Mobile Application to monitor, manage, and receive notifications to stay informed about their AWS resources while away from their desktop devices. Customers who use single sign-on (SSO) can face a unique set of challenges while signing in to the AWS Console Mobile Application. While SSO can offer enhanced security and […]
Enhanced dashboard, latency suggestions in Amazon CloudWatch Internet Monitor
Amazon CloudWatch Internet Monitor provides near-continuous internet measurements for your internet traffic, including availability and performance metrics, tailored to your specific workload footprint on AWS. With Internet Monitor, you can get insights into average internet performance metrics over time, as well as get alerts for issues (health events). You’re notified about events that impact your end […]
Bootstrap your chaos engineering journey with AWS Fault Injection Service Scenarios Library
Ensuring the reliability and resilience of applications is crucial for maintaining business continuity, delivering a superior customer experience, and staying compliant with industry regulations. As defined in the AWS Well-Architected Framework Reliability Pillar, testing reliability plays an important role in ensuring reliability. Chaos engineering is a powerful way to not only test how your systems […]
Managing access to AWS accounts from Microsoft Teams and Slack at scale using AWS Organizations and AWS Chatbot
Customers use chat collaboration applications like Microsoft Teams and Slack to collaborate and manage their AWS applications. AWS Chatbot is a ChatOps service that enables customers to monitor, troubleshoot issues, and manage AWS applications from chat channels. AWS Chatbot provides autonomy and customizability to DevOps teams operating their AWS environments on the go from chat […]
Accelerating migrations and IT Tasks for DKB using AWS Systems Manager
Deutsche Kreditbank AG (DKB) is one of Germany’s largest direct banks, with over five million customers. In 2023, DKB migrated their back-office IT infrastructure to Amazon Web Services (AWS). This included their diverse infrastructure, backup, networking, and both Windows and Linux servers, while managing risks like downtime, data integrity, and security vulnerabilities. Customers in regulated industries […]
Serverless Governance of Software Deployed with AWS Service Catalog
AWS Service Catalog (Service Catalog) is a powerful tool that empowers organizations to manage and govern approved services and resources. It significantly benefits platform engineering by standardizing environments, accelerating service delivery, and enhancing security. With its automated provisioning and resource management, Service Catalog supports infrastructure as code, enabling scalable, reliable deployments. Platform engineering teams are […]
Elevating Your AWS Observability: Unlocking the Power of Amazon CloudWatch Alarms
Organizations commonly leverage AWS services to enhance the observability and operational excellence of their workloads. However, it is often unclear which actions teams should take when observability metrics are delivered to them, and it can be difficult to understand which metrics need remediation and which ones are simply noise. For example, if an […]
Understanding AWS High Availability and Replication for vSphere Administrators
vSphere HA is a fundamental and frequently used feature of vSphere. If any of several failure scenarios occur, it restarts a virtual machine. The failure scenarios range from VM or host crashes to unresponsive hosts (for example, due to network isolation or outage). Translating vSphere High Availability (HA) to the public cloud can be […]
Gain operational insights for NVIDIA GPU workloads using Amazon CloudWatch Container Insights
As machine learning models grow more advanced, they require extensive computing power to train efficiently. Many organizations are turning to GPU-accelerated Kubernetes clusters for both model training and online inference. However, properly monitoring GPU usage is critical for machine learning engineers and cluster administrators to understand model performance and to optimize infrastructure utilization. Without visibility […]