AWS Cloud Operations Blog
Introducing vended metrics for Amazon Managed Service for Prometheus
Today, I’m happy to announce that Amazon Managed Service for Prometheus now vends usage metrics to Amazon CloudWatch. These metrics can be used to help you gain better visibility into your Amazon Managed Service for Prometheus workspace. Let’s dive in to see how you could use these new Prometheus usage metrics in CloudWatch.
I’ve set up a new workload consisting of two Amazon EC2 instances, each running Prometheus and remote writing metrics to an Amazon Managed Service for Prometheus workspace. Furthermore, within my workspace, I’ve set up some rules to alert on high or low CPU utilization. The alerting rules I’m using look like this:
groups:
  - name: example
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 60
        for: 5m
        labels:
          severity: warning
          event_type: scale_up
        annotations:
          summary: Host high CPU load (instance {{ $labels.instance }})
          description: "CPU load is > 60%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: HostLowCpuLoad
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) < 30
        for: 5m
        labels:
          severity: warning
          event_type: scale_down
        annotations:
          summary: Host low CPU load (instance {{ $labels.instance }})
          description: "CPU load is < 30%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
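To make the arithmetic in these two expressions concrete, here is a minimal Python sketch of the same calculation. The function names and the sample idle rates are hypothetical and only for illustration. Because avg() is used without a by clause, the expression produces a single value across all CPUs and instances, which the sketch mirrors by averaging one flat list. The `for: 5m` hold period is not modeled.

```python
def cpu_utilization_percent(idle_rates):
    """Mirror of: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100).

    idle_rates holds one per-second idle rate (0.0 to 1.0) per CPU series.
    """
    if not idle_rates:
        raise ValueError("need at least one idle rate")
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


def matching_alert(idle_rates, high=60, low=30):
    """Return the name of the rule whose threshold is crossed, or None."""
    utilization = cpu_utilization_percent(idle_rates)
    if utilization > high:
        return "HostHighCpuLoad"
    if utilization < low:
        return "HostLowCpuLoad"
    return None


if __name__ == "__main__":
    # Hypothetical idle rates for a busy host and a quiet host.
    print(matching_alert([0.2, 0.3]))
    print(matching_alert([0.9, 0.8]))
```

Utilization between 30% and 60% matches neither rule, so no scale_up or scale_down event is raised in that band.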
I’ve also configured alert manager to send the alerts to an Amazon Simple Notification Service (Amazon SNS) topic. The alert manager configuration looks like this:
alertmanager_config: |
  route:
    receiver: default_receiver
    repeat_interval: 5m
  receivers:
    - name: default_receiver
      sns_configs:
        - topic_arn: <arn of SNS topic goes here>
          send_resolved: false
          sigv4:
            region: us-east-2
          message: |
            alert_type: {{ .CommonLabels.alertname }}
            event_type: {{ .CommonLabels.event_type }}
Looking at the CloudWatch Usage metric namespace, I select IngestionRate and ActiveSeries to validate and monitor usage against service quotas, as shown in the following figure. If I see either of these metrics approaching my account’s quota, I can request a quota increase via the AWS support console.
Figure 1: CloudWatch metrics for IngestionRate and ActiveSeries for an Amazon Managed Service for Prometheus workspace.
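The quota comparison described above can be sketched in a few lines of Python. This is an illustration only: the datapoints and the quota value are made-up numbers, the 80% warning level is an assumption rather than an AWS recommendation, and the function names are hypothetical. In practice the datapoints would come from CloudWatch and the quota from your account's current limits.

```python
def quota_utilization_percent(datapoints, quota):
    """Return the peak datapoint as a percentage of the quota."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    if not datapoints:
        raise ValueError("need at least one datapoint")
    return max(datapoints) * 100 / quota


def approaching_quota(datapoints, quota, warn_at_percent=80.0):
    """True when peak usage has reached the (assumed) warning level."""
    return quota_utilization_percent(datapoints, quota) >= warn_at_percent


if __name__ == "__main__":
    # Hypothetical ActiveSeries samples and quota, for illustration only.
    active_series = [120_000, 410_000, 395_000]
    print(quota_utilization_percent(active_series, 500_000))
    print(approaching_quota(active_series, 500_000))
```

Using the peak rather than the average is a deliberate choice here, since a short burst above the quota is what triggers throttling.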
I could also review the DiscardedSamples metric in the AWS/Prometheus namespace. Seeing non-zero values in the DiscardedSamples metric may indicate that the workload is being throttled due to an Amazon Managed Service for Prometheus service quota.
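Rather than checking this metric by hand, you could alarm on it. The following sketch builds the keyword arguments for the CloudWatch PutMetricAlarm API (put_metric_alarm in boto3) so that any non-zero sum of DiscardedSamples triggers the alarm. The helper name, the alarm name, and the dimension shown in the usage are assumptions; use the dimensions that the CloudWatch console lists for the metric in your account. The sketch only builds the parameters and makes no AWS call.

```python
def discarded_samples_alarm_params(alarm_name, dimensions, period_seconds=300):
    """Build PutMetricAlarm parameters that fire on any discarded sample.

    dimensions is a list of {"Name": ..., "Value": ...} dicts copied from
    the metric as it appears in CloudWatch (not verified here).
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Prometheus",
        "MetricName": "DiscardedSamples",
        "Dimensions": list(dimensions),
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No datapoints means nothing was discarded, so stay in OK.
        "TreatMissingData": "notBreaching",
    }


if __name__ == "__main__":
    # Hypothetical dimension name and workspace ID.
    params = discarded_samples_alarm_params(
        "amp-discarded-samples",
        [{"Name": "Workspace", "Value": "ws-EXAMPLE"}],
    )
    print(params)
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.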
For the next step, I’ll review metrics to make sure that Amazon Managed Service for Prometheus rules and alerts are working properly. You can review RuleEvaluationFailures and RuleGroupIterationsMissed in the AWS/Prometheus namespace to see if there are any problems with the rules that you have created. After reviewing those metrics, I look at the AlertManagerAlertsReceived and AlertManagerNotificationsFailed metrics in the same namespace.
I noticed that my workspace didn’t seem to be sending alerts. Sure enough, when looking at the AlertManagerAlertsReceived and AlertManagerNotificationsFailed metrics, I can see that alert manager has received alerts (the blue line), but it has had problems processing the alerts (the red line), as shown in the following figure.
Figure 2: CloudWatch metrics for AlertManagerAlertsReceived and AlertManagerNotificationsFailed for an Amazon Managed Service for Prometheus workspace.
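The reasoning applied to the two lines in the figure can be written down as a small check. This is a hypothetical helper, not part of any AWS SDK, and the sample values are invented. It takes the datapoints of both metrics over the same time window and reports which of the three situations applies.

```python
def diagnose_alerting(alerts_received, notifications_failed):
    """Classify alert manager health from two metric series.

    Both arguments are lists of datapoints (counts per period) for
    AlertManagerAlertsReceived and AlertManagerNotificationsFailed.
    """
    if sum(alerts_received) == 0:
        # Nothing reached alert manager; look at the rules instead.
        return "no alerts received"
    if sum(notifications_failed) > 0:
        # Alerts arrived but delivery to the receiver is failing.
        return "notifications failing"
    return "healthy"


if __name__ == "__main__":
    # Hypothetical datapoints resembling the situation in the figure.
    print(diagnose_alerting([2, 2, 2], [1, 1, 1]))
```

The two metrics count different things (alerts versus notification attempts), so the sketch avoids computing a ratio between them and only checks for non-zero failures.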
In reviewing the alert manager definition for the workspace, I discovered that the SNS topic didn’t allow the workspace to publish messages. After I fixed the permission issue by granting the Amazon Managed Service for Prometheus service the sns:Publish and sns:GetTopicAttributes permissions on the SNS topic, the AlertManagerNotificationsFailed metric dropped to zero. This indicates that alerts are now successfully being processed.
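For reference, a topic access policy that grants these two permissions could look like the output of the following sketch. The helper name, the Sid, and all ARNs and account IDs in the usage are placeholders. The two Condition entries, which limit access to a single workspace and account, are an assumption added here as a common scoping practice; the fix described above only requires the two actions for the aps.amazonaws.com service principal.

```python
import json


def sns_access_policy(topic_arn, workspace_arn, account_id):
    """Build an SNS topic policy letting the workspace publish alerts."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAmpAlertManagerPublish",  # hypothetical Sid
                "Effect": "Allow",
                "Principal": {"Service": "aps.amazonaws.com"},
                "Action": ["sns:Publish", "sns:GetTopicAttributes"],
                "Resource": topic_arn,
                # Assumed scoping: only this workspace in this account.
                "Condition": {
                    "ArnEquals": {"aws:SourceArn": workspace_arn},
                    "StringEquals": {"AWS:SourceAccount": account_id},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    policy = sns_access_policy(
        "arn:aws:sns:us-east-2:111122223333:example-topic",
        "arn:aws:aps:us-east-2:111122223333:workspace/ws-EXAMPLE",
        "111122223333",
    )
    print(json.dumps(policy, indent=2))
```

The resulting JSON string is what you would set as the topic's Policy attribute, for example through the SNS console or the SetTopicAttributes API.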
In this blog post, I demonstrated the use of vended metrics for Amazon Managed Service for Prometheus. I showed how you can monitor your workspace usage against service quotas, and how these metrics helped me identify an issue in an alert manager configuration. Vended metrics are provided free of charge.
You can use these metrics to validate and monitor your usage against quotas, and you can validate that rules and alerts are operating the way you’re expecting. As a next step, review the metrics in the CloudWatch console to ensure that your monitoring stack is working correctly.