AWS Cloud Operations Blog
Using Amazon Managed Service for Prometheus Alert Manager to receive alerts with PagerDuty
Many customers using Amazon Managed Service for Prometheus are transitioning from self-managed Prometheus systems to the fully managed service. As part of this transition, Amazon Managed Service for Prometheus users need a way to migrate their existing Prometheus and Alert Manager configurations. PagerDuty is a receiver that many customers use to route alerts to their internal teams. However, Amazon Managed Service for Prometheus Alert Manager only supports an Amazon Simple Notification Service (Amazon SNS) receiver and cannot send to PagerDuty directly. This guide walks through connecting Amazon Managed Service for Prometheus Alert Manager to Amazon SNS and routing the messages on to PagerDuty, mimicking the controls and flexibility that the Alert Manager PagerDuty receiver provides today.
Component Overview
The Amazon Managed Service for Prometheus Alert Manager handles alerts sent by client applications, including the Amazon Managed Service for Prometheus server, and its configuration determines how each alert is routed. I want to connect the Amazon Managed Service for Prometheus Alert Manager with PagerDuty so that it mimics the native PagerDuty receiver in OSS Alert Manager. To do this, I need to configure the Amazon Managed Service for Prometheus Alert Manager definition, route output messages to Amazon SNS, and forward the messages from SNS to PagerDuty via an AWS Lambda function that posts to the PagerDuty API endpoint.
The following example uses a Prometheus server to monitor a node exporter job and remote writes the collected metrics to my Amazon Managed Service for Prometheus workspace. See Figure 1.
Figure 1: Architecture to configure Amazon Managed Service for Prometheus Alert Manager with PagerDuty
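For reference, here is a minimal sketch of the Prometheus server configuration this walkthrough assumes: a scrape job named node (matching the alerting rule in the next section) and a sigv4-signed remote write to the workspace endpoint. The target address and placeholders are illustrative, and sigv4 support requires a recent Prometheus release.
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100'] # node exporter default port
remote_write:
  - url: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace_id>/api/v1/remote_write
    sigv4:
      region: <region>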
Rules management
Once an Amazon Managed Service for Prometheus workspace has been created, you must configure the workspace with one or more rules. Amazon Managed Service for Prometheus supports both alerting and recording rules. As a simple example, I have created an alerting rule that fires when a Prometheus node exporter job is down.
groups:
  - name: example
    rules:
    - alert: DemoAlert
      expr: up{job="node"} == 0
      for: 1m
      annotations:
        summary: "Prometheus job missing (instance {{ $labels.instance }})"
        description: "A Prometheus job has disappeared\n VALUE : {{ $value }}\n LABELS : {{ $labels }}"
      labels:
        severity: warning
To add a rule to the Amazon Managed Service for Prometheus server, first encode the rules file in a base64 format. I used OpenSSL to base64-encode the YAML rules file as follows:
openssl base64 -in <input file> -out <output file>
Once the file has been base64-encoded, it can be added to the Amazon Managed Service for Prometheus server via this CLI syntax:
aws amp create-rule-group-namespace --data file://<path to base64-encoded file> --name <Namespace> --workspace-id <workspace_id> --region <region>
You can also upload the rule file via the Amazon Managed Service for Prometheus console.
After uploading the rule file, the rule group namespace is created and moves from a Creating to an Active status. Clicking the namespace link in Amazon Managed Service for Prometheus shows that the rule has been successfully imported. See Figure 2.
Figure 2: The alerting rule has been successfully imported into the Amazon Managed Service for Prometheus workspace
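The namespace status can also be checked with the AWS CLI, using the same namespace name and workspace ID passed to create-rule-group-namespace:
aws amp describe-rule-group-namespace --workspace-id <workspace_id> --name <Namespace> --region <region>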
SNS and Lambda configuration
Before configuring Alert Manager, set up an SNS topic that Amazon Managed Service for Prometheus will use to send alerts. Once the topic has been created, grant Amazon Managed Service for Prometheus permission to publish to it. To do so, go to the Access Policy section of the SNS topic in the SNS console and add the following statement, replacing <region_code>, <account_id>, and <topic_name> with the actual values:
{
    "Effect": "Allow",
    "Principal": {
        "Service": "aps.amazonaws.com"
    },
    "Action": [
        "sns:Publish",
        "sns:GetTopicAttributes"
    ],
    "Resource": "arn:aws:sns:<region_code>:<account_id>:<topic_name>"
}
This access policy grants Amazon Managed Service for Prometheus the sns:Publish and sns:GetTopicAttributes permissions for the SNS topic identified in the Resource section.
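If you prefer to script this step, a sketch with the AWS CLI follows; policy.json is assumed to be a full policy document (a Version plus a Statement array) that contains the statement above.
aws sns create-topic --name <topic_name> --region <region>
aws sns set-topic-attributes --topic-arn arn:aws:sns:<region_code>:<account_id>:<topic_name> --attribute-name Policy --attribute-value file://policy.json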
Next, create a Lambda function that is triggered by messages published to the SNS topic created above. Because the Alert Manager configuration will eventually be written in YAML, this Lambda function parses the YAML message body, rebuilds it as JSON, and sends the result to the PagerDuty API. The function uses the PyYAML library, so to make the library available within Lambda, I must create a deployment package with dependencies (a packaging sketch follows the function code). Then, I set up the Lambda function as a subscriber to the SNS topic just created.
import urllib3
import json
import yaml

http = urllib3.PoolManager()

def lambda_handler(event, context):
    # In this implementation, payload.summary is set to description (to mimic pagerduty_config.description)
    # In this implementation, payload.source is set to client_url
    url = "https://events.pagerduty.com/v2/enqueue"
    msg = yaml.safe_load(event['Records'][0]['Sns']['Message'])
    details = None
    links = None
    summary = None
    client_url = None
    severity = None
    routing_key = None
    ############################################################
    # Pull the PagerDuty-specific keys out of the message body
    if 'description' in msg.keys():
        summary = msg['description']
        msg.pop('description')
    if 'client_url' in msg.keys():
        client_url = msg['client_url']
        msg.pop('client_url')
    if 'severity' in msg.keys():
        severity = msg['severity']
        msg.pop('severity')
    if 'details' in msg.keys():
        details = msg['details']
        msg.pop('details')
    if 'links' in msg.keys():
        links = msg['links']
        msg.pop('links')
    # Remove the integration key before logging the payload
    if 'routing_key' in msg.keys():
        routing_key = msg['routing_key']
        msg.pop('routing_key')
    ############################################################
    # Set event_action based on whether the alert has been resolved
    if event['Records'][0]['Sns']['Subject'].find('[RESOLVED]') > -1:
        msg.update({"event_action": "resolve"})
    else:
        msg.update({"event_action": "trigger"})
    # Rebuild the payload block that the PagerDuty API expects
    payload = {"payload": {"client_url": client_url, "severity": severity, "summary": summary, "source": client_url}}
    msg.update(payload)
    # Add custom details
    if details is not None and len(details) > 0:
        msg["payload"].update({"custom_details": details})
    # Add links
    if links is not None and len(links) > 0:
        msg["links"] = links
    # Send the event to PagerDuty, passing the integration key in the x-routing-key header
    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', url, body=encoded_msg, headers={'x-routing-key': routing_key})
    print({
        "message": msg,
        "status_code": resp.status,
        "response": resp.data
    })
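To make PyYAML available and wire the function to the topic, a sketch using pip and the AWS CLI is shown below; the function name, file names, runtime, and role ARN are assumptions for illustration.
pip install pyyaml --target ./package
cp lambda_function.py ./package/
cd package && zip -r ../function.zip . && cd ..
aws lambda create-function --function-name amp-to-pagerduty --runtime python3.9 --handler lambda_function.lambda_handler --zip-file fileb://function.zip --role <lambda_execution_role_arn>
aws sns subscribe --topic-arn <topic_arn> --protocol lambda --notification-endpoint <lambda_function_arn>
aws lambda add-permission --function-name amp-to-pagerduty --statement-id AllowSNSInvoke --action lambda:InvokeFunction --principal sns.amazonaws.com --source-arn <topic_arn>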
Alert Manager configuration
Now that the SNS topic and Lambda function are in place, let’s configure the Amazon Managed Service for Prometheus Alert Manager.
The OSS Alert Manager has the following structure for a PagerDuty receiver:
pagerduty_config:
  - send_resolved: true
    routing_key: <tmpl_secret>
    service_key: <tmpl_secret> # only used when using integration type prometheus
    client_url: <tmpl_string> # link to be included in the alert in PD
    severity: <tmpl_string> # error, info
    description: <tmpl_string> # description of the alert
    details: { <string>: <tmpl_string>, ... } # arbitrary dictionary
    links: [....<link_config>...]
I mimic this interface in the Amazon Managed Service for Prometheus Alert Manager definition so that existing OSS Alert Manager configurations are easy to lift and shift into Amazon Managed Service for Prometheus Alert Manager. To do this, I use the SNS receiver block in the Alert Manager definition and, under the message block, create keys that mimic the PagerDuty configuration.
sns_configs:
  - send_resolved: true
    topic_arn: <topic_arn>
    sigv4:
      region: <region>
    message: |
      routing_key: <tmpl_secret>
      dedup_key: <tmpl_string> # necessary to resolve the alert in PD
      client_url: <tmpl_string> # link to be included in the alert in PD
      severity: <tmpl_string> # error, info
      description: <tmpl_string> # description of the alert
      details: { <string>: <tmpl_string>, ... } # arbitrary dictionary
      links: [....<link_config>...]
The Lambda function transforms the properties under the message block into JSON, in a structure that the PagerDuty API understands. Note that the Amazon Managed Service for Prometheus Alert Manager configuration must be wrapped in an alertmanager_config block at the root of the YAML file.
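For reference, a minimal sketch of the complete definition file; the single route and the receiver name (pagerduty-sns) are illustrative choices, and the remaining message keys follow the block shown above.
alertmanager_config: |
  route:
    receiver: 'pagerduty-sns'
  receivers:
    - name: 'pagerduty-sns'
      sns_configs:
        - send_resolved: true
          topic_arn: <topic_arn>
          sigv4:
            region: <region>
          message: |
            routing_key: <tmpl_secret>
            description: <tmpl_string>
            severity: <tmpl_string>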
Just like the process for importing Amazon Managed Service for Prometheus rules, the YAML file for Alert Manager must be base64-encoded. Once it has been base64-encoded, it is added to the Amazon Managed Service for Prometheus server via the following CLI syntax:
aws amp create-alert-manager-definition --data file://<path to base64-encoded file> --workspace-id <workspace_id> --region <region>
Likewise, the Alert Manager configuration can also be uploaded via the Amazon Managed Service for Prometheus console.
After a few moments, the Amazon Managed Service for Prometheus Alert Manager definition transitions to an Active status, and I can see the alert configuration I created. See Figure 3.
Figure 3: The alert has been successfully created in Amazon Managed Service for Prometheus Alert Manager
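The status of the Alert Manager definition can also be checked with the AWS CLI:
aws amp describe-alert-manager-definition --workspace-id <workspace_id> --region <region>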
Testing out the solution
The Amazon Managed Service for Prometheus rule that was set up is designed to fire when a node exporter job goes down. To test out the full solution, simply stop the node exporter service being monitored. Once the Amazon Managed Service for Prometheus rule fires, Amazon Managed Service for Prometheus Alert Manager sends the alert to the SNS topic that the Lambda function is subscribed to. Lambda then parses the SNS message and sends it off to PagerDuty. In a moment, the PagerDuty dashboard is updated with a new incident. See Figure 4.
Figure 4: An alert successfully pushed to PagerDuty via Amazon Managed Service for Prometheus Alert Manager
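Assuming the node exporter runs as a systemd service named node_exporter on the test host (an assumption about your setup), stopping and later restarting it might look like the following.
sudo systemctl stop node_exporter    # the alert fires after the rule's 1m "for" window
sudo systemctl start node_exporter   # restore the job once the incident appears in PagerDuty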
After restarting the node exporter job, Amazon Managed Service for Prometheus Alert Manager detects that the issue has been resolved and, because send_resolved is set to true, sends a resolution message. Once PagerDuty receives the message, it marks the incident as resolved within the PagerDuty dashboard. See Figure 5.
Figure 5: An alert automatically resolved in PagerDuty
Conclusion
This post demonstrated a simple pattern for migrating alerting mechanisms from OSS Alert Manager to Amazon Managed Service for Prometheus Alert Manager. Combining the Amazon Managed Service for Prometheus Alert Manager, an SNS topic, and a subscribed Lambda function lets you send alerts from Amazon Managed Service for Prometheus to PagerDuty. This pattern is especially helpful for organizations migrating their existing OSS Alert Manager configurations to Amazon Managed Service for Prometheus Alert Manager.
For more details on Amazon Managed Service for Prometheus Alert Manager, check out our documentation.