AWS Cloud Operations Blog
How to Automate Incident Response with PagerDuty and AWS Systems Manager Incident Manager
Incident response is a core operations capability for organizations to develop, and a core element in the AWS Cloud Adoption Framework (AWS CAF). Responding to operations incidents quickly is important to minimize their impacts. Automating incident response helps you scale your capabilities, rapidly reduce the recovery time, and reduce repetitive work by your cloud operations teams.
In this post, I show you how to use Incident Manager, a capability of AWS Systems Manager, together with PagerDuty to build an automated incident management and response solution.
You’ll walk through three common operations-related activities and see how Incident Manager helps you automate your response by centrally storing incident metrics, runbooks to resolve incidents, and post-incident analyses.
- Create an incident response plan: Planning for an incident begins long before the incident lifecycle. To prepare for an incident, it's essential to follow best practices by configuring alerting and engagement so that your teams become aware of incidents within your applications. It's also critical to be aware of existing third-party integrations such as PagerDuty so that you can leverage established user directories, chat channels, and automated escalation plans.
- Start and close an incident: Incident Manager, a capability of AWS Systems Manager, helps you manage and quickly respond to incidents. In this blog post, you will learn how to create an incident manually on the incident list page and observe the PagerDuty alerts and notifications. You can also use the StartIncident API action from the AWS CLI or the AWS SDK.
- Conduct post-incident analysis: When the system creates an incident, Incident Manager automatically collects information about the AWS resources involved in the incident and adds this information to the Related items tab. In this blog post, you will use post-incident analysis to guide you through identifying improvements to your incident response, including time to detection and mitigation. An analysis can also help you understand the root cause of the incidents. Incident Manager creates recommended action items to improve your incident response.
Prerequisites
If this is your first time using Incident Manager, follow the initial onboarding steps in Getting prepared with Incident Manager.
For this walkthrough, you should have the following prerequisites:
- An AWS account and AWS Identity and Access Management (IAM) permissions to access Systems Manager, Incident Manager, AWS KMS, and Secrets Manager. Your IAM user or role should also have the iam:CreateServiceLinkedRole permission. Incident Manager uses this permission to create the service-linked role.
- A PagerDuty account with an existing service and permission to create API access keys. You will store the key in AWS Secrets Manager so that Incident Manager can communicate securely with PagerDuty.
Overview of solution
This blog post will help you understand how to integrate Incident Manager with your existing PagerDuty notification plan to improve and streamline engagement. By ingesting response plan alerts from Incident Manager, PagerDuty can remain the central hub of your incident management and AIOps processes. Incident Manager lets you centrally store and automate those processes, which reduces time and toil for your responders, reduces cost for your business owners, and increases availability for your customers. Together, these outcomes help build a reputation for brand trust and product reliability in the marketplace.
Walkthrough
You’ll walk through three common operations-related activities and how you can use Incident Manager to automate your AWS Response Plan with your existing PagerDuty service.
To integrate Incident Manager with PagerDuty, you must first create an API Key within PagerDuty’s console that will be stored using AWS Secrets Manager. This credential allows Incident Manager to securely communicate with your existing PagerDuty service. You will then create a Response Plan with Incident Manager and select PagerDuty as the third-party integration. Once a Response Plan is triggered, Incident Manager will leverage your existing PagerDuty paging on-call structure, user and team directory, and escalation policies.
Create PagerDuty API Key
To integrate Incident Manager with PagerDuty, you must first create your API Access Key within the PagerDuty Console. This key will then be used to create a secret in AWS Secrets Manager that contains your PagerDuty credentials. You can then include a PagerDuty service in the response plan that you create below inside Incident Manager.
- Access PagerDuty Console
- Select Integrations from the PagerDuty Console Menu
- Select Developer Tools
- Select API Access Keys
- Select “+Create New API Key”
- Enter API Key Description and select “Create Key”
To store PagerDuty access credentials in an AWS Secrets Manager secret, follow the steps in Create an AWS Secrets Manager secret in the AWS Secrets Manager User Guide.
Open the Secrets Manager console at https://console.aws.amazon.com/secretsmanager/
Choose Store a new secret.
- For Secret type, choose Other type of secret.
- Choose the Plaintext tab
- Replace the default contents of the box with the following JSON structure:
- For Encryption key, choose a customer managed key you created that meets the requirements listed under the previous Prerequisites section.
- For Resource permissions, do the following:
Expand Resource permissions.
Choose Edit permissions.
Replace the default contents of the policy box with the following JSON structure:
Choose Save.
On the Review page, review your secret details, and then choose Store.
Secrets Manager returns to the list of secrets. If your new secret doesn’t appear, choose the refresh button. This secret will be used during the Response Plan setup.
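If you prefer to script the console steps above, the following Python sketch shows one way to build the secret value and the resource policy. The `pagerDutyToken` field name and the `ssm-contacts.amazonaws.com` service principal reflect my understanding of the integration and should be confirmed against the Incident Manager User Guide. The function names are hypothetical, and `client` is assumed to be a boto3 Secrets Manager client such as `boto3.client("secretsmanager")`.

```python
import json


def build_secret_string(pagerduty_api_key: str) -> str:
    """Return the plaintext secret body that holds the PagerDuty API key.

    Assumption: Incident Manager reads the key from a field named
    "pagerDutyToken". Verify the field name in the current documentation.
    """
    return json.dumps({"pagerDutyToken": pagerduty_api_key})


def build_resource_policy() -> str:
    """Return a resource policy that lets Incident Manager read the secret.

    Assumption: ssm-contacts.amazonaws.com is the service principal that
    needs secretsmanager:GetSecretValue on this secret.
    """
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm-contacts.amazonaws.com"},
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "*",
            }
        ],
    }
    return json.dumps(policy)


def store_pagerduty_secret(client, name: str, api_key: str, kms_key_id: str) -> str:
    """Create the secret with a customer managed key, attach the policy,
    and return the ARN of the new secret."""
    created = client.create_secret(
        Name=name,
        SecretString=build_secret_string(api_key),
        KmsKeyId=kms_key_id,
    )
    client.put_resource_policy(
        SecretId=created["ARN"],
        ResourcePolicy=build_resource_policy(),
    )
    return created["ARN"]
```

Keeping the payload builders separate from the API calls makes it easy to review the JSON before anything is written to your account.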
Create an Incident Manager Response Plan
A response plan ties together the contacts, escalation plan, and runbook. When an incident occurs, a response plan defines who to engage, how to engage, which runbook to initiate, and which metrics to monitor. By creating a well-defined response plan, you can save your security team time down the road.
Create a Response Plan
Once you’ve created your API Key, Secret, and contacts, you can create a response plan to define how to respond to incidents and configure PagerDuty integration. Refer to the Best Practices for Response Plans.
Note: (Optional) You can also create additional contacts and escalation plans, which let you automatically alert adjacent subject matter experts and define how engagements for your contacts escalate. You can learn more in Add Contacts and Create an escalation plan.
To create a response plan
- Open the Incident Manager console, and choose Response plans in the left navigation pane.
- Choose Create response plan.
- Enter a unique and identifiable name for your response plan.
- Enter an incident title. The incident title helps to identify an incident on the incidents home page.
- Select an appropriate Impact based on the potential scope of the incident.
- Select the box for Third-Party integrations with PagerDuty.
- Note: The name of the PagerDuty integration service is populated automatically, as shown in Figure 7.
- (Optional) You can also create a runbook that can drive the incident mitigation and response. For further information, refer to Runbooks and automation.
- Under Execution permissions, choose Create an IAM role using a template. Under Role name, select the IAM role you created in the prerequisites that allows Incident Manager to run SSM automation documents, and then choose Create response plan.
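The same response plan can be expressed as a request to the CreateResponsePlan API. The sketch below only assembles the request; the function name and all argument values are hypothetical, and the field names follow my reading of the boto3 `ssm-incidents` client, so check them against the current API reference before use.

```python
def build_response_plan_request(
    name: str,
    incident_title: str,
    impact: int,
    pagerduty_service_name: str,
    pagerduty_service_id: str,
    secret_arn: str,
) -> dict:
    """Assemble keyword arguments for ssm-incidents create_response_plan."""
    # Incident Manager expresses impact as an integer from 1 (critical)
    # to 5 (no impact).
    if impact not in range(1, 6):
        raise ValueError("impact must be an integer from 1 to 5")
    return {
        "name": name,
        "incidentTemplate": {"title": incident_title, "impact": impact},
        "integrations": [
            {
                "pagerDutyConfiguration": {
                    # Display name of the PagerDuty service.
                    "name": pagerduty_service_name,
                    # Secrets Manager secret that stores the PagerDuty key.
                    "secretId": secret_arn,
                    "pagerDutyIncidentConfiguration": {
                        "serviceId": pagerduty_service_id
                    },
                }
            }
        ],
    }
```

With boto3 installed and credentials configured, you could pass the result to `boto3.client("ssm-incidents").create_response_plan(**request)`.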
Start Incident
You can invoke an Incident Manager response plan immediately, and its engagement plan will begin alerting your contacts through PagerDuty. PagerDuty will leverage existing user directories, escalation flows, and alerts to notify on-call teams.
To start an incident
- Select Start incident in the Incident Manager console
- Specify Response Plan
- (Optional) Override Title to more easily describe incident
- (Optional) Override Incident impact
- Select Start
- (Optional) Open Incident and select “Related items” tab to assist response team with additional artifacts related to the incident
Verify alerts and notifications in PagerDuty by logging in to the PagerDuty console. The PagerDuty on-call team will be notified using existing policies. Incident Manager also publishes timeline events as notes to the incident in PagerDuty and allows you to resolve PagerDuty incidents when you resolve the related incident in Incident Manager.
As incidents progress through analysis and resolution, you can use the Related items tab to automatically alert the PagerDuty on-call team to adjacent artifacts, such as a Jira ticket or a log file, by providing a URL, an ARN, or a link to an S3 object. The related items are then displayed automatically as “Notes” inside the PagerDuty console, further streamlining communications.
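As mentioned earlier, you can also start an incident with the StartIncident API action. The following sketch builds such a request, including the optional title and impact overrides and a few related items given as URLs. The function name is hypothetical, and the `relatedItems` shape (type `OTHER` with a `url` value) is my assumption about a reasonable default, so treat it as a starting point.

```python
from typing import Dict, Optional


def build_start_incident_request(
    response_plan_arn: str,
    title: Optional[str] = None,
    impact: Optional[int] = None,
    related_urls: Optional[Dict[str, str]] = None,
) -> dict:
    """Assemble keyword arguments for ssm-incidents start_incident.

    title and impact override the values in the response plan when set.
    related_urls maps a short item title to a URL, for example a ticket.
    """
    request: dict = {"responsePlanArn": response_plan_arn}
    if title:
        request["title"] = title
    if impact is not None:
        request["impact"] = impact
    if related_urls:
        request["relatedItems"] = [
            {
                "title": item_title,
                "identifier": {"type": "OTHER", "value": {"url": url}},
            }
            for item_title, url in related_urls.items()
        ]
    return request
```

Calling `boto3.client("ssm-incidents").start_incident(**request)` would then return a response that includes the ARN of the new incident record.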
Resolve Incident
Post-incident analysis guides you through identifying improvements to your incident response, including time to detection and mitigation. An analysis can also help you understand the root cause of the incidents. Incident Manager creates recommended action items to improve your incident response.
Conduct Post Incident Analysis
- Select Incident Manager
- Select Incident requiring resolution
- Select Resolve Incident
- Select Create Analysis
Once an incident is closed, you can verify in PagerDuty that the incident is acknowledged and resolved. This allows PagerDuty to centralize and track enterprise-wide SLA performance across various on-call teams.
- Discuss the incident with the impacted team by populating the overview, timeline, metrics, questions, and action items before completing the analysis.
- Select Complete to save a record of the incident for continuous improvement and to share learnings within your organization.
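To make the resolution step and the analysis metrics concrete, here is a small sketch. It resolves an incident through the UpdateIncidentRecord API and computes time to detection and time to mitigation from three timestamps. Both function names are hypothetical, `client` is assumed to be a boto3 `ssm-incidents` client, and measuring both durations from the start of impact is my assumption; your team may define these metrics differently.

```python
from datetime import datetime, timedelta
from typing import Dict


def resolve_incident(client, incident_record_arn: str) -> None:
    """Mark the incident record as resolved in Incident Manager."""
    client.update_incident_record(arn=incident_record_arn, status="RESOLVED")


def response_metrics(
    impact_start: datetime, detected: datetime, mitigated: datetime
) -> Dict[str, timedelta]:
    """Return time to detection and time to mitigation.

    Assumption: both durations are measured from the start of impact.
    """
    if not impact_start <= detected <= mitigated:
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detection": detected - impact_start,
        "time_to_mitigation": mitigated - impact_start,
    }
```

Tracking these two durations across incidents gives you a simple baseline for judging whether changes to alerting or runbooks are improving your response.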
To avoid incurring future charges, delete the resources you created in this walkthrough. Costs depend on your usage of the services mentioned in this post. For details, refer to Incident Manager pricing.
Conclusion
In this post, I showed you how to use Incident Manager to centrally store incident response plans, configure the PagerDuty integration, and perform post-incident analysis. I also demonstrated how to create an incident management and response plan that automates your response so that incidents are mitigated in a timely manner. To learn more about Incident Manager, see What Is AWS Systems Manager Incident Manager in the AWS documentation.
You can also integrate Incident Manager, a capability of AWS Systems Manager, with additional products and services. The following products and services can integrate with Incident Manager:
- ServiceNow – For more information, see Integrating AWS Systems Manager Incident Manager in ServiceNow.
- Jira Service Management – For more information, see Configuring Jira Service Management in the AWS Service Management Connector Administrator Guide.
- Jira Cloud – For more information, see Integrating AWS Systems Manager Incident Manager in the AWS Service Management Connector Administrator Guide.