AWS Cloud Operations Blog

Operational insights in Systems Manager OpsCenter help you identify duplicate issues and noisy event sources

If you use AWS Systems Manager OpsCenter, you might be familiar with the challenges of large numbers of OpsItems. When the same problem causes the creation of a significant number of OpsItems, it can be hard to see that these OpsItems are in fact the result of a single issue. It can also be difficult to see other unique issues in the noise, which can cause you to miss critical issues. Although it’s good practice, it takes time to close a lot of related OpsItems. If you overlook them and leave them open, you might waste time later when you’re troubleshooting an issue.

Operational insights, a new OpsCenter feature, can help you improve operational efficiency by:

  • Identifying noisy and duplicate OpsItems.
  • Providing recommendations and suggested automations to reduce the creation of unnecessary OpsItems.
  • Resolving OpsItems in bulk.

The feature currently provides two insights:

  • Duplicate OpsItems: This insight helps you identify OpsItems that might have the same root cause and are duplicates. We identify these by collecting OpsItems with the same title and resource.
  • Sources generating most OpsItems: This insight helps you identify sources that are generating more than the expected number of OpsItems. We identify these by collecting OpsItems with the same title, but no resource.
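The grouping rules behind the two insights can be sketched in a few lines of Python. This is only an illustration of the "same title and resource" and "same title, but no resource" criteria described above, not the service's implementation. The `OpsItem` class, the `min_count` threshold, and the sample titles and instance IDs are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OpsItem:
    """Simplified stand-in for an OpsItem (hypothetical, not an AWS type)."""
    title: str
    resource: Optional[str]  # None models an OpsItem created without a resource


def duplicate_groups(items, min_count=2):
    """Count OpsItems that share both title and resource.

    min_count is a hypothetical threshold chosen for this sketch.
    """
    counts = Counter((i.title, i.resource) for i in items if i.resource is not None)
    return {key: n for key, n in counts.items() if n >= min_count}


def noisy_sources(items, min_count=2):
    """Count OpsItems that share a title but have no resource."""
    counts = Counter(i.title for i in items if i.resource is None)
    return {title: n for title, n in counts.items() if n >= min_count}


# Sample data: 12 state-change OpsItems for one instance, 5 OpsItems with
# no resource, and one unrelated OpsItem that should not be grouped.
items = (
    [OpsItem("EC2 Instance State-change Notification", "i-0123456789abcdef0")] * 12
    + [OpsItem("Custom application alarm", None)] * 5
    + [OpsItem("One-off issue", "i-0fedcba9876543210")]
)

print(duplicate_groups(items))
print(noisy_sources(items))
```

In this sketch, the 12 state-change OpsItems fall into one duplicate group, the five OpsItems with no resource appear as a noisy source, and the one-off issue is left alone.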

Note: These insights are not generated in real time, but through a scheduled batch process. This means you must wait for that process to run before you see any insights. If you follow along with the example in this post, generate a few events and come back later to explore the generated insights.

In this blog post, we share an example of an Amazon EventBridge rule that creates an OpsItem when a virtual machine (VM) changes state (for example, goes from running to stopped). In our example, the VM is an Amazon Elastic Compute Cloud (Amazon EC2) instance. We’ll show you how to view an operational insight, how to reduce unnecessary duplicate and noisy OpsItems occurring in the future, and how to bulk-resolve the items identified by these insights.

Prerequisites

To follow the steps in this post, you need to enable Systems Manager and Operational insights in your account. To enable Systems Manager, follow the steps in the Manage instances using AWS Systems Manager Quick Setup blog post. If you want to customize Systems Manager, see Setting up AWS Systems Manager in the AWS Systems Manager User Guide.

To enable Operational insights, from the left navigation pane of the Systems Manager console, choose OpsCenter. Under Operational insights, choose Enable, as shown in Figure 1.

The text under Operational insights explains that insights reduce noise by identifying duplicate OpsItems or sources with unusual activity. When the feature is enabled, OpsCenter creates a service-linked role named AWSServiceRoleForAmazonSSM_OpsInsights.

Figure 1: Enable Operational insights

Create the EventBridge rule

In the EventBridge console, create a rule and complete the fields as shown in Figure 2. In Define pattern, choose Event pattern. Under Event matching pattern, choose Pre-defined pattern by service. For Service provider, choose AWS. For Service name, choose EC2. For Event type, choose EC2 Instance State-change Notification. Under Select targets, for Target, choose SSM OpsItem.

The fields for the EventBridge rule are set as described in the post.


Figure 2: EventBridge rule with event pattern and targets defined
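For reference, the console choices above correspond to an event pattern that matches on the event source and detail type. The following Python sketch builds that pattern and checks a sample event against it. The `matches` function is a deliberately simplified stand-in for EventBridge matching (flat, exact-value fields only), and the sample event, account ID, and instance ID are made up for illustration.

```python
import json

# Event pattern equivalent to choosing service EC2 and event type
# "EC2 Instance State-change Notification" in the console.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
}


def matches(pattern, event):
    """Simplified matcher: each pattern key must appear in the event with
    one of the listed values. Nested fields and operators are not handled."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# Trimmed-down sample event (illustrative values only).
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "resources": ["arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0"],
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

Because the pattern does not filter on `detail.state`, every state change for every instance matches, which is why a few stop and start cycles are enough to produce many OpsItems.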

After you have created the EventBridge rule, you can create an EC2 instance and stop and start it a few times to generate an operational insight. In our example, we created OpsItems for 12 state changes for one EC2 instance. Because a full cycle progresses through four states (Pending, Running, Stopping, and Stopped), this represents three full cycles of starting and stopping the instance.

Figure 3 shows the number of open operational insights (one Duplicate OpsItems insight). A maximum of 25 insights can be open at any time. After that limit is reached, no new insights are created.

To view the insight generated from our EventBridge rule, choose View all operational insights.

Under Insight type, there is one open duplicate OpsItem.

Figure 3: Operational insights

If you have lots of OpsItems, you can filter from the All insight types dropdown. You’ll see the EC2 instance has changed state a number of times, which created multiple OpsItems. It is identified as a duplicate because the OpsItems have the same title and resource (the EC2 instance).

On Operational insights, there are table columns for insight type (in this example, Duplicate OpsItem), ID, title, last updated date, and status.

Figure 4: Multiple OpsItems created with the same title “EC2 Instance State-change Notification”

Reduce duplicate OpsItems

We want to create a single OpsItem instead of multiple OpsItems.

Choose the insight ID to open the details page for the operational insight. Figure 5 shows the details (insight type, number of affected OpsItems, description, status, date created, and last updated date).

The insight type is Duplicate OpsItems. The number of affected OpsItems is 12. The status is Open.

Figure 5: Insight Details

Recommended runbooks

You can see recommended actions you can take in the form of runbooks. The runbooks vary. They are recommended to help you to reduce noise, not to resolve the underlying issue that triggered the OpsItem. You should do root cause analysis and take appropriate action to remediate the issue.

The first recommended runbook is for adding a deduplication string to the EventBridge rule. The second runbook is for bulk resolution of the OpsItems.

Figure 6: Recommended runbooks

When multiple runbooks are recommended, apply them in the order provided. In Figure 6, the recommendation is to apply the AWS-AddOpsItemDedupStringToEventBridgeRule runbook to add a deduplication string to reduce the number of duplicate OpsItems and to apply the AWS-BulkResolveOpsItems runbook to resolve all the OpsItems already created.

View runbook automations

The details page shows the automation history, which includes any runbooks you have executed. It also includes a Tips section. In Figure 7, the tip is to add a deduplication string. The AWS-AddOpsItemDedupStringToEventBridgeRule runbook will help us with that.

Under Automation executions in the last 30 days, the table is empty.

Figure 7: Automation execution history

Add a deduplication string

When you build an EventBridge rule to create an OpsItem, you have the option to specify a deduplication string. If you specify a deduplication string, an OpsItem is created only if there are no other open OpsItems with the same deduplication string. For more information, see Working with deduplication strings in the AWS Systems Manager User Guide.
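To make the deduplication behavior concrete, here is a toy model in Python. It is a sketch of the rule stated above (no new OpsItem while an open one has the same deduplication string), not OpsCenter's implementation. The `OpsItemStore` class is hypothetical, and the model knows only two statuses, Open and Resolved.

```python
from dataclasses import dataclass, field


@dataclass
class OpsItemStore:
    """Toy model of OpsItem creation with deduplication strings (hypothetical)."""
    items: list = field(default_factory=list)

    def create(self, title, dedup_string=None):
        """Create an OpsItem unless an open one has the same dedup string."""
        if dedup_string is not None:
            for item in self.items:
                if item["status"] == "Open" and item["dedup"] == dedup_string:
                    return None  # suppressed as a duplicate
        item = {"title": title, "dedup": dedup_string, "status": "Open"}
        self.items.append(item)
        return item


store = OpsItemStore()
title = "EC2 Instance State-change Notification"

# Twelve state-change events, all carrying the same static deduplication string.
created = [store.create(title, dedup_string=title) for _ in range(12)]
print(sum(c is not None for c in created))  # 1

# After the open OpsItem is resolved, the next event creates a new one.
store.items[0]["status"] = "Resolved"
print(store.create(title, dedup_string=title) is not None)  # True
```

Without a deduplication string (`dedup_string=None`), the same twelve calls would create twelve OpsItems, which matches the behavior seen earlier in this post.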

In our example, the recommended deduplication string, EC2 Instance State-change Notification, is in the runbook description and the Tips section of the operational insight.

To execute the runbook, choose it in the list and then choose Execute.

Under Recommended runbooks, AWS-AddOpsItemDedupStringToEventBridgeRule is selected.

Figure 8: Recommended runbooks with AWS-AddOpsItemDedupStringToEventBridgeRule selected

Figure 9 shows the required input parameters for this runbook. Recommendations from the operational insight are already populated. You do not need to change these values, but you can modify the value in DedupString, if you like. Choose Execute to execute the runbook.

The RuleName value is EC2-State-Change. The DedupString value is EC2 Instance State-change Notification.

Figure 9: Runbook input parameters

View the EventBridge rule

Go to your EventBridge rule. In Figure 10, you’ll see that the runbook has added input transformations to the target (previously empty). This is how you tell EventBridge to modify the event information before sending it to create an OpsItem.

This change means that every event the rule sends to create an OpsItem, for every EC2 instance state change, carries the same deduplication string. So, as long as the first OpsItem remains open, only the first EC2 state change event creates an OpsItem, which reduces noise.

For more information, see Transforming Amazon EventBridge target input in the Amazon EventBridge User Guide.
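An EventBridge input transformer has two parts: an input paths map that extracts values from the event, and an input template that uses `<name>` placeholders for those values. The Python sketch below imitates that mechanism to show why a static deduplication string in the template gives every event the same value. The paths map and template here are assumptions for illustration and might not match what the runbook writes. Only the `dedupString` name comes from the rule shown in Figure 10. The path resolver handles simple `$.a.b` paths only and does no JSON escaping.

```python
import json
import re

# Hypothetical transformer settings, for illustration only.
INPUT_PATHS_MAP = {"detail-type": "$.detail-type", "source": "$.source"}
INPUT_TEMPLATE = (
    '{"title": "<detail-type>", "source": "<source>", '
    '"dedupString": "EC2 Instance State-change Notification"}'
)


def resolve_path(event, path):
    """Resolve a simple '$.a.b' path; wildcards and array indexes are not handled."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = event
    for part in path[2:].split("."):
        value = value[part]
    return value


def transform(event, paths_map=INPUT_PATHS_MAP, template=INPUT_TEMPLATE):
    """Extract the mapped values, then substitute each <name> placeholder."""
    values = {name: resolve_path(event, path) for name, path in paths_map.items()}
    rendered = re.sub(r"<([^>]+)>", lambda m: str(values[m.group(1)]), template)
    return json.loads(rendered)


def state_change(instance_id, state):
    """Build a trimmed-down EC2 state-change event (illustrative values)."""
    return {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": instance_id, "state": state},
    }


first = transform(state_change("i-0123456789abcdef0", "stopping"))
second = transform(state_change("i-0fedcba9876543210", "running"))
print(first["dedupString"] == second["dedupString"])  # True
```

Because the deduplication string is a literal in the template and not a value extracted from the event, events from different instances and different states all produce the same string.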

EventBridge rule showing the addition of a deduplication string after executing the runbook. The deduplication string is included as a dedupString parameter in the Input transformer section of the EventBridge rule.

Figure 10: EventBridge rule showing the addition of a deduplication string after executing the runbook

Resolve OpsItems in bulk

We executed the runbook to add the deduplication string to reduce future noise, but we still have an operational insight with multiple OpsItems open. We can now run the second runbook to bulk-resolve all the OpsItems in this insight.

The Operational insights feature introduced a new runbook that resolves multiple OpsItems in a single operation. Choose AWS-BulkResolveOpsItems and then choose Execute.

Under Recommended runbooks, AWS-BulkResolveOpsItems is selected.

Figure 11: Recommended runbooks with AWS-BulkResolveOpsItems selected

As with the previous runbook, the required parameters are already populated. Choose Execute.

On the OpsCenter summary page, the OpsItems are no longer open. They are set to Resolved.

Resolve the insight

Your insight might still be visible after you complete these runbooks. That’s because operational insights are not generated in real time, but through a scheduled batch process. After this process runs again, if all the conditions for the insight are no longer satisfied, the insight will be resolved for you.

Change the state of the instance

After your insight has resolved itself, try changing the state of your EC2 instance a few times. Now that you have applied a static deduplication string, you should only see a single OpsItem, regardless of the number of state changes or EC2 instances.

Identify the sources generating the most OpsItems

Some OpsItems do not have a resource specified, but can still be noisy. These might be misconfigured rules, rules without sufficient useful information to act on, or noisy sources we want to adjust our rules for.

Consider our example rule: If we had created the OpsItem without a resource ID, how would we know which EC2 resources to troubleshoot? This is an example of a misconfigured rule that we would want to modify.

The second type of operational insight helps when we have multiple OpsItems with the same title, but no resource information. What’s different for these operational insights?

The only difference is the recommended resolution. In this case, the recommended runbooks (in order) are AWS-DisableEventBridgeRule to disable the EventBridge rule and AWS-BulkResolveOpsItems to resolve all the items already created.

The deduplication string runbook is not recommended, because it would have no impact in reducing the number of OpsItems created.

As with our previous example, run both runbooks in order.

Disabling the EventBridge rule is the only way to reduce the noise generated by these OpsItems. Before you disable the rule, consider whether you need to be aware of the event’s occurrence. If you do, you can use bulk resolution to save time.

Pricing

The creation of an operational insight is charged at the same rate as the creation of an OpsItem. Each runbook carries a cost of two API calls for each resolved OpsItem. For more information, see AWS Systems Manager pricing.
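As a rough planning aid, the number of API calls attributable to a bulk resolution can be estimated from the figure above. This sketch only counts calls, using the stated two API calls per resolved OpsItem. It does not compute a dollar amount, because rates are listed on the pricing page and are not reproduced here. The function name is hypothetical.

```python
API_CALLS_PER_RESOLVED_OPSITEM = 2  # figure stated in the Pricing section


def bulk_resolve_api_calls(resolved_opsitems: int) -> int:
    """Estimate the API calls incurred when resolving OpsItems in bulk."""
    if resolved_opsitems < 0:
        raise ValueError("resolved_opsitems must be non-negative")
    return API_CALLS_PER_RESOLVED_OPSITEM * resolved_opsitems


# The example insight in this post had 12 affected OpsItems.
print(bulk_resolve_api_calls(12))  # 24
```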

Cleanup

To avoid charges in your account, delete the resources you created.

To clean up your EC2 instance, see the documentation for Terminate your instance. To clean up your EventBridge rule, see the documentation on Disabling or deleting an Amazon EventBridge rule.

To disable operational insights, see Disabling operational insights. Disabling operational insights stops new insights from being created, but it does not remove existing insights.

Conclusion

In this post, we introduced you to the Operational insights feature and shared an example to help you see the insight details, use recommended runbooks to reduce noise, and bulk-resolve existing OpsItems.

By acting on operational insights, you can make sure OpsItems contain appropriate events and improve your visibility and the time to resolution of operational issues. For more information, see Working with operational insights in the AWS Systems Manager User Guide.

About the authors

Helen Ashton

Helen Ashton is a Solutions Architect at AWS, based in Calgary, Canada. Helen is passionate about helping customers solve their business problems and progress through their cloud journey. Outside work she enjoys music, biking, and gardening.


Michael Heyd

Michael Heyd is a Solutions Architect with Amazon Web Services and is based in Vancouver, Canada. Michael works with enterprise AWS customers to transform their business through innovative use of cloud technologies. Outside work he enjoys board games and biking.