AWS Startups Blog

FogLogic on Three Next Practices for Optimizing SAP Service Levels Leveraging AIOps

Guest post by Ashok Santhanam, CEO and Co-founder of FogLogic

As an SAP Operations Manager you need to ensure that IT systems are operating at maximum efficiency and in a Known Good State. As a result, you are under constant pressure to meet service level targets and performance metrics, such as Mean-Time-to-Resolution (MTTR). In this article, I address the three most important next practices for addressing this challenge.

But first, you may be asking, what do you mean by “next practice”? The pace of change in today’s digitalized business environment means that what has worked in the past—as codified by best practices—increasingly will not work moving forward. This has given rise to the concept of next practices.  Next practices do not focus on improving existing processes since existing processes are becoming increasingly obsoleted by transformative technologies. They deal with the best ways to rethink your processes for the future, leveraging transformative technologies like AI and machine learning, to make your processes smarter.

My three recommended next practices address three key pain points in achieving service-level targets for mission-critical enterprise environments like your SAP applications landscape. They include (1) alert fatigue, (2) making sense of your data, and (3) eliminating trial and error problem remediation.

ALERT FATIGUE

You receive a barrage of alerts every day and over 90% of these alerts you ignore reflexively. This is despite the risk of negatively impacting team productivity and MTTR by prematurely acting on the wrong alerts and dismissing consequential warnings. This results in Alert fatigue. The root cause of alert fatigue is reliance on a hodgepodge of unintelligent, legacy tools used to monitor a growing landscape of systems and solution stack layers.

NEXT PRACTICE #1

To eliminate alert fatigue, leverage AI/ML-based Anomaly Detection to alert you only when real anomalies in system behavior are detected.

For this to work, you need a solution that first centralizes and consolidates data across your entire SAP landscape and solution stacks and then applies machine learning to perform Behavior Profiling. Dynamic Thresholds can then be deployed to trigger alerts and notifications based on known or expected behavior. Rather than being determined by SAP recommended default values and manually adjusted over time, Dynamic Thresholds continuously monitor and learn from actual performance behavior. Finally, AI-guided Scoring prioritizes anomalies for further evaluation based on a holistic view that considers several factors like anomaly magnitude, frequency, and clustering. With this approach, you manage by exception exclusively, seeing only anomalies you need to pay attention to and further investigate.

MAKING SENSE OF YOUR DATA

Another key challenge is isolating the root cause of system performance problems so that you can recommend and apply effective fixes quickly. This is a vexing task because there is no efficient and repeatable process for making sense of all the system performance and behavior data coming in from your many monitoring tools. Moreover, if you are like most of your industry peers, you are not clear where to start and who to involve in an incident investigation. And, because your company’s expertise is in different organizational silos (including that of your outsourcing partners), you find yourself relying on tribal knowledge. This can inhibit your ability to drive more quickly to a holistic view and have high confidence in the recommended remediations.

NEXT PRACTICE #2

To make sense of your data and isolate the root cause of incidents quickly, use AI/ML to perform Incident contextualization and minimize the on-going reliance on specialist tribal knowledge.

Modern incident contextualization consists of three primary components. First, Policy-based Escalation manages the hand-off from the anomaly detection process to the incident investigation team based on rules that are continually tuned leveraging the cumulative industry experience and learnings of the users of the tool, as well as user domain knowledge. Second, Accelerated Root Cause Isolation leverages machine learning algorithms to determine metric correlation, incident co-occurrence, and seasonality effects, based on time series and log analysis. Finally, an Automated Recommendation engine links the root cause to the relevant SAP knowledge base of suggested best practices and remediation actions. Thus, it focuses the remediation process on a narrower “curated” set of probable root cause options.

ELIMINATING TRIAL & ERROR REMEDIATION

The last key challenge I want to raise is the painstaking trial and error approach your team must take to resolve incidents. This is time-consuming and doesn’t lend itself to continuous process improvement and more effective and predictable outcomes via systematic learning.

NEXT PRACTICE #3

To retire your trial & error approach to incident remediation, leverage AI/ML and automation technologies to implement Proactive Incident Resolution and closed-loop learning.

A modern solution for proactive resolution—that addresses the pain points of trial and error remediation and lack of learning—consists of three primary components. First, Rules-driven Mobilization automates policies for triggering and routing remediation team activity and integrates with ITSM systems. Remediation Workflow is at the heart of a proactive resolution solution. It standardizes communication across teams for tasks like applying, testing, and reporting on fixes—leveraging integrations with collaboration tools like Slack and Microsoft Teams. Finally, Closed-loop learning ensures that information about what worked and what didn’t for a given problem are fed back into the knowledge base so that automated recommendations improve over time.

CONCLUSION

Given expected skills shortages and increasing service level expectations across a growing landscape of IT systems and initiatives, you need to scale your anomaly detection, root cause analysis and incident resolution process in a way that leverages resources a lot more efficiently. Simply throwing more automation at your existing process may not be a viable option. FogLogic invites you to learn more about its integrated AIOps solution for incident lifecycle management that ensures that you are always operating in a Known Good State.