Enhance your business process resilience with Amazon CloudWatch Application Insights Observability for SAP High Availability – Part 2

Introduction

In this second part of our blog series, we share more detail on each of the key capabilities of Amazon CloudWatch Application Insights. The goal is to help you use these insights to catch trends in your system, understand the overall health of your SAP landscape, and react accordingly to maximize business continuity.

The capabilities we covered in part 1 of the blog series were aimed at improving the resilience of SAP business processes. We explained that Application Insights can automatically detect issues with SAP applications and infrastructure, providing visibility into the health and performance of the overall system. This allows SAP customers to proactively identify and resolve problems before they impact critical business processes. We highlighted how Application Insights can correlate metrics, logs, and alerts to provide a comprehensive view of the SAP environment, helping teams quickly troubleshoot issues and maintain business continuity.

Why is observability for high availability in SAP systems crucial?

High availability (HA) refers to the ability of a system or application to remain operational and accessible even in the event of hardware, software, or network failures. Customers implement HA to ensure their critical SAP systems remain operational and accessible, minimizing downtime and ensuring business continuity by providing redundancy and failover mechanisms.

While HA solutions require additional investment in infrastructure, software, and operational complexity, the potential costs of downtime and data loss often far outweigh these costs for businesses with mission-critical workloads. HA helps prevent costly disruptions to SAP-dependent operations and helps meet strict uptime and performance SLAs.

The ultimate goal of implementing observability best practices for HA with Application Insights for SAP is to establish a robust, proactive, and data-driven approach to managing the resilience and performance of mission-critical SAP systems.

Best practices for SAP operational excellence and resilience

Organizations running mission-critical workloads such as SAP on AWS in an HA deployment (Figure-1) should implement best practices for maintenance event planning and prepare for unplanned events, so that downtime can be minimized. The SAP Lens of the AWS Well-Architected Framework is a useful starting point. Relevant highlights of the SAP Lens relating to resiliency and observability include:

Topics	Recommendations
Backup and Recovery	Implement and test regular backups (database, system, configurations). Store backups securely offsite for DR. Review and update backup/recovery procedures as systems change.
HA and DR	Implement HA solutions (clustering, load balancing) for failover. Develop and test DR plans for business continuity. Regularly review and update HA/DR plans based on requirements.
Incident Management	Establish clear procedures for prompt incident response. Define roles and responsibilities during incidents. Maintain up-to-date contact information.
Monitoring and Alerts	Implement Application Insights metrics to detect issues impacting SAP availability. Configure CloudWatch alarms on critical SAP components. Regularly review and tune monitoring rules and alerts.
Troubleshooting and Analysis	Leverage Application Insights logging, tracing, and anomaly detection. Maintain a knowledge base of common issues and resolution steps. Encourage knowledge sharing within the team. Share Application Insights dashboards across the organization.

Figure-1: SAP System in HA Deployment Pattern

In part 1 of this series, you learned about steps to protect the single points of failure (SPOFs) in an SAP system to improve availability. This includes safeguarding SAP ABAP Central Services (ASCS), Enqueue Replication Server (ERS), application servers, the HANA database, Web Dispatcher, and Network File System (NFS). The HA cluster plays an important role, particularly in protecting ASCS, ERS, and the HANA database.

Let’s deep-dive into the Application Insights for SAP HA metrics and their use-cases for monitoring the HA cluster. We will also look at the Problem Summary Dashboard, which helps you process ongoing problems with SAP system availability by summarizing the issue and providing troubleshooting steps to resolve it.

Primer on Pacemaker cluster

Pacemaker is an open-source HA cluster resource manager, and it relies on resource agents to control and monitor the various resources (such as applications, services, and IP addresses) that need to be managed in a clustered environment.

In a Pacemaker cluster, a node refers to an Amazon Elastic Compute Cloud (Amazon EC2) instance that is part of the cluster. Pacemaker clusters are designed to provide HA for critical SAP resources running on the nodes, such as SAP ASCS and ERS and the HANA primary and standby databases, by allowing these resources to be available across Availability Zones (AZs). Each node has resource agents that manage and monitor the availability of SAP resources in the cluster (Table-1). In the context of SAP applications, resource agents are responsible for the following tasks:

Starting and stopping SAP instances: The SAP resource agents can start, stop, and monitor the various components of an SAP system, such as the database, ASCS/ERS, and other related services.
Monitoring SAP instances: The agents continuously monitor the health and status of the SAP instances running within the cluster. If an instance fails or becomes unresponsive, the agent can detect this and initiate a failover to another node in the cluster.
Failover and failback: When a failure is detected, the SAP resource agents coordinate the failover process, moving the SAP instances from the failed node to a healthy node in the cluster. Similarly, when the failed node recovers, the agents can facilitate the failback process, moving the instances back to the original node.
Resource dependencies: SAP systems often have dependencies on other resources, such as file systems, IP addresses, or other services. The SAP resource agents can manage these dependencies and ensure that the required resources are available and properly configured before starting the SAP instances.

Resource Agent	Agent action and Use-case
SAPHana	Starts, stops, and monitors the SAP HANA database, and manages takeover with HANA System Replication (HSR).
SAPHanaTopology	Gathers the status of HSR and shares it with the HANA database nodes.
SAPInstance	Starts, stops, and monitors SAP application cluster nodes running ASCS and ERS processes.
SAPDatabase	Starts, stops, and monitors the SAP HANA database and other supported databases.
aws-vpc-move-ip	Updates route table entries to move the overlay IP address during an AZ failover event. This agent calls AWS control plane APIs such as ReplaceRoute and DescribeRouteTables.
aws-vpc-route53	Used by application servers to route traffic to database servers through private Domain Name System (DNS) host records. This configuration is not used for SAP systems in HA deployments that use an overlay IP address with a Transit Gateway (TGW) or Network Load Balancer (NLB).
fence agent	Stops a cluster node. This agent calls AWS control plane APIs such as DescribeInstances, StopInstances, and StartInstances.
Filesystems resource management	Manages mounting and unmounting of file systems in a classic SAP ASCS/ERS cluster setup.

Table-1: Pacemaker cluster resource agents
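
To make the aws-vpc-move-ip entry in Table-1 more concrete, here is a minimal Python sketch of the kind of request a route update corresponds to. It only builds the parameters for the EC2 ReplaceRoute API and does not call AWS. The route table ID, overlay IP address, and instance ID are hypothetical placeholders, and the real resource agent performs this step for you as part of the cluster failover.

```python
import ipaddress


def build_replace_route_params(route_table_id: str, overlay_ip: str,
                               target_instance_id: str) -> dict:
    """Build parameters for an EC2 ReplaceRoute request.

    The overlay IP is expressed as a /32 route that points at the
    instance that should receive traffic after the failover.
    """
    # Validate the address; raises ValueError on malformed input.
    address = ipaddress.ip_address(overlay_ip)
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": f"{address}/32",
        "InstanceId": target_instance_id,
    }


# Hypothetical values, for illustration only.
params = build_replace_route_params(
    "rtb-0123456789abcdef0", "192.168.10.10", "i-0123456789abcdef0"
)
print(params)
```

With boto3, a dictionary like this could be passed to an EC2 client as replace_route(**params). Treat this as an illustration of the API shape rather than a description of how the agent is implemented internally.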

Figure-2: SAP HANA Database – Pacemaker cluster status

Figure-3: SAP ASCS/ERS – Pacemaker cluster status

You can find more guidance on how to configure Pacemaker cluster in SAP NetWeaver on AWS: high availability configuration for SUSE Linux Enterprise Server (SLES) for SAP applications, SAP NetWeaver on AWS: high availability configuration for Red Hat Enterprise Linux (RHEL) for SAP applications and SAP HANA on AWS: High Availability Configuration Guide for SLES and RHEL.

Pacemaker cluster metrics for SAP ASCS, ERS and HANA database

Application Insights for SAP monitors Pacemaker cluster actions in near real time, triggers the configured alarms, and routes notifications to the respective teams. Pacemaker cluster events are written to Pacemaker log files, which the Application Insights for SAP Problem Summary Dashboard uses for root-cause analysis, and metric anomalies are analyzed so that you can implement preventive and corrective actions. You can view SAP availability through a single-pane-of-glass, pre-built dashboard that can be shared across the organization and AWS accounts.

Metric: sap_HA_check_failover_config_state
Check Pacemaker cluster configuration
Description: Verifies Pacemaker cluster failover configuration. You can check cluster status, resource dependencies, policies, monitoring, fencing, and failover readiness for SAP system.	Use-case: Proactively identify and resolve failover configuration issues. This metric ensures the SAP HA environment is prepared to handle failover seamlessly, minimizing downtime and potential data issues.

Metric: sap_HA_get_failover_config_HAActive
Check Pacemaker cluster status
Description: Validates HA config, monitors changes, aids troubleshooting during incidents, and automates monitoring/alerting for inactive cluster state.	Use-case: Provides visibility into SAP failover config state. Validates and monitors HA setup, responds to changes, and ensures SAP availability.

Figure-4: Pacemaker cluster status

Metric: ha_cluster_pacemaker_nodes
Cluster node failure
Description: Monitors Pacemaker cluster health, node status. Identifies potential cluster configuration issues or failures.	Use-case: Investigate and failover resources to healthy node or bring offline node online. Ensure SAP ASCS and HANA primary database run in same AZ.

Metric: ha_cluster_pacemaker_fail_count
Cluster node failover count
Description: Tracks resource failover count between Pacemaker cluster nodes. Monitor fail count increases/patterns to identify potential issues, ensuring SAP HA environment stability.	Use-case: Identifies frequent failovers indicating cluster instability/issues. Aids capacity planning, measures failover success rate during testing and analysis. Non-zero count expected in HA clusters.
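
As a rough illustration of how fail count patterns could be evaluated, the following Python sketch computes per-interval increases from a series of ha_cluster_pacemaker_fail_count samples and flags a window as unstable when the total increase exceeds a limit. The sample values and the limit are hypothetical, and the sketch assumes the count can drop back to zero after a resource cleanup.

```python
from typing import List, Sequence


def fail_count_increases(samples: Sequence[int]) -> List[int]:
    """Return the increase between consecutive fail count samples.

    A decrease (for example, after a resource cleanup resets the
    count) is treated as no increase.
    """
    return [max(curr - prev, 0) for prev, curr in zip(samples, samples[1:])]


def is_unstable(samples: Sequence[int], max_increase: int = 2) -> bool:
    """Flag the window when the total increase exceeds max_increase."""
    return sum(fail_count_increases(samples)) > max_increase


# Hypothetical samples collected over one observation window.
window = [0, 0, 1, 1, 3, 0, 1]
print(fail_count_increases(window))
print(is_unstable(window))
```

Looking at increases rather than the raw value keeps an occasional, expected non-zero count from being treated the same as a steadily climbing one.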

Metric: ha_cluster_corosync_quorate, ha_cluster_corosync_ring_errors
Cluster nodes lose communication
Description: Detect split-brain scenarios to prevent data corruption. ha_cluster_corosync_quorate tracks quorum status between nodes. ha_cluster_corosync_ring_errors monitors communication issues like network problems or node failures causing ring errors in cluster.	Use-case: Monitor quorum loss from network disruptions, node failures, misconfigured settings. Track ring errors indicating node communication issues. Resolve persistent ring errors promptly to prevent cluster failures or corruption. Occasional transient ring errors expected.
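
The distinction between occasional transient ring errors and persistent ones can be sketched in a few lines of Python. The example below is only an illustration: it assumes each sample of ha_cluster_corosync_ring_errors is the number of ring errors seen in one collection interval, and that ha_cluster_corosync_quorate reports 1 when the cluster has quorum and 0 otherwise. The threshold of three consecutive intervals is a hypothetical choice.

```python
from typing import Sequence


def classify_ring_errors(samples: Sequence[int], persistent_after: int = 3) -> str:
    """Classify ring error samples as healthy, transient, or persistent.

    Errors are persistent when they appear in at least
    persistent_after consecutive samples.
    """
    longest = current = 0
    for value in samples:
        current = current + 1 if value > 0 else 0
        longest = max(longest, current)
    if longest == 0:
        return "healthy"
    return "persistent" if longest >= persistent_after else "transient"


def quorum_lost(quorate_samples: Sequence[int]) -> bool:
    """Return True if any sample shows the cluster without quorum."""
    return any(value == 0 for value in quorate_samples)


print(classify_ring_errors([0, 1, 0, 0, 2, 0]))
print(classify_ring_errors([0, 1, 1, 1, 0]))
print(quorum_lost([1, 1, 0, 1]))
```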

Metric: ha_cluster_pacemaker_stonith_enabled
Cluster STONITH (Shoot The Other Node In The Head) agent status
Description: The metric is related to the STONITH (also known as fencing) mechanism in Pacemaker clusters. The fencing mechanism allows cluster to forcibly shut down or isolate misbehaving or unresponsive node.	Use-case: STONITH is recommended in production to prevent corruption on failures or network disruptions. May be disabled for maintenance or testing. It helps address split-brain issue.
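
Because STONITH may be disabled on purpose during maintenance or testing, it can help to separate expected from unexpected periods. The Python sketch below is one hypothetical way to do that: it assumes the metric reports 1 when STONITH is enabled and 0 when it is disabled, and that maintenance windows are known as (start, end) timestamp pairs. The timestamps are made-up values.

```python
from typing import Iterable, List, Sequence, Tuple


def unexpected_stonith_disabled(
    samples: Iterable[Tuple[int, int]],
    maintenance_windows: Sequence[Tuple[int, int]],
) -> List[int]:
    """Return timestamps where STONITH was disabled outside maintenance.

    samples: (timestamp, value) pairs, value 1 = enabled, 0 = disabled.
    maintenance_windows: inclusive (start, end) timestamp pairs.
    """
    def in_window(ts: int) -> bool:
        return any(start <= ts <= end for start, end in maintenance_windows)

    return [ts for ts, value in samples if value == 0 and not in_window(ts)]


# Hypothetical samples and one planned maintenance window.
samples = [(100, 1), (160, 0), (220, 0), (280, 1), (340, 0)]
windows = [(150, 230)]
print(unexpected_stonith_disabled(samples, windows))
```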

Figure-5: SAP HA metrics dashboard for SAP ASCS and ERS cluster

Figure-6: SAP HA metrics dashboard for SAP HANA database cluster

Availability metrics for monitoring SAP application and database servers

The previous scenarios focused on HA availability metrics for SAP ASCS and ERS cluster. This section covers availability metrics for SAP NetWeaver Application, HANA database and HSR. Application Insights provides metrics to monitor SAP availability to minimize downtime during planned maintenance, and unplanned disruptions.

Metrics: sap_alerts_availability, sap_alerts_BasisSystem, sap_alerts_Database, sap_alerts_SqlError, sap_alerts_AbortedJobs, sap_alerts_Security, sap_alerts_System, sap_alerts_LongRunners, sap_alerts_State, and sap_alerts_Shortdumps.
SAP system availability
Description: These metrics are used to monitor the availability of the SAP system, its components, and services such as SAP application servers, the SAP Primary Application Server, and the SAP database.	Use-case: With these metrics, you can identify availability impact to SAP components, system errors, short dumps, audit and security messages, failed background jobs, and client-side errors.

Metric: sap_alerts_Database
SAP HANA database
Description: In an HA environment with SAP HANA primary and standby databases deployed across multiple AZs, the sap_alerts_Database metric can help detect when a failover or failback event occurs.	Use-case: The sap_alerts_Database metric is used to ensure that the SAP system is properly configured to connect to the new database instance after a failover. Problems with the network connection or database server availability can cause SQL errors when the SAP system attempts to execute database queries or transactions.

Metric: sap_alerts_SqlError
SAP HANA database
Description: The sap_alerts_SqlError metric is related to SQL errors encountered by the SAP system when interacting with the database.	Use-case: Persistent SQL errors can impact the availability of the SAP application, making it crucial to monitor sap_alerts_SqlError metric and set appropriate alerts to ensure timely response and resolution.

Figure-7: SAP NetWeaver Availability metrics

Metric: sap_start_service_processes
SAP processes availability
Description: The metric monitors the status and availability of critical background processes and services required for the proper functioning of an SAP system. By monitoring the status of essential SAP services such as the message server, gwrd (Gateway read process), enqueue server, and icman (Internet Communication Manager), administrators can quickly identify whether any of these processes are not running or have encountered issues.	Use-case: Use this metric to detect potential problems before they escalate into major outages or system failures. If a critical service process is not running, it can be an early indicator of underlying issues that need to be addressed.

Figure-8: SAP Process availability metrics

Metric: sap_enqueue_server_replication_state, sap_enqueue_server_locks_max
SAP Enqueue server replication and locks
Description: In an HA deployment, where the enqueue locks are replicated across multiple nodes, monitoring sap_enqueue_server_replication_state ensures that enqueue replication is functioning correctly, allowing for seamless failover and failback operations. Monitoring the sap_enqueue_server_locks_max metric helps you track the number of active enqueue locks. When the number of enqueue locks approaches or exceeds the maximum limit, it could indicate that the system is running out of enqueue resources. This can lead to performance degradation or even system outages, as new enqueue requests may be denied or blocked.	Use-case: If you encounter issues related to enqueue replication, such as data inconsistencies or failures during failover or failback, monitoring the sap_enqueue_server_replication_state metric can help identify the root cause of the problem. The sap_enqueue_server_locks_now metric can help detect a high number of locks and indicate contention for resources, which could lead to performance issues in the SAP system.
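
To illustrate the idea of locks approaching the maximum limit, here is a small Python sketch that expresses the current lock count as a percentage of the limit. It assumes sap_enqueue_server_locks_now is the current number of locks and sap_enqueue_server_locks_max is the configured limit. The sample values and the 80 percent warning level are hypothetical.

```python
def lock_utilization_pct(locks_now: float, locks_max: float) -> float:
    """Return enqueue lock usage as a percentage of the maximum."""
    if locks_max <= 0:
        raise ValueError("locks_max must be positive")
    return 100.0 * locks_now / locks_max


def lock_pressure(locks_now: float, locks_max: float,
                  warn_pct: float = 80.0) -> bool:
    """Return True when usage reaches the warning percentage."""
    return lock_utilization_pct(locks_now, locks_max) >= warn_pct


# Hypothetical metric values.
print(lock_utilization_pct(450000, 500000))
print(lock_pressure(450000, 500000))
```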

Figure-9: SAP Enqueue Server replication and locks

Metric: hanadb_hsr_replication_status
SAP HANA System Replication
Description: The metric is important for ensuring the HA and DR capabilities of the SAP HANA system. It provides visibility into the current state of the replication process, allowing administrators to take appropriate actions based on the reported status.	Use-case: If the metric shows an “Error” state, administrators may need to investigate and resolve any issues with the replication process to ensure that the secondary systems have an up-to-date copy of the data and are ready for failover or DR operations. If the metric indicates an “Inactive” state during a period when replication should be active, it may signal a potential issue that needs to be addressed.

Metric: hanadb_hsr_log_shipping_delay_seconds
SAP HANA System Replication
Description: The metric measures the delay or latency in the log shipping process. It represents the time difference, in seconds, between when a transaction log is written on the primary system and when it is successfully applied or replayed on the secondary system(s).	Use-case: In the event of a failover, the secondary system must have the most recent data to ensure a seamless transition, minimize data loss and minimize downtime. Monitoring the log shipping delay using the metric helps ensure that the secondary systems are ready for failover with minimal data loss.
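
As a rough illustration, the log shipping delay can be compared against a recovery point objective (RPO). The Python sketch below summarizes a set of hanadb_hsr_log_shipping_delay_seconds samples and checks the worst observed delay against a target. The samples and the 30-second target are hypothetical, and the delay is only an approximate proxy for potential data loss.

```python
from statistics import mean
from typing import Sequence


def summarize_delay(samples: Sequence[float]) -> dict:
    """Summarize log shipping delay samples, in seconds."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {"max": max(samples), "mean": mean(samples)}


def rpo_at_risk(samples: Sequence[float], rpo_seconds: float) -> bool:
    """Return True if the worst observed delay exceeds the RPO target."""
    return summarize_delay(samples)["max"] > rpo_seconds


# Hypothetical delay samples with one spike.
delays = [2, 3, 2, 45, 4]
print(summarize_delay(delays))
print(rpo_at_risk(delays, rpo_seconds=30))
```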

Metric: hanadb_hsr_secondary_active_status
SAP HANA System Replication
Description: The metric provides information about the current state or activity of the secondary system(s) in the HSR setup.	Use-case: If the metric shows an “Error” state for a secondary system, administrators may need to investigate and resolve any issues to ensure that the secondary system is ready for failover or DR operations. If the metric indicates an “Inactive” state during a period when replication should be active, it may signal a potential issue that needs to be addressed.

Metric: hanadb_hsr_secondary_failover_count
SAP HANA System Replication
Description: The metric keeps track of the number of times a failover has occurred from the primary system to a specific secondary system. This metric is typically associated with each individual secondary system in the HSR setup.	Use-case: A secondary system that has experienced multiple failovers may require additional attention or maintenance to ensure its readiness for future failover events. Monitoring the failover count can help administrators prioritize maintenance activities or identify potential issues with a specific secondary system.

Metric: hanadb_hsr_secondary_reconnect_count
SAP HANA System Replication
Description: The metric keeps track of the number of times a specific secondary system has successfully reconnected to the primary system after a disconnection event. This metric is typically associated with each individual secondary system in the HSR setup.	Use-case: If a secondary system has an unexpectedly high number of reconnections, it may indicate underlying issues that need to be investigated and resolved. The metric reconnect count can be correlated with other metrics, logs, and network monitoring data to identify potential root causes and improve the overall reliability of the HSR setup.

Figure-10: SAP HSR Metric dashboard for SAP HANA SYSTEMDB and TENANTDB

Metric: hanadb_webdispatcher_service_started_status
SAP Web Dispatcher
Description: The metric monitors status of Web Dispatcher service deployed in two AZs in a HA deployment. Web Dispatcher service is required for client requests to be routed to SAP HANA database and application servers.	Use-case: If client requests are resulting in errors, the metric status will identify if Web Dispatcher service is offline.

Figure-11: SAP Web Dispatcher status

Availability metrics for AWS infrastructure

AWS provides a comprehensive set of services that enable organizations to run and operate SAP workloads with HA in the cloud. Amazon EC2, Amazon Elastic Block Store (Amazon EBS), and Amazon Elastic File System (Amazon EFS) deliver highly available compute and storage resources, while SAP HANA and other supported databases are hosted on EC2 high-memory instances. Application Insights for SAP collects comprehensive infrastructure metrics, logs, and events from SAP systems, enabling performance monitoring and alerting. Below is a list of critical metrics related to AWS infrastructure.

Metric: StatusCheckFailed_System, StatusCheckFailed_Instance, StatusCheckFailed_AttachedEBS
EC2 and EBS
Description: Amazon EC2 provides three types of status checks: system status checks metric StatusCheckFailed_System, instance status checks metric StatusCheckFailed_Instance, and attached EBS status check metric StatusCheckFailed_AttachedEBS. These checks help identify issues with the underlying AWS infrastructure or the instance itself and attached volumes.	Use-case: If there are issues with SAP cluster nodes or SAP application servers, check the Amazon EC2 instance status checks. If there are issues with SAP HANA database, in addition to instance status checks, check the status of attached EBS volumes.
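
If you wanted to define alarms on these status checks yourself, one possible shape is sketched below in Python. The function only builds parameter dictionaries in the form accepted by the CloudWatch PutMetricAlarm API and does not call AWS. The instance ID, SNS topic ARN, alarm names, period, and evaluation settings are hypothetical choices, not values taken from Application Insights.

```python
from typing import List

STATUS_CHECK_METRICS = (
    "StatusCheckFailed_System",
    "StatusCheckFailed_Instance",
    "StatusCheckFailed_AttachedEBS",
)


def build_status_check_alarms(instance_id: str, sns_topic_arn: str) -> List[dict]:
    """Build one alarm definition per EC2 status check metric."""
    alarms = []
    for metric_name in STATUS_CHECK_METRICS:
        alarms.append({
            "AlarmName": f"{instance_id}-{metric_name}",
            "Namespace": "AWS/EC2",
            "MetricName": metric_name,
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Maximum",
            "Period": 60,                # seconds; hypothetical choice
            "EvaluationPeriods": 2,      # hypothetical choice
            "Threshold": 1,
            "ComparisonOperator": "GreaterThanOrEqualToThreshold",
            "AlarmActions": [sns_topic_arn],
        })
    return alarms


# Hypothetical identifiers, for illustration only.
alarms = build_status_check_alarms(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:111122223333:sap-ha-alerts",
)
for alarm in alarms:
    print(alarm["AlarmName"])
```

With boto3, each dictionary could be passed to a CloudWatch client as put_metric_alarm(**alarm).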

Figure-12: Amazon EC2 Status checks and Alarms

Figure-13: Amazon EC2 metrics – Status Checks for Instance, System and EBS volume

Metric: VolumeStalledIOCheck, VolumeQueueLength, VolumeReadOps, VolumeWriteOps
CloudWatch EBS metrics
Description: The VolumeStalledIOCheck metric is an Amazon EBS volume status check that indicates whether the volume is experiencing stuck or stalled I/O operations. The VolumeQueueLength metric can indicate potential performance bottlenecks or I/O contention for the EBS volume. Monitor Amazon EBS volume performance for reads using VolumeReadOps, and writes using VolumeWriteOps.	Use-case: If the queue length remains high for an extended period, it can lead to increased latency and potentially impact the performance and availability of the SAP workloads relying on that volume. A stalled EBS volume can lead to database failure and result in database cluster failover action.
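
VolumeReadOps and VolumeWriteOps report the total number of operations in a period, so dividing their Sum by the period length gives an average IOPS figure. The Python sketch below shows that arithmetic and compares the result with a provisioned IOPS value. All numbers are hypothetical.

```python
def average_iops(read_ops_sum: float, write_ops_sum: float,
                 period_seconds: float) -> float:
    """Average IOPS over a period, from summed read and write operations."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return (read_ops_sum + write_ops_sum) / period_seconds


def iops_utilization_pct(avg_iops: float, provisioned_iops: float) -> float:
    """Average IOPS as a percentage of the provisioned IOPS."""
    if provisioned_iops <= 0:
        raise ValueError("provisioned_iops must be positive")
    return 100.0 * avg_iops / provisioned_iops


# Hypothetical one-minute datapoints for a volume provisioned at 4000 IOPS.
iops = average_iops(read_ops_sum=90000, write_ops_sum=30000, period_seconds=60)
print(iops)
print(iops_utilization_pct(iops, provisioned_iops=4000))
```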

Figure-14: Amazon EBS – CloudWatch metric

Metric: VolumeAvgReadLatency, VolumeAvgWriteLatency, VolumeIOPSExceededCheck, VolumeThroughputExceededCheck
EBS metrics
Description: VolumeIOPSExceededCheck and VolumeThroughputExceededCheck monitor whether the driven IOPS or throughput exceeds the provisioned performance of your Amazon EBS volume. VolumeAvgReadLatency and VolumeAvgWriteLatency provide insight into the average latency of the I/O being driven on your EBS volumes.	Use-case: VolumeIOPSExceededCheck and VolumeThroughputExceededCheck help you identify and respond to latency issues stemming from under-provisioned EBS volumes that may impact the performance of your applications. VolumeAvgReadLatency and VolumeAvgWriteLatency help you identify performance bottlenecks so that your applications stay resilient to performance impacts.

Figure-15: Amazon EBS – CloudWatch metric

Metric: DataWriteIOBytes, DataReadIOBytes, PercentageOfPermittedThroughputUtilization
EFS metrics
Description: DataWriteIOBytes and DataReadIOBytes track the amount of data written to and read from the file system, respectively. Monitoring these metrics can help identify potential performance bottlenecks or spikes in I/O activity that could impact the availability of the file system. The PercentageOfPermittedThroughputUtilization metric monitors the NFS throughput utilization.	Use-case: An SAP application or database server that hangs when attempting to list, write, or read files from NFS hosted on EFS could signal potential performance and availability issues. A drop in the data write/read metrics could also indicate an availability issue. If there are performance issues when writing to NFS, check the throughput utilization metric and take action to increase throughput to accommodate peak workload.

Figure-16: Amazon EFS – CloudWatch metrics

To learn more about AWS infrastructure metrics for respective services, please visit Amazon EC2 CloudWatch Application Insights metrics, Amazon EBS CloudWatch Application Insights metrics, Amazon EFS CloudWatch Application Insights metrics, and Amazon FSx CloudWatch Application Insights metrics.

To learn about Application Insights recommended metrics, please visit this page.

SAP problem summary dashboard as starting point of root cause analysis

The Application Insights problem summary dashboard provides a centralized view of CloudWatch metrics and monitoring data from your SAP systems in HA Deployment. It consolidates relevant metrics, logs, and data from AWS Infrastructure, HA Cluster, SAP NetWeaver, and SAP HANA database into a unified interface. This enables you to quickly identify and diagnose issues across your entire SAP system and AWS infrastructure.

The dashboard leverages machine learning capabilities to analyze aggregated metrics and logs, automatically detecting anomalies, patterns, and potential problems. This proactive approach helps you stay ahead of issues before they escalate, allowing timely remedial actions to maintain optimal application performance.

Application Insights for SAP, powered by SageMaker, uses classification algorithms and built-in rules to detect application issues by analyzing metric anomalies, logs, and traces. The problem summary dashboard provides contextual information to identify ongoing issues with your SAP system. The enhanced visibility into SAP application health helps reduce Mean Time To Repair (MTTR) for troubleshooting. Application Insights for SAP is pre-configured to ingest logs and traces from HA cluster nodes, SAP application servers, and HANA database servers to correlate events and identify ongoing events impacting SAP availability.

Figure-17: Problem Summary Dashboard features

The Problem Summary Dashboard provides a comprehensive overview of the issues and problems detected within the SAP environment running on AWS.

Prioritizes detected issues based on severity, allowing administrators to quickly identify and resolve the most critical problems.
Provides details on each issue, including the problem description, affected resources, metrics, and logs, which aids in understanding the problem’s nature and scope for efficient troubleshooting.
Provides analysis of and insights into each issue’s potential impact on the SAP environment, SAP components, business processes, and user groups.
Displays historical trends and patterns of detected issues, helping identify recurring problems or correlations with environmental changes.
May offer recommended remediation steps and best practices for specific issues, helping administrators efficiently mitigate and resolve problems.
Generates alerts/notifications for new issues or existing issues reaching severity levels, promptly notifying administrators for timely response and mitigation.
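
The severity-based prioritization described in the list above can be sketched in Python as a simple sort. The problem titles and severities are the ones used in the use-cases in this post; the dictionary layout and the severity ordering are assumptions made for illustration and are not the dashboard's internal data model.

```python
from typing import Dict, List

# Assumed ordering: lower number means higher priority.
SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def prioritize(problems: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort problems by severity, then by title for a stable order."""
    return sorted(
        problems,
        key=lambda p: (SEVERITY_ORDER.get(p["severity"], len(SEVERITY_ORDER)),
                       p["title"]),
    )


problems = [
    {"title": "SAP HANA: SAP Clustering error", "severity": "Medium"},
    {"title": "SAP HANA: HANA System Replication inactive", "severity": "Medium"},
    {"title": "SAP Availability", "severity": "High"},
]
for problem in prioritize(problems):
    print(problem["severity"], "-", problem["title"])
```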

Let’s deep-dive into the three use-cases highlighted below in the Problem Summary Dashboard, related to SAP Availability, SAP Cluster, and SAP HSR. With these use-cases, you will learn how to get started with root-cause analysis, leverage machine-learning-driven insights from metrics and logs to identify the source of failure, and take action to resolve it.

Figure-18: Problem Summary Dashboard

Use-case 1: Problem Summary Dashboard has reported “SAP HANA: SAP Clustering error” problem. Select the reported problem to view details.

Figure-19: Problem Summary Dashboard – Problem details for SAP HANA database cluster error

Step-1: Problem summary section highlights key details such as the source of the issue, in this case EC2 instance id, status (Recovering), and severity (Medium).

Figure-19a: Problem Summary Dashboard – Problem details for SAP HANA database cluster error

Step-2: Insights help you troubleshoot the root cause of the problem. The recommended actions state that sapstartsrv was either stopped, killed, or not available. The CloudWatch logs shown in step-3 contain more information for troubleshooting.

Figure-19b: Problem Summary Dashboard – Problem details for SAP HANA database cluster error

Step-3: CloudWatch Logs Insights shows the relevant Pacemaker cluster logs for the affected SAP component (HANA database). The Pacemaker log indicates that one of the database cluster nodes has failed, with details such as the EC2 instance ID, error message, and timestamp of the failure.

Figure-19c: Problem Summary Dashboard – Problem details for SAP HANA database cluster error
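
If you want to search the same logs yourself, a CloudWatch Logs Insights query along the following lines could be a starting point. The Python sketch only builds the query text. The instance ID is a hypothetical placeholder, and the error and failure keywords are an assumption about what the Pacemaker log lines contain, so adjust them to your own log format.

```python
import re


def build_pacemaker_query(instance_id: str, limit: int = 20) -> str:
    """Build a Logs Insights query for error lines mentioning an instance."""
    # Accept only well-formed instance IDs so the value is safe to embed.
    if not re.fullmatch(r"i-[0-9a-f]+", instance_id):
        raise ValueError(f"unexpected instance ID format: {instance_id!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        "fields @timestamp, @message\n"
        f"| filter @message like /{instance_id}/\n"
        "| filter @message like /(?i)(error|fail)/\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


# Hypothetical instance ID, for illustration only.
print(build_pacemaker_query("i-0123456789abcdef0"))
```

The resulting text can be pasted into the Logs Insights console, or passed as the queryString of a StartQuery request together with the log group that holds your Pacemaker logs.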

Resolution: To resolve this error, the SAP BASIS administrator uses Pacemaker cluster commands to start the offline database node and perform resource actions, and monitors the cluster status metrics until the cluster error is resolved.

Use-case 2: Problem Summary Dashboard has reported “SAP HANA: HANA System Replication inactive” problem. Select the reported problem to view details.

Figure-20: Problem Summary Dashboard – Problem details for SAP HSR error

Step-1: The problem summary section highlights the source of the issue, in this case the HANA database component, along with the status (Recovering) and severity (Medium) of the issue.

Figure-20a: Problem Summary Dashboard – Problem details for SAP HSR error

Step-2: Insights outlines the steps to start the HANA database and check the status of HSR. These include starting the database with the HDB start command and running the SystemReplicationStatus.py script to check the health of HSR.

Figure-20b: Problem Summary Dashboard – Problem details for SAP HSR error
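When the replication status script is run from automation rather than by hand, its exit code is usually what gets checked. The sketch below maps the exit codes that SAP commonly documents for the script (10 to 15) to a status name; treat the mapping as an assumption to verify against the documentation for your HANA revision. The `interpret_hsr_status` helper is hypothetical.

```python
# Exit codes of the HSR status script as commonly documented by SAP;
# verify against the documentation for your HANA revision.
HSR_RETURN_CODES = {
    10: "NoHSR",
    11: "Error",
    12: "Unknown",
    13: "Initializing",
    14: "Syncing",
    15: "Active",
}


def interpret_hsr_status(return_code):
    """Map an exit code to (status, healthy); healthy means fully active."""
    status = HSR_RETURN_CODES.get(return_code, "Unrecognized")
    return status, status == "Active"


for code in (15, 14, 11):
    status, healthy = interpret_hsr_status(code)
    print(code, status, "OK" if healthy else "needs attention")
```

On a HANA host, the exit code would typically be captured by running the script as the `<sid>adm` user (for example through `HDBSettings.sh`) and reading the process return code.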

Step-3: The HSR metrics (hanadb_hsr_replication_status, hanadb_hsr_log_shipping_delay_seconds) indicate that HSR is offline (metric value = 1). Metrics outlined in red are in alarm state for the tenant database (HDB), while metrics outlined in green are not yet in alarm state for the system database (SYSTEMDB).

Figure-20c: Problem Summary Dashboard – Problem details for SAP HSR error
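The alarm behavior in Step-3 can be approximated with a simple rule over recent datapoints. This is a sketch only: it assumes, as stated above, that a value of 1 for hanadb_hsr_replication_status means HSR is offline, and it assumes a hypothetical rule of three consecutive datapoints before alarming. The sample datapoints are invented for illustration.

```python
def hsr_offline(datapoints, offline_value=1, required_consecutive=3):
    """Return True if the latest `required_consecutive` datapoints all equal
    the offline value. `datapoints` is a list of (timestamp, value) tuples."""
    if len(datapoints) < required_consecutive:
        return False
    latest = sorted(datapoints)[-required_consecutive:]
    return all(value == offline_value for _, value in latest)


# Hypothetical datapoints per database: (timestamp, metric value).
samples = {
    "HDB": [(1, 0), (2, 1), (3, 1), (4, 1)],
    "SYSTEMDB": [(1, 0), (2, 0), (3, 1), (4, 0)],
}

for database, points in samples.items():
    print(database, "ALARM" if hsr_offline(points) else "OK")
```

Requiring several consecutive offline datapoints is a design choice that avoids alarming on a single transient reading, which mirrors why one database can be in alarm while the other is not yet.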

Step-4: CloudWatch Logs Insights has detected a connection refused error in the HANA database that caused HSR to fail. The log messages contain the EC2 instance ID, a detailed error message, and the timestamp of the failure.

Figure-20d: Problem Summary Dashboard – Problem details for SAP HSR error

Resolution: The SAP BASIS administrator starts the standby database node managed by the pacemaker cluster and starts the HANA database to resolve the HSR error, then continues to monitor the HSR metrics to ensure the two databases are back in sync.

Use-case 3: The Problem Summary Dashboard has reported a “SAP Availability” problem. Select the reported problem to view details.

Figure-21: Problem Summary Dashboard – Problem details for SAP Availability error

Step-1: The Problem summary section contains the source of the issue (the SAP NetWeaver component), the status (In progress), and the severity (High).

Figure-21a: Problem Summary Dashboard – Problem details for SAP Availability error

Step-2: Insights indicates an availability issue with SAP application server instances and recommends checking SAP system health with t-codes (SM21, SM51, SM66, and CCMS) for availability issues.

Figure-21b: Problem Summary Dashboard – Problem details for SAP Availability error

Step-3: The SAP NetWeaver application availability and performance metrics (sap_alerts_Database, sap_alerts_Availability, sap_alerts_SqlError, sap_alerts_ResponseTimeDialogRFC, sap_alerts_BasisSystem, sap_start_service_process) indicate an ongoing issue with the SAP system and problems connecting to the database. The metrics outlined in red are in alarm state.

Figure-21c: Problem Summary Dashboard – Problem details for SAP Availability error

The SAP availability metrics (ha_cluster_pacemaker_fail_count, hanadb_level_4_alerts_count) indicate a HANA database node failure. The metrics outlined in red are in alarm state, and those outlined in green are not yet in alarm state.

Figure-21d: Problem Summary Dashboard – Problem details for SAP Availability error
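The red and green outlines correspond to CloudWatch alarm states, which can also be listed programmatically. The sketch below filters a response shaped like the CloudWatch DescribeAlarms output (`MetricAlarms` entries with `AlarmName`, `MetricName`, and `StateValue`). The alarm names and the sample response are hypothetical; only the metric names come from the steps above.

```python
def alarms_in_state(response, state="ALARM"):
    """Return sorted (alarm name, metric name) pairs for alarms in `state`."""
    return sorted(
        (alarm["AlarmName"], alarm.get("MetricName", ""))
        for alarm in response.get("MetricAlarms", [])
        if alarm.get("StateValue") == state
    )


# Hypothetical response for illustration only.
sample_response = {
    "MetricAlarms": [
        {"AlarmName": "sap-availability", "MetricName": "sap_alerts_Availability",
         "StateValue": "ALARM"},
        {"AlarmName": "sap-database", "MetricName": "sap_alerts_Database",
         "StateValue": "ALARM"},
        {"AlarmName": "hana-level-4-alerts", "MetricName": "hanadb_level_4_alerts_count",
         "StateValue": "OK"},
    ]
}

for name, metric in alarms_in_state(sample_response):
    print(name, metric)
```

With boto3, a comparable response could be obtained from `boto3.client("cloudwatch").describe_alarms()`, optionally passing `StateValue="ALARM"` to filter on the service side.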

Step-4: CloudWatch Logs Insights indicates that the pacemaker cluster has recorded HANA database node failure events, with details such as the EC2 instance ID, a detailed error message, and the timestamp of the failure.

Figure-21e: Problem Summary Dashboard – Problem details for SAP Availability error

In addition, the pacemaker cluster has recorded events indicating that one of the database nodes is offline, with details such as the EC2 instance ID, a detailed error message, and the timestamp of the failure.

Figure-21f: Problem Summary Dashboard – Problem details for SAP Availability error

Resolution: The SAP BASIS administrator brings up the database node managed by the pacemaker cluster and checks the health of the SAP application by monitoring the metrics.

Incident Detection and Response (IDR) for SAP

Customers can implement observability themselves, but we highly recommend AWS IDR to jumpstart their SAP observability journey. IDR for SAP proactively monitors SAP workloads 24/7, detects critical incidents, and engages customer teams to guide mitigation and recovery efforts. It helps reduce failure opportunities and accelerates recovery from critical incidents for onboarded SAP workloads. AWS IDR for SAP is available as an add-on for Enterprise Support customers, with key benefits such as:

Proactive 24/7 monitoring of your SAP workloads by AWS for SAP experts
Automated detection of critical AWS and SAP incidents
Rapid engagement with your SAP operations teams during incidents via conference calls
Guidance from AWS for SAP experts on mitigation and recovery steps
Reduced chance of failures in customer workloads
Accelerated recovery from critical incidents
Freedom for customers to focus on their business while AWS handles incident management

IDR for SAP is well suited to mission-critical workloads, complex distributed architectures, and organizations with limited cloud expertise. It also helps customers meet compliance requirements, optimize costs by avoiding downtime, and focus on core business objectives by offloading incident management to AWS experts who help ensure HA of critical cloud workloads.

To get started with AWS Incident Detection and Response (IDR) for SAP, eligible Enterprise Support customers raise a service request, provide their SAP workload details, and collaborate with the AWS IDR for SAP team on onboarding, the engagement model, runbooks, and training. Once onboarded, the AWS IDR for SAP team continuously monitors the SAP workload and engages the customer’s team for critical incidents, following the defined procedures. Ask your Technical Account Manager (TAM) to get started today!

To learn more, please visit AWS Incident Detection and Response User Guide or contact your AWS account representative.

Conclusion

Application Insights for SAP can enhance availability for your mission-critical SAP workloads in an HA deployment. In this blog, we took a deep dive into metrics and use-cases that help you proactively detect issues through robust logging and tracing. By adopting Application Insights for SAP’s comprehensive observability features and aligning with the SAP Lens of the AWS Well-Architected Framework, organizations can strive to achieve the highest levels of SAP availability and responsiveness, ensuring uninterrupted business operations.

By leveraging the problem summary dashboard in Application Insights for SAP, administrators can gain a comprehensive understanding of the issues impacting their SAP environment, prioritize their efforts, and take informed actions to maintain the availability, performance, and reliability of their SAP workloads running on AWS.

To accelerate observability and incident response for SAP workloads, enroll in IDR for SAP. With IDR for SAP, you can leverage AWS and SAP expertise for 24/7 monitoring, automated incident detection, rapid engagement, and guided mitigation to minimize downtime and ensure business continuity for critical workloads.

Credits

I would like to thank the following team members for their contributions: Derek Ewell, Spencer Martenson, Venkat Tatavarthy, Ravi Iyer, Somckit Khemmanivanh, and Balaji Krishna.

