How Hapag-Lloyd automated incident management using AWS Step Functions

This post is co-authored by Grzegorz Kaczor and Daniel Steenbock from Hapag-Lloyd AG and Michael Graumann and Daniel Moser from AWS.

Introduction

In today’s fast-paced digital landscape, efficient incident management is crucial for maintaining high-quality customer experiences. In our previous article, we discussed how the Web and Mobile department at Hapag-Lloyd established observability for serverless multi-account workloads to enhance visibility in a rapidly evolving AWS environment. We emphasized the importance of robust monitoring as a cornerstone of operational excellence. Access to logs and incident alerts is a good start, and automated incident management builds on it to lower the Mean Time To Respond (MTTR). While we currently dispatch alert notifications to Microsoft Teams to keep our developers informed, we also needed to improve collaboration with the incident management team by automatically informing them about potential incidents and providing relevant contextual data. It’s also important to actively monitor each outage, process the data logically, and then make informed decisions. To achieve this, we used AWS Step Functions and Amazon OpenSearch Service to automate and streamline our incident response process, significantly reducing our MTTR.

With a fleet of 292 modern container ships and a total transport capacity of 2.3 million TEU, Hapag-Lloyd is one of the world’s leading liner shipping companies. TEU, or Twenty-foot Equivalent Unit, is a unit of measurement used to determine cargo capacity for container ships and ports. In the Liner Shipping Segment, the company has around 13,700 employees and more than 399 offices in 139 countries. Hapag-Lloyd has a container capacity of 3.4 million TEU – including one of the largest and most modern fleets of reefer containers. A total of 113 liner services worldwide ensure fast and reliable connections between more than 600 ports on all the continents.

The company’s Web and Mobile team is a distributed team located in Hamburg and Gdańsk, and responsible for the customer channel’s web and mobile products in the company.

This blog post guides you through a solution the Web and Mobile team built to improve their incident management. The solution lowers MTTR by automatically detecting application outages, creating tickets, and gathering relevant contextual information for the respective application and incident management teams.

Solution

When we automate the creation of tickets, and keep them updated with relevant information that triggers further actions, we want to do that in an orchestrated way. That is where AWS Step Functions comes into play.

AWS Step Functions is a serverless workflow service that helps developers build and run distributed applications using visual workflows. It is a powerful tool that simplifies the coordination of complex workflows involving multiple AWS services or external applications.

Step Functions allows you to design and run workflows as a series of steps, each representing a specific task or activity. These steps can be anything from an AWS Lambda function, an AWS API call, or even an external web service invocation. The service provides a visual representation of the workflow, making it easier to understand and manage the flow of execution. Additionally, Step Functions automatically handles retries, error handling, and state management, reducing the complexity of building and maintaining distributed applications. This service is particularly useful for building serverless applications, orchestrating microservices, and automating complex business processes. By utilizing Step Functions, developers can focus on their core application logic while AWS handles the underlying infrastructure and scalability.

We set up a Step Functions workflow for the whole incident handling process. Every time an application outage or incident is detected, one workflow execution is started. In the next sections, we walk you through the steps of the workflow and the journey of a person responsible for resolving an incident.

Overview of the Step Functions workflow to handle incidents

When an application outage is detected, an execution of the workflow (the Step Functions state machine) is started. Figure 1 illustrates the workflow. An example of an alarm that would trigger the workflow is one that fires because 90% of all requests in the last 5 minutes returned an HTTP 500 error.
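As a rough illustration of an alarm like the example above, the following sketch builds the parameters for a CloudWatch metric math alarm that fires when the share of 5XX responses in a 5-minute period reaches 90%. It assumes the application is fronted by an Amazon API Gateway REST API, which the post does not state, and it counts all 5XX responses rather than only HTTP 500. The alarm name and API name are hypothetical. The dictionary follows the keyword arguments of boto3's `put_metric_alarm`; the call itself is not made here.

```python
def build_error_rate_alarm(api_name: str,
                           threshold_percent: float = 90.0,
                           period_seconds: int = 300) -> dict:
    """Return keyword arguments for CloudWatch put_metric_alarm.

    Assumption: metrics come from the AWS/ApiGateway namespace.
    """
    def metric_stat(metric_id: str, metric_name: str) -> dict:
        # Raw input metric; not alarmed on directly (ReturnData=False).
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "ApiName", "Value": api_name}],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{api_name}-5xx-rate",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": threshold_percent,
        "EvaluationPeriods": 1,
        "TreatMissingData": "notBreaching",
        "Metrics": [
            metric_stat("errors", "5XXError"),
            metric_stat("requests", "Count"),
            {
                # Metric math expression: percentage of failed requests.
                "Id": "error_rate",
                "Expression": "100 * errors / requests",
                "Label": "5XX error rate (%)",
                "ReturnData": True,
            },
        ],
    }
```

With boto3 available, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**build_error_rate_alarm("my-api"))`.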

A visual representation of the Step Functions workflow used to orchestrate the incident response process. The workflow contains 9 steps, which are described in detail in the following section of the main text.

Figure 1: Step Functions workflow to orchestrate the incident response process

The workflow then runs in a loop until the incident has been resolved, updating the corresponding ticket in every iteration. The flow is as follows:

As the first step, the workflow validates that the issue is persistent and not a false positive. This is achieved by introducing a “warmup” phase: the main flow of the workflow only starts once the alarm has been active for a minimum time.
Our Amazon CloudWatch alarms provide some basic information, such as the breached threshold and some static details. To gather detailed failure data, the workflow queries Amazon OpenSearch Service (our centralized logging system) for relevant logs, pinpointing the failure of each affected endpoint. This log information is used to enrich notifications and provide context for the outage (for example in Jira tickets or Microsoft Teams messages).
The workflow then checks if a Jira ticket has already been created for the specific incident. Each application can only have one active incident at any given time.
If no Jira ticket has been created for the incident, a new ticket is generated. This ticket contains basic details about the affected product and information gathered from OpenSearch Service regarding the outage.
Once a Jira ticket is created, a notification is sent to the relevant product teams (Figure 2). The goal is to maintain transparency with teams to ensure that they are informed about the incident and its handling.

A notification informing a product team that their application is having an outage. The notification includes information on the application, since when it is in alarm, the ticket ID, and a button to open the ticket.

Figure 2: Notification for product teams that their application has an outage.

If a Jira ticket already exists, it indicates that the workflow is in its second or subsequent iteration. The existing ticket is updated with a comment, which includes the latest status of the incident and updated information from OpenSearch Service.
We maintain an internal status page that tracks the availability of our applications. If the affected application’s status needs to be updated, the workflow will update the status page accordingly, which allows interested stakeholders to track the incident.
At the end of each iteration, the workflow checks if the Jira ticket has already been closed. If the ticket is open, the workflow waits for a predefined time before repeating the process. This loop continues until the issue is resolved and the ticket is closed.
If the ticket is closed, the workflow halts further execution.
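The steps above can be sketched as an Amazon States Language definition. The following is a minimal illustration, not the team's actual state machine: the state names, Lambda function names, wait durations, and JSONPath fields such as `$.alarmActive` and `$.ticket.closed` are all assumptions made for the example. It uses only standard state types (Task, Choice, Wait, Succeed) and includes a small helper that checks every transition points at a defined state.

```python
import json

def task(function_name: str, next_state: str) -> dict:
    # Hypothetical Lambda ARN; REGION and ACCOUNT_ID are placeholders.
    return {
        "Type": "Task",
        "Resource": f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}",
        "Next": next_state,
    }

def choice(variable: str, if_true: str, otherwise: str) -> dict:
    # Branch on a boolean field in the state input.
    return {
        "Type": "Choice",
        "Choices": [{"Variable": variable, "BooleanEquals": True, "Next": if_true}],
        "Default": otherwise,
    }

def build_definition(warmup_seconds: int = 300, poll_seconds: int = 600) -> dict:
    return {
        "Comment": "Sketch of an incident handling loop (illustrative only)",
        "StartAt": "Warmup",
        "States": {
            # Warmup phase: wait, then confirm the alarm is still active.
            "Warmup": {"Type": "Wait", "Seconds": warmup_seconds, "Next": "CheckAlarm"},
            "CheckAlarm": task("check-alarm", "IsPersistent"),
            "IsPersistent": choice("$.alarmActive", "QueryLogs", "FalsePositive"),
            "FalsePositive": {"Type": "Succeed"},
            # Main loop: enrich, create or update the ticket, update status page.
            "QueryLogs": task("query-opensearch", "FindTicket"),
            "FindTicket": task("find-jira-ticket", "TicketExists"),
            "TicketExists": choice("$.ticket.exists", "UpdateTicket", "CreateTicket"),
            "CreateTicket": task("create-jira-ticket", "NotifyTeam"),
            "NotifyTeam": task("notify-product-team", "UpdateStatusPage"),
            "UpdateTicket": task("comment-on-jira-ticket", "UpdateStatusPage"),
            "UpdateStatusPage": task("update-status-page", "CheckTicketClosed"),
            "CheckTicketClosed": task("get-jira-ticket-status", "IsTicketClosed"),
            "IsTicketClosed": choice("$.ticket.closed", "Done", "WaitBeforeNextIteration"),
            "WaitBeforeNextIteration": {
                "Type": "Wait", "Seconds": poll_seconds, "Next": "QueryLogs",
            },
            "Done": {"Type": "Succeed"},
        },
    }

def referenced_states(definition: dict) -> set:
    """Collect every state name that StartAt, Next, Default, or a Choice rule targets."""
    targets = {definition["StartAt"]}
    for state in definition["States"].values():
        targets.update(state[key] for key in ("Next", "Default") if key in state)
        targets.update(rule["Next"] for rule in state.get("Choices", []))
    return targets

if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Modelling the loop with a Wait state that transitions back to the log query keeps one execution alive per outage, which matches the behaviour described above; the actual intervals used by the team are not given in the post.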

User journey

As described in the previous section, when an outage is detected, a ticket is created and the responsible teams are notified. The first point of contact is usually the incident management team, who assess the reported incident by reviewing the application’s logs, correlating the incident with other potentially related reports, and following the investigation and resolution steps documented by the application teams. Where the incident management team is not able to resolve the issue on their own, they escalate to the application team, providing all the information they have already collected to support a quick resolution.

Figure 3 shows a ticket that was generated, containing essential information along with appropriate labels assigned to facilitate the identification of responsible individuals.

A Jira ticket which has been created because of a detected outage. The ticket shows essential metadata such as the responsible product team, the application, incident source, impact, and urgency.

Figure 3: Created Jira ticket for an outage
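To make the ticket step more concrete, the sketch below shows how a workflow task might look up an existing open ticket and build the payload for a new one. It is an assumption-laden example: the project key, the `Incident` issue type, and the `app-`/`team-` label scheme are hypothetical and not taken from the post. The payload follows the general shape accepted by the Jira REST API v2 issue creation endpoint (`POST /rest/api/2/issue`), and the lookup uses a JQL string.

```python
def to_label(prefix: str, value: str) -> str:
    # Jira labels cannot contain spaces, so normalise them to hyphens.
    return f"{prefix}-{value.strip().lower().replace(' ', '-')}"

def open_incident_jql(project_key: str, application: str) -> str:
    """JQL to find the open incident ticket for an application, if any."""
    label = to_label("app", application)
    return (f'project = "{project_key}" AND labels = "{label}" '
            f'AND statusCategory != Done')

def build_issue_payload(project_key: str, application: str, team: str,
                        affected_endpoints: list) -> dict:
    """Build the request body for creating an incident ticket."""
    description_lines = [
        f"Automated outage report for {application}.",
        "Affected endpoints:",
    ]
    description_lines += [f"* {endpoint}" for endpoint in affected_endpoints]
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": "Incident"},  # hypothetical issue type
            "summary": f"Outage detected: {application}",
            "description": "\n".join(description_lines),
            # Labels help identify the responsible team and application.
            "labels": [to_label("app", application), to_label("team", team)],
        }
    }
```

Because each application can only have one active incident at a time, a task would typically run the JQL search first and only create a ticket when the search returns no results.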

The automation also utilizes the comment section to provide more detailed information about the affected endpoints as well as quick links to relevant logs and dashboards (Figure 4).

Comment section of a Jira ticket providing detailed information on affected endpoints and quick links to logs and dashboards.

Figure 4: Ticket enriched with detailed information on affected endpoints in the comment section
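The per-endpoint details in such a comment could be derived from a terms aggregation over recent failed requests. The following sketch builds an OpenSearch query body and summarises the aggregation result. The field names (`application`, `endpoint`, `status`, `@timestamp`) and the 15-minute window are assumptions about the log schema, which the post does not describe; sending the query to a cluster is left out.

```python
def build_failure_query(application: str, window_minutes: int = 15,
                        max_endpoints: int = 20) -> dict:
    """Query body that counts recent 5xx responses per endpoint."""
    return {
        "size": 0,  # only the aggregation is needed, not individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"application": application}},
                    {"range": {"status": {"gte": 500}}},
                    {"range": {"@timestamp": {"gte": f"now-{window_minutes}m"}}},
                ]
            }
        },
        "aggs": {
            "by_endpoint": {
                "terms": {"field": "endpoint", "size": max_endpoints}
            }
        },
    }

def summarize_failures(response: dict) -> list:
    """Turn the terms aggregation buckets into comment lines."""
    buckets = response["aggregations"]["by_endpoint"]["buckets"]
    return [f"{b['key']}: {b['doc_count']} failed requests" for b in buckets]
```

For example, a response whose buckets contain `{"key": "/bookings", "doc_count": 42}` would yield the line `/bookings: 42 failed requests`.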

A single workflow execution corresponds to a single application outage and remains active until the ticket is closed. Figure 5 shows how the workflow updated the ticket comments with the latest status, including two new affected endpoints, which indicates that the outage is now impacting other parts of the application as well.

Updated comments section with two new affected endpoints, indicating that the outage is now impacting other parts of the application

Figure 5: Updated comments section

The workflow periodically checks the alarm state. When it detects that the application is functioning properly again, this information is included in the comments as well (Figure 6).

Updated comments section, showing that the outage has been resolved and the application is operational again

Figure 6: Updated comments section after successful resolution
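Ticket updates like those in Figures 5 and 6 amount to comparing the current iteration with the previous one. The sketch below is one possible way to do that, assuming the workflow carries the previously reported endpoints and the current alarm state in its execution state; the function name and the comment wording are hypothetical.

```python
def build_status_comment(previous_endpoints, current_endpoints,
                         alarm_active: bool) -> str:
    """Compose the comment text for one loop iteration."""
    if not alarm_active:
        # Alarm has cleared: report recovery instead of endpoint details.
        return "The alarm has returned to OK. The application is operational again."

    current = sorted(set(current_endpoints))
    # Endpoints that were not part of the previous update.
    newly_affected = sorted(set(current_endpoints) - set(previous_endpoints))

    lines = ["The outage is ongoing.",
             "Affected endpoints: " + ", ".join(current)]
    if newly_affected:
        lines.append("Newly affected since the last update: "
                     + ", ".join(newly_affected))
    return "\n".join(lines)
```

Highlighting only the newly affected endpoints keeps repeated comments short while still showing when an outage is spreading to other parts of the application.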

Note that we include URLs for both access and application logs in the ticket. This enables the responsible parties to seamlessly navigate from the ticket to the actual logs, reducing the time it takes to identify the cause of an outage.

Once notified, our incident management team uses logs and dashboards in OpenSearch Service (Figure 7) to investigate the outage further. If they aren’t able to resolve the issue themselves using the available runbooks, they pass their findings on to the application team.

OpenSearch Service dashboard showing the overall application health. The dashboard shows metrics like failure rates, HTTP response code distribution, and latency.

Figure 7: OpenSearch Service dashboard providing an overall view of application health.

While the incident management or application team is working on resolving the issue, the automation through Step Functions makes sure that the latest state is always tracked in the ticket. This provides a structured approach and also helps to improve runbooks after the event.

Conclusion

In this post, we demonstrated how Hapag-Lloyd’s Web and Mobile team automated their incident management process using AWS Step Functions and Amazon OpenSearch Service. We showed how AWS Step Functions orchestrates the different steps, from ticket creation and updates, through alarm checks, to root cause analysis and ticket closure. This approach not only improves response times but also provides richer context for faster problem resolution, ultimately leading to a lower Mean Time To Respond (MTTR).

To learn more about implementing similar solutions, explore the AWS Step Functions and Amazon OpenSearch Service documentation. Consider how you might adapt this approach to your own incident management workflows or extend it to other operational processes.


AWS Cloud Operations Blog
