AWS Cloud Operations Blog

Streamlining the Correction of Errors process using Amazon Bedrock

Generative AI can streamline the Correction of Errors process. By combining large language models with the Correction of Errors process, businesses can expedite the identification and documentation of the cause of errors while saving time and resources.

Purpose and set-up

The purpose of this blog is to showcase where generative AI can provide the greatest impact on the Correction of Errors (CoE) process. It will not create a fully automated CoE application. Generative AI, and the applications that interact with it, can use tools to automate data gathering. Although this blog will not discuss how to create these mechanisms, it will call out the possibilities.

As we look at the anatomy of the Correction of Errors document, generative AI can streamline the creation of many of its sections. For the purpose of this blog, we will rely on human input to provide the facts of the event. We will use those facts, and the general knowledge of a large language model (LLM), to generate the green highlighted sections in the diagram shown in Figure 1.

The sections in green, created by generative AI, include:

  • Impact Statement
  • 5 Whys
  • Action Items
  • Summary

Although automation could be created to gather the data for the Metrics, Incident Questions, and Related Items sections, generative AI did not meaningfully improve the current processes. Therefore, the sections in blue were not included in this blog.

This is a diagram of a Correction of Errors process flow chart. It shows that the Timeline and Facts sections are created by human input; the 5 Whys, Impact, Action Items, and Summary sections are created by generative AI; and the Metrics, Incident Questions, and Related Items sections are created through the standard CoE process.

Figure 1: Correction of Errors Process Flow Chart

For the purpose of this blog, we created a variable called “facts”. This variable will be populated from human input and will contain a general description of what happened. We will leverage generative AI to turn the “facts” into the first draft of your CoE.

Generative AI introduction

Generative AI uses large language models (LLMs) to process and respond to natural language. This means that the LLM can understand, and reply in, the conversational style humans use. We will not go into the science of LLMs, but we will touch on how to use them more effectively. The diagram in Figure 2 shows an example process flow of a generative AI application. The user provides input to the generative AI application. The application captures the human input, packages it with instructions for the LLM to create a prompt, and sends the prompt to the LLM. The LLM sends a response back to the application. The application formats the response and sends it to the user.

This is a screenshot of the application process flow. The user makes a request to the generative AI application. The application calls the LLM. The LLM sends a response to the generative AI application. The generative AI application sends a response to the user.

Figure 2: Application process flow

There are many techniques to improve the response from the LLM. “Prompt engineering” is an important part of effectively using generative AI. We will show examples, but the topic is too broad to address fully within the constraints of this blog.

Prerequisites

To understand the steps taken, we should first explain the technical environment we are using. We are using Jupyter notebooks, set up on a laptop, to create our prompts. (The figure images in the sections that follow are taken from those Jupyter notebooks.) The Jupyter notebook makes API calls, using Boto3, to Amazon Bedrock to invoke Anthropic Claude 3 Sonnet, the foundation model we used. Amazon Bedrock is a fully managed service that provides a single API to access and use various high-performing foundation models (FMs) from leading AI companies. It offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI practices.
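As a point of reference, a minimal sketch of the notebook setup might look like the following; the Region, the client variable name (br), and the import list are assumptions based on the API calls shown later in this blog.

```python
import json
import boto3

# Bedrock Runtime client used for the invoke_model calls shown later;
# the Region is an example and may differ in your account
br = boto3.client("bedrock-runtime", region_name="us-east-1")

# Foundation model used throughout this blog
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
```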

Now that we understand the technologies being used, let’s start creating the Correction of Errors document. To get started, we need the facts from the incident.

Inputs – the facts

The variable “facts” has been set to the simulated human input (Figure 3). In a real-world application, you would use the variable to capture the input directly from the user.

This is a screenshot of code setting a variable called facts to the following. Over 10,000 files that were processed were not successfully transformed. The customers received a message that their uploads were successful, but the data was never reflected in the application. The event started at 9:38:18 am (GMT-5) and was resolved at 11:38:24 (GMT-5)

Figure 3: Facts of the incident
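A minimal sketch of the cell shown in Figure 3 follows; in the notebook we hard-code the string, while a real application would capture it from the user.

```python
# Simulated human input describing what happened during the incident
facts = (
    "Over 10,000 files that were processed were not successfully transformed. "
    "The customers received a message that their uploads were successful, but the "
    "data was never reflected in the application. The event started at 9:38:18 am "
    "(GMT-5) and was resolved at 11:38:24 (GMT-5)."
)
```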

Timeline

The timeline of events will also be provided by human input. To simulate this, the variable “timeline” has been set to the simulated human input. In a real-world application, you would use the variable to capture the input directly from the user.

This is a screenshot of code setting a variable called timeline to the details of a series of events. The events start with an application push to production and an increase in Transformer Lambda errors, which cause call center customer complaints to surge and the engineering team to be notified. The engineers search all logs, notice the error rate in the Transformer Lambda logs, and create a patch that is successfully tested and then deployed. The last event states that system recovery is complete. Each event has a timestamp with the exact date and time it occurred.

Figure 4: Timeline of the incident
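A sketch of the cell shown in Figure 4 follows; in the actual notebook each event carries an exact date and time stamp, which we omit here rather than invent.

```python
# Simulated human input listing the incident events in order; the exact
# timestamps shown in Figure 4 are omitted from this sketch
timeline = (
    "Application push to production. "
    "Transformer Lambda error rate increases. "
    "Call center customer complaints surge and the engineering team is notified. "
    "Engineers search all logs and notice the error rate in the Transformer Lambda logs. "
    "Engineers create a patch, test it successfully, and deploy it. "
    "System recovery is complete."
)
```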

General Correction of Errors document prompt build up

In order to interact with the LLM as effectively as possible, we need to add instructions to the prompt. There are some general instructions that apply to many prompts. For those, we created three variables called “task_context”, “task_tone”, and “task_rules”. We set the “task_context” variable to “You will be acting as an IT Executive.” We set the “task_tone” variable to “You should maintain a professional tone.” We set the “task_rules” variable to “You must avoid blaming. You must avoid punishing. Your answer will be part of the incident report.” We also set a “background_data” variable that contains the timeline of the incident, so the LLM has the event context in every prompt. We will use these variables in our prompts to the LLM. The following screenshot (Figure 5) shows the prompt engineering build up of these general instruction variables.

This is a screenshot of prompt engineering statements being set to variables to be used later in the API calls to Amazon Bedrock. The prompts build up included: task_context, task_tone, background_data (to include the timeline of the incident) and the task_rules.

Figure 5: Prompt engineering general prompt build up
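A sketch of the general prompt build up shown in Figure 5 follows; the instruction strings are quoted above, while the exact wording that wraps the timeline into “background_data” is an assumption.

```python
# General instructions reused in every section prompt
task_context = "You will be acting as an IT Executive. "
task_tone = "You should maintain a professional tone. "
task_rules = (
    "You must avoid blaming. You must avoid punishing. "
    "Your answer will be part of the incident report. "
)

# Background data gives the LLM the timeline of the incident; wrapping it in
# XML-style tags mirrors the <facts> convention used later (an assumption)
background_data = f"Here is the incident timeline: <timeline>{timeline}</timeline> "
```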

Impact section specific task rules and prompt

We added prompt engineering instructions specific to the Impact section. We created a variable “impact_task_rules” and set it to “Your answer should be concise. Your answer should be a paragraph. Your answer should include the impact analysis alone. Your answer should not have more than 200 words.” The last two lines of the screenshot consolidate the general and Impact-section-specific prompt engineering statements into one variable, “impact_prompt”, which will be used in the API call to the LLM.

This is a screenshot of prompt engineering statements, specific to the Impact section prompt, being set to variables to be used in API calls to Amazon Bedrock. It includes the impact_task_rules as stated in the first paragraph of this section and is followed by the impact_prompt. These lines read as: impact_task_request=f”Create a business impact analysis for this <facts>{facts}</facts>”

Figure 6: Impact section specific task rules and prompt
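A sketch of the Impact section prompt build up from Figure 6 follows; the concatenation order is an assumption modeled on the 5 Whys consolidation shown later in this blog.

```python
# Rules that apply only to the Impact section
impact_task_rules = (
    "Your answer should be concise. Your answer should be a paragraph. "
    "Your answer should include the impact analysis alone. "
    "Your answer should not have more than 200 words. "
)

# Section-specific request, wrapping the human-provided facts in XML-style tags
impact_task_request = f"Create a business impact analysis for this <facts>{facts}</facts>"

# Consolidate the general and section-specific instructions into one prompt
impact_prompt = (
    task_context + task_tone + background_data + task_rules
    + impact_task_rules + impact_task_request
)
```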

In Figure 7, variables “max_tokens”, “temperature”, “model_id”, and “impact_body” are set to the configurations for the LLM. In the API request body, the “text” field is set to the variable “impact_prompt” that we defined in the previous paragraph.

This is a screenshot of an API request body, specific to the Impact section prompt, sent to Amazon Bedrock. The max_tokens = 2000, temperature=1, the model_id= “anthropic.claude-3-sonnet-20240229-v1:0” and the impact_body references the aforementioned model_id, max_tokens, and temperature, with the addition of “messages”: [{ “role”: “user”, “content”: [{ “type”: “text”, “text”: impact_prompt }] }] }

Figure 7: Impact section API request body
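Based on the values listed in Figure 7, the request body might be built as follows; note that the anthropic_version field is required by the Anthropic Messages format on Amazon Bedrock, even though it is not called out in the figure text.

```python
# Inference configuration for the model
max_tokens = 2000
temperature = 1
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

# Request body in the Anthropic Messages format expected by Amazon Bedrock
impact_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": max_tokens,
    "temperature": temperature,
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": impact_prompt}]}
    ],
}
```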

Now we are ready to make an API call to Amazon Bedrock. The API code below instructs the LLM to generate the output for the Impact section.

This is a screenshot of the API call, specific to the Impact section prompt, to Amazon Bedrock. The code is as follows: #Impact Generative AI API Call impact_response=br.invoke_model(body=json.dumps(impact_body),modelId=model_id) impact_result = json.loads(impact_response.get(“body”).read()) print(impact_result[‘content’][0][‘text’])

Figure 8: Impact section API call to Amazon Bedrock
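For readability, here is the code from Figure 8 as text; br is the Bedrock Runtime client created in the notebook setup.

```python
# Impact Generative AI API Call
impact_response = br.invoke_model(body=json.dumps(impact_body), modelId=model_id)
impact_result = json.loads(impact_response.get("body").read())
print(impact_result["content"][0]["text"])
```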

Using the API call, the following output was created.

The incident resulted in a significant business impact. Approximately 10,000 customer files were not successfully processed, despite the customers receiving confirmation messages indicating successful uploads. This discrepancy led to incomplete data in the application, potentially causing customer dissatisfaction, loss of trust, and potential revenue implications. The duration of the event, lasting around two hours, further compounded the impact. During this period, customer inquiries and support requests likely increased, straining resources and affecting overall operational efficiency. The reputational damage and potential loss of confidence from customers could have long-term consequences on customer retention and acquisition efforts. A thorough investigation is necessary to identify the root cause and implement preventative measures to mitigate similar incidents in the future.

5 Whys section specific prompt rules

Each section of the Correction of Errors document follows a similar process. Section-specific prompts are added to the general prompts and captured in a variable. The variable is then used in an API call. In the interest of brevity, this blog will not show each section-specific API call.

The following screenshot (Figure 9) shows the 5 Whys section-specific prompts. The variable “five_whys_task_rules” is set to “Skip the preamble”. The variable “five_whys_request” is set to an instruction for the LLM. The instruction includes the “facts” variable we previously assigned. Finally, the general and 5 Whys section-specific prompt engineering statements are consolidated into one variable, “five_whys_prompt”, which will be used in the API call to the LLM.

This is a screenshot of prompt engineering statements specific to the 5 Whys section, being set to variables to be used in API calls to Amazon Bedrock. The 5 Whys prompt section reads as: five_whys_request=f“Suggest a 5 whys root cause path for this <facts>{facts}</facts>” five_whys_prompt=task_context+task_rules+background_data+five_whys_task_rules+five_whys_request

Figure 9: 5 Whys prompt engineering
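A sketch of the 5 Whys prompt build up, following the consolidation order shown in Figure 9:

```python
# Rules that apply only to the 5 Whys section
five_whys_task_rules = "Skip the preamble. "

# Section-specific request, wrapping the human-provided facts in XML-style tags
five_whys_request = f"Suggest a 5 whys root cause path for this <facts>{facts}</facts>"

# Consolidate the general and section-specific instructions into one prompt
five_whys_prompt = (
    task_context + task_rules + background_data
    + five_whys_task_rules + five_whys_request
)
```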

Using an API call similar to the one described earlier, the following output was created.

Why were over 10,000 files not successfully transformed?
Because there was an issue with the Transformer Lambda function.

Why was there an issue with the Transformer Lambda function?
Because there was an error or bug in the code that caused the Lambda function to fail when processing certain files.

Why did the error or bug in the code cause the Lambda function to fail?
Because the code was not thoroughly tested for all possible input scenarios before being pushed to production.

Why was the code not thoroughly tested for all possible input scenarios?
Because there was a lack of comprehensive test cases or test data that could have caught the issue during the testing phase.

Why was there a lack of comprehensive test cases or test data?
Because the testing process and procedures were not robust enough to identify edge cases or potential issues with different types of input data.

Root Cause: The root cause appears to be an inadequate testing process and procedures, which led to an unidentified bug or error in the Transformer Lambda function’s code. This bug caused the Lambda function to fail when processing certain types of files, resulting in over 10,000 files not being successfully transformed and the data not being reflected in the application, despite the customers receiving a successful upload message.

To prevent similar incidents in the future, it is essential to review and improve the testing processes, procedures, and test data coverage to ensure that all potential scenarios and edge cases are adequately tested before deploying code to production. Additionally, implementing robust monitoring and alerting mechanisms can help identify issues more quickly and facilitate a faster response and resolution.

Action items section specific prompt rules

Figure 10 shows the “Action items” section-specific prompts. First, we set the variable “action_items_task_rules” to “Skip the preamble. You should provide a list of actions. You should focus on the application. The actions should be S.M.A.R.T. Limit the list to the 3 most impactful actions. Skip the preamble. Do not list the S.M.A.R.T explanation. Be concise. You should suggest a detailed list of subtask for each task.” Next, we set a variable “action_items_request” to an instruction for the LLM. The instruction includes the “five_whys_result” variable we previously assigned. Finally, the general and Action items section-specific prompt engineering statements are consolidated into one variable, “action_items_prompt”, which will be used in the API call to the LLM.

This is a screenshot of prompt engineering statements specific to the Action items section, being set to variables to be used in API calls to Amazon Bedrock. The “action_items_prompt” used in the API call to the LLM reads as: action_items_request=f“Give me an action list to address this root cause analysis <fivewhys>{five_whys_result[‘content’][0][‘text’]}</fivewhys>” action_items_prompt=task_context+task_rules+background_data+action_items_task_rules+action_items_request

Figure 10: Action Items prompt engineering
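A sketch of the Action items prompt build up from Figure 10 follows; five_whys_result is assumed to hold the parsed response from the 5 Whys API call, exactly as impact_result did for the Impact section.

```python
# Rules that apply only to the Action items section (quoted verbatim from the blog text)
action_items_task_rules = (
    "Skip the preamble. You should provide a list of actions. You should focus on the application. "
    "The actions should be S.M.A.R.T. Limit the list to the 3 most impactful actions. "
    "Skip the preamble. Do not list the S.M.A.R.T explanation. Be concise. "
    "You should suggest a detailed list of subtask for each task. "
)

# Section-specific request, feeding in the 5 Whys output generated earlier
action_items_request = (
    "Give me an action list to address this root cause analysis "
    f"<fivewhys>{five_whys_result['content'][0]['text']}</fivewhys>"
)

# Consolidate the general and section-specific instructions into one prompt
action_items_prompt = (
    task_context + task_rules + background_data
    + action_items_task_rules + action_items_request
)
```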

Using an API call similar to the one described earlier, the following output was created.

Enhance Testing and Verification Processes:
– Conduct a comprehensive review of existing test cases and test coverage
– Develop additional test cases to cover edge cases and data transformation scenarios
– Implement rigorous integration and end-to-end testing
– Automate testing processes and incorporate them into the CI/CD pipeline
– Establish a dedicated testing team or assign testing responsibilities to specific individuals
Improve Code Review and Quality Assurance Practices:
– Implement mandatory code reviews for all code changes
– Establish coding standards and best practices
– Conduct regular training sessions on coding practices and quality assurance
– Leverage static code analysis tools to identify potential issues
– Foster a culture of quality and continuous improvement
Enhance Monitoring and Observability:
– Review and improve application logging and error handling mechanisms
– Implement comprehensive monitoring and alerting systems
– Establish clear escalation procedures for incidents and issues
– Conduct regular monitoring and observability reviews
– Provide training on monitoring and observability tools and practices

Summary section specific prompt rules

Figure 11 shows the “Summary” section-specific prompts. First, the variable “summary_task_rules” is set to “Be concise. Limit your answer to 200 words. Skip the preamble”. Next, the variable “summary_request” is set to an instruction for the LLM. The instruction includes the “facts” and “timeline” variables we previously assigned, as well as the “impact_result” and “five_whys_result” variables captured from the outputs of the LLM. Finally, the general and Summary section-specific prompt engineering statements are consolidated into one variable, “summary_prompt”, which will be used in the API call to the LLM.

This is a screenshot of prompt engineering statements specific to the Summary section, being set to variables to be used in API calls to Amazon Bedrock.

Figure 11: Summary section prompt engineering
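The code in Figure 11 is not reproduced in text, so the following is only a sketch under stated assumptions: the wording of summary_request and the concatenation order are illustrative, while the rule string and the referenced variables come from the description above.

```python
# Rules that apply only to the Summary section
summary_task_rules = "Be concise. Limit your answer to 200 words. Skip the preamble. "

# Section-specific request combining the human input with the Impact and 5 Whys
# outputs already generated by the LLM (wording is illustrative)
summary_request = (
    f"Write a summary of this incident using these <facts>{facts}</facts>, "
    f"this <timeline>{timeline}</timeline>, "
    f"this impact analysis <impact>{impact_result['content'][0]['text']}</impact>, "
    f"and this root cause analysis <fivewhys>{five_whys_result['content'][0]['text']}</fivewhys>"
)

# Consolidate the general and section-specific instructions into one prompt
summary_prompt = (
    task_context + task_rules + background_data
    + summary_task_rules + summary_request
)
```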

Using an API call similar to the one described earlier, the following output was created.

The incident involved over 10,000 files failing to transform successfully after an application deployment, despite customers receiving confirmation of successful uploads. The event lasted 2 hours from 9:38 am to 11:38 am GMT-5 on 5/1/2023.

Engineers initially verified metrics were acceptable but later found increased Transformer Lambda errors after reviewing logs. A patch was deployed to test and production environments, with recovery starting at 11:25 am.

The root cause was insufficient integration and performance testing strategies, leading to an inadequate testing environment and data sets that did not represent production workloads and data characteristics. This allowed an issue with the deployment to go undetected before production release, resulting in the Transformer Lambda errors.

The impact was significant, with potential data loss or inaccuracies, compromised service integrity, and undermined customer trust due to the inconsistency between confirmations and actual data processing. Prompt resolution was crucial to restore data consistency and regain customer confidence.

Conclusion

In this blog, we walked through leveraging generative AI to streamline the Correction of Errors process, saving time and resources. For each section, we explained the implementation and showed example prompts that were used for this particular scenario.

To start implementing generative AI with your own CoE process, we recommend using this blog as a reference and Amazon Bedrock to build generative AI applications with security, privacy, and responsible AI practices. While we used Claude 3 Sonnet from Anthropic, you should experiment with multiple models to find the one that works best for your company’s use case. We encourage you to start experimenting with generative AI in your Correction of Errors process today.

Contact an AWS Representative to learn how we can help accelerate your business.

Authors:

Juan Ossa

Juan Ossa is currently a Senior Technical Account Manager. He has worked at AWS since 2020. Juan’s focus areas are EC2-Core and Cloud Operations. As part of the AWS team, he provides advocacy and strategic technical direction and enthusiastically keeps his customers’ AWS environments operationally healthy.

Johnny Hanley

Johnny Hanley is a Solutions Architect at AWS. He has worked at AWS since 2015 in multiple roles. Johnny’s focus areas are Security and the Well-Architected Framework. As part of the AWS Well-Architected team, he works with customers and AWS Partner Network partners of all sizes to help them build secure, high-performing, resilient, and efficient infrastructure for their applications.