
Improving operational visibility with AWS Fargate task retirement notifications

Introduction

AWS Fargate, the serverless compute engine for containerized workloads, removes the undifferentiated heavy lifting of securing and patching the underlying infrastructure. In this blog post we dive into AWS Fargate task retirement, one of the ways AWS keeps the infrastructure secure and up to date.

AWS has recently updated the AWS Fargate task retirement process, consolidating the notifications customers receive about upcoming retirements, and rolling out a mechanism to allow customers to control the time between a notification and a task retirement. In this post, we will explore these changes in more detail and provide an example of how to use these notifications to achieve operational excellence.

Background

When deploying an Amazon Elastic Container Service (Amazon ECS) task onto AWS Fargate, a platform version is specified in the Amazon ECS service or standalone task API call. A platform version refers to the runtime environment of the host operating system, a combination of the kernel and container runtime. Within a platform version, there is an internal construct known as a platform version revision. Platform version revisions are immutable and are released as the runtime environment evolves, for example, when there are kernel bug fixes or security updates. Every time a new task is scheduled onto AWS Fargate, it is launched onto the latest revision of the specified platform version.

Over time, AWS may determine that an existing platform version revision that is supporting running tasks needs to be retired. When a revision is retired, all tasks running on that revision will be stopped by AWS Fargate. There are a number of reasons why a revision may need to be retired, including security vulnerabilities and performance improvements. In the past, AWS Fargate has retired one to two platform version revisions each month; however, there is no fixed support period for a particular platform version revision. Because of the typical life span of a platform version revision, customers with short-lived workloads will experience far fewer task retirements than customers whose tasks run for multiple weeks.

[Figure: AWS Fargate platform version revision lifecycle]

The diagram above shows the full lifecycle of an AWS Fargate platform version revision. Once a new platform version revision is launched, all new tasks will be scheduled onto this revision. Existing tasks that have already been scheduled and are running will remain on the revision they were originally placed on for the duration of the task and will not be migrated to the new revision. If the task is replaced, for example as part of an update to an ECS service or an AWS Fargate task retirement, the new task will be placed onto the latest platform version revision available at the time of launch.

Task retirement

The following diagram shows the end-to-end process of AWS Fargate task retirement.

[Figure: AWS Fargate task retirement process]

When AWS marks a platform version revision as needing to be retired, we identify all of the tasks that are running on that platform version revision in all AWS Regions. We then send out one notification per account per Region, highlighting the affected tasks or services and the date when the retirements will start to take place. The notification is sent via email to the primary email contact on the AWS account, as well as to the AWS Health Dashboard. Notifications sent to the AWS Health Dashboard can be forwarded through Amazon EventBridge to AWS services or third-party tools.

Once a notification has been sent, a customer has a period of time (known as the task retirement wait period) to take manual action if they want to control the exact timing before AWS Fargate initiates the automatic task retirement process. When AWS Fargate stops a task that is part of an ECS service, the task is stopped in a way that respects the service’s minimumHealthyPercent value. For standalone tasks, it is the customer’s responsibility to monitor the state of running tasks and start replacements.

To minimize the impact of AWS Fargate task retirement, workloads should be deployed following Amazon ECS best practices. For example, when deploying a stateless application as an Amazon ECS service, such as a web or API server, customers should deploy multiple task replicas and set the minimumHealthyPercent to 100%. That way, when AWS Fargate starts retiring tasks, Amazon ECS will first schedule a new task and wait for it to be running before retiring an old task.
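To make the effect of minimumHealthyPercent concrete, the following Python sketch computes how many tasks must stay running while tasks are being replaced. It assumes the minimum healthy count is the desired count multiplied by minimumHealthyPercent, rounded up to the nearest integer. The function names are illustrative and are not part of any AWS SDK.

```python
def min_healthy_tasks(desired_count: int, minimum_healthy_percent: int) -> int:
    """Lower bound on running tasks during a replacement (assumed to round up)."""
    if desired_count < 0 or minimum_healthy_percent < 0:
        raise ValueError("desired_count and minimum_healthy_percent must be >= 0")
    # Integer ceiling of desired_count * percent / 100, avoiding float rounding.
    return (desired_count * minimum_healthy_percent + 99) // 100


def stoppable_before_replacement(desired_count: int, minimum_healthy_percent: int) -> int:
    """Tasks that could be stopped before any replacement task is running."""
    floor = min_healthy_tasks(desired_count, minimum_healthy_percent)
    return max(0, desired_count - floor)
```

With four replicas and minimumHealthyPercent set to 100, stoppable_before_replacement(4, 100) is 0, so a replacement has to be running before an old task can be stopped. With minimumHealthyPercent set to 50, two of the four tasks could be stopped at once.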

For more information on the task retirement process, see the AWS Fargate documentation.

Task retirement wait period

The length of the task retirement wait period can now be controlled by a new Amazon ECS AccountSetting, fargateTaskRetirementWaitPeriod. Customers can use the wait period to stop tasks on their own schedule before AWS Fargate retires them, for example if they have workloads that can only be stopped in a specific window.

The task retirement wait period can be configured to one of the set time intervals in the table below. We recommend biasing towards a shorter wait period where possible, to pick up new platform version revisions sooner.

Days  Action
0     AWS sends the notification and immediately starts to retire affected tasks.
7     AWS sends the notification and waits 7 calendar days before starting to retire affected tasks.
14    AWS sends the notification and waits 14 calendar days before starting to retire affected tasks.

In the rare scenario of a critical security update, AWS Fargate may override this task retirement wait period, sending a task retirement notification and immediately retiring the affected tasks. This mirrors the effect of setting fargateTaskRetirementWaitPeriod to 0.
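As a rough planning aid, the wait period can be turned into the earliest date on which retirements might begin. The sketch below is illustrative only: the function name and the example date are made up, and it does not account for the critical security override described above.

```python
from datetime import date, timedelta

# The three wait periods listed in the table above.
ALLOWED_WAIT_PERIODS = (0, 7, 14)


def earliest_retirement_date(notification_date: date, wait_period_days: int) -> date:
    """Earliest calendar date on which retirements may start."""
    if wait_period_days not in ALLOWED_WAIT_PERIODS:
        raise ValueError(f"wait period must be one of {ALLOWED_WAIT_PERIODS}")
    return notification_date + timedelta(days=wait_period_days)
```

For a hypothetical notification sent on 2024-01-10 with a 14 day wait period, earliest_retirement_date(date(2024, 1, 10), 14) returns 2024-01-24.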

The existing fargateTaskRetirementWaitPeriod value can be seen with the aws ecs list-account-settings command.

$ aws ecs list-account-settings \
    --name fargateTaskRetirementWaitPeriod \
    --effective-settings
{
    "settings": [
        {
            "name": "fargateTaskRetirementWaitPeriod",
            "value": "14",
            "principalArn": "arn:aws:iam::123456789012:root"
        }
    ]
}
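If the setting needs to be checked from a script, the JSON printed by the command can be parsed directly. The following is a minimal sketch that assumes the output has the shape shown above. The helper name is hypothetical.

```python
import json
from typing import Optional

SETTING_NAME = "fargateTaskRetirementWaitPeriod"


def wait_period_from_cli_output(cli_output: str) -> Optional[int]:
    """Return the configured wait period in days, or None if it is not listed."""
    settings = json.loads(cli_output).get("settings", [])
    for setting in settings:
        # strip() tolerates stray whitespace around the setting name.
        if setting.get("name", "").strip() == SETTING_NAME:
            return int(setting["value"])
    return None
```

The string passed to the function would be the captured standard output of the aws ecs list-account-settings command.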

The fargateTaskRetirementWaitPeriod can be configured with the aws ecs put-account-setting-default command.

$ aws ecs put-account-setting-default \
    --name fargateTaskRetirementWaitPeriod \
    --value 14
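When the setting is applied from automation, it can help to validate the value before calling the CLI. The sketch below only builds the argument list for the command shown above. The function name is hypothetical, and running the command is left to the caller.

```python
import shlex
from typing import List

ALLOWED_WAIT_PERIODS = (0, 7, 14)


def put_wait_period_command(days: int) -> List[str]:
    """Build the put-account-setting-default command for a valid wait period."""
    if days not in ALLOWED_WAIT_PERIODS:
        raise ValueError(f"wait period must be one of {ALLOWED_WAIT_PERIODS}")
    return [
        "aws", "ecs", "put-account-setting-default",
        "--name", "fargateTaskRetirementWaitPeriod",
        "--value", str(days),
    ]


def as_shell_string(command: List[str]) -> str:
    """Render the argument list as a single, safely quoted shell line."""
    return shlex.join(command)
```

The returned list can be passed to subprocess.run, or rendered with as_shell_string for logging.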

For more information on the task retirement wait period, see the task retirement and the Amazon ECS AccountSetting documentation.

Solution overview: Capturing task retirement notifications

When there is an upcoming task retirement, AWS sends a task retirement notification to the AWS Health Dashboard and to the primary email contact on the AWS account. The AWS Health Dashboard provides a number of integrations into other AWS services, including Amazon EventBridge. By using Amazon EventBridge, customers can build automations from a task retirement notification, such as increasing the visibility of the upcoming retirement by forwarding the message to a ChatOps tool.

AWS Health Aware is a great resource for showing the power of the AWS Health Dashboard and how notifications can be distributed throughout an organization. In this walkthrough, we take inspiration from this project and forward a task retirement notification to the chat application Slack. The notifications are captured by an EventBridge rule looking for events with the Event Detail Type "AWS Health Event" and the Event Type Code "AWS_ECS_TASK_PATCHING_RETIREMENT". Once the rule has captured a notification, it triggers an AWS Lambda function that parses the event for affected resources and forwards it to a Slack Incoming Webhook.
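The matching logic can be sketched in Python. The pattern below encodes only the two conditions named above. It assumes the type code is carried in the eventTypeCode field of the event detail, as in AWS Health events, and the rule in the sample repository may be written differently. The matches function is a deliberately simplified stand-in for EventBridge pattern matching: it handles exact values and nested objects only.

```python
from typing import Any, Dict

# Assumed event pattern: detail type plus the Health event type code.
RETIREMENT_EVENT_PATTERN: Dict[str, Any] = {
    "detail-type": ["AWS Health Event"],
    "detail": {"eventTypeCode": ["AWS_ECS_TASK_PATCHING_RETIREMENT"]},
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: every pattern key must match one listed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True
```

An event whose detail-type and eventTypeCode both match is accepted. Any other AWS Health event, such as one with a different type code, is ignored.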

The following diagram shows the high-level architecture of this solution.

[Figure: Task retirement notification walkthrough architecture]
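The Lambda function in this architecture could look roughly like the sketch below. It is not the code from the sample repository. It assumes that affected resources arrive in the affectedEntities list of the AWS Health event detail, and that the webhook URL and channel are supplied through environment variables with the hypothetical names SLACK_WEBHOOK_URL and SLACK_CHANNEL.

```python
import json
import os
import urllib.request
from typing import Any, Dict, List


def affected_resources(event: Dict[str, Any]) -> List[str]:
    """Collect entity values from the Health event detail (assumed shape)."""
    entities = event.get("detail", {}).get("affectedEntities", [])
    return [e["entityValue"] for e in entities if "entityValue" in e]


def build_slack_payload(event: Dict[str, Any], channel: str) -> Dict[str, str]:
    """Turn the event into a plain-text Slack message."""
    detail = event.get("detail", {})
    resources = affected_resources(event)
    lines = [
        "AWS Fargate task retirement notification",
        f"Account: {event.get('account', 'unknown')}, Region: {event.get('region', 'unknown')}",
        f"Event start time: {detail.get('startTime', 'unknown')}",
        "Affected resources:",
    ]
    lines.extend(f"- {r}" for r in resources)
    if not resources:
        lines.append("- none listed")
    # "channel" may be ignored, depending on how the webhook is configured.
    return {"channel": channel, "text": "\n".join(lines)}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Entry point: post the formatted message to the Slack Incoming Webhook."""
    payload = build_slack_payload(event, os.environ["SLACK_CHANNEL"])
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"statusCode": response.status}
```

Keeping the message formatting in build_slack_payload, separate from the HTTP call, means the formatting can be unit tested without reaching Slack.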

Prerequisites

To complete the walkthrough, the following prerequisites need to be in place:

  • An existing Slack workspace with the Incoming Webhook Slack application installed and enabled.
  • An AWS account with the relevant permissions to deploy an Amazon EventBridge rule and AWS Lambda function.
  • The AWS SAM CLI installed and configured on a local development workstation.

Solution walkthrough

  1. The sample code for the walkthrough is stored in a GitHub repository. The first step is to clone the repository to a local development workstation.
$ git clone https://github.com/aws-samples/capturing-aws-fargate-task-retirement-notifications.git
$ cd capturing-aws-fargate-task-retirement-notifications
  2. Next, build and deploy the Lambda function and the EventBridge rule defined in the AWS SAM template cloudformation.yaml. You will need to enter parameters into the deployment wizard, including your Slack workspace URL and Slack channel.
$ sam build --template cloudformation.yaml
$ sam deploy --guided
Configuring SAM deploy
======================

     Looking for config file [samconfig.toml] :  Not found

    Setting default arguments for 'sam deploy'
    =========================================
    Stack Name [sam-app]: FargateTaskRetirementNotifications
    AWS Region [eu-west-1]: eu-west-1
    Parameter SlackWorkspaceURL []: https://hooks.slack.com/services/workspace/app/token
    Parameter SlackChannel []: platform-eng
    #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
    Confirm changes before deploy [y/N]: y
    #SAM needs permission to be able to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]:
    #Preserves the state of previously provisioned resources when an operation fails
    Disable rollback [y/N]:
    Save arguments to configuration file [Y/n]:
    SAM configuration file [samconfig.toml]:
    SAM configuration environment [default]:
  3. Test it! Send two sample events to Amazon EventBridge to ensure everything is working correctly. Because AWS Health notifications cannot be simulated, we instead trigger the workflow by creating EventBridge events that match the EventBridge rule. There are two events in the sample repository, one for tasks that belong to an Amazon ECS service and one for standalone tasks.
$ aws events put-events --entries file://sample_service_event.json
$ aws events put-events --entries file://sample_task_event.json
  4. In your Slack workspace, you should now see two Slack notifications, one for each test event.

[Figure: Task retirement Slack notification, example 1]

[Figure: Task retirement Slack notification, example 2]
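To test with your own resource names, an entries file for aws events put-events can be generated instead of written by hand. The sketch below is an illustration, not a copy of the sample files in the repository. It uses a hypothetical custom source, on the assumption that sources beginning with "aws." are reserved for AWS services, so it only triggers a rule that does not filter on the event source.

```python
import json
from typing import Dict, List


def build_test_entries(
    resources: List[str],
    source: str = "custom.fargate.retirement.test",  # hypothetical source name
) -> List[Dict[str, str]]:
    """Build a put-events entries list that imitates a retirement notification."""
    detail = {
        "eventTypeCode": "AWS_ECS_TASK_PATCHING_RETIREMENT",
        "service": "ECS",
        "affectedEntities": [{"entityValue": r} for r in resources],
    }
    return [
        {
            "Source": source,
            "DetailType": "AWS Health Event",
            # PutEvents expects Detail as a JSON-encoded string, not an object.
            "Detail": json.dumps(detail),
        }
    ]


def entries_as_json(resources: List[str]) -> str:
    """Serialize the entries for use with --entries file://<path>."""
    return json.dumps(build_test_entries(resources), indent=2)
```

Writing the output of entries_as_json to a file gives something that can be passed to the put-events command in the same way as the sample events above.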

Clean Up

To clean up the sample walkthrough, use the AWS SAM CLI to remove the CloudFormation stack by running sam delete.

Conclusion

In this blog post, we dived deep into the AWS Fargate task retirement process. We have shown how the task retirement wait period can be adjusted if customers want to control the time between a notification and a retirement. Finally, we have shown how customers can capture task retirement notifications with Amazon EventBridge and AWS Lambda. To learn more about AWS Fargate task retirement, please see the AWS Fargate documentation.