Using zonal shift with Amazon EC2 Auto Scaling

This post is written by Michael Haken, Senior Principal Solutions Architect, AWS

Today, we’re announcing support for zonal shift in Amazon EC2 Auto Scaling. Zonal shift allows you to rapidly recover from application impairments in a single Availability Zone (AZ) impacting your Auto Scaling Group (ASG) resources. In this post, we describe how performing an ASG zonal shift fits into a multi-AZ resilience strategy and considerations for how to use the feature with different architectures.

Overview

Using multiple AZs is an architectural best practice for building resilient applications on AWS. Deploying your application across multiple AZs makes your applications more available, fault tolerant, and scalable. EC2 Auto Scaling enables you to further enhance your application’s availability and fault tolerance by dynamically scaling your Amazon Elastic Compute Cloud (Amazon EC2) instances across multiple AZs and replacing them when they’re unhealthy.

AZs in AWS represent a fault isolation boundary, meaning that failures from various sources are contained to a single AZ, whether caused by a bad deployment, networking issues, power loss, or operator error. In 2023, we launched zonal shift, part of Amazon Application Recovery Controller (ARC), which allows you to rapidly recover from application impairments in a single AZ by shifting traffic at your Elastic Load Balancing (ELB) load balancer.

Zonal shift for EC2 Auto Scaling enhances this capability for users who have already implemented recovery patterns for single AZ impairments. It also provides recovery capabilities for architectures that aren’t load balanced by allowing you to prevent new instance launches in a specified AZ. Without zonal shift, when EC2 Auto Scaling detects consistent launch failures in an AZ, the service tries to launch instances in other AZs configured for the ASG. However, certain conditions, like gray failures, can cause post-launch problems in a single AZ that EC2 Auto Scaling doesn’t detect. For example, successfully launched instances in a single AZ might experience elevated error rates when downloading their configuration files from Amazon S3 over a zonal Amazon Virtual Private Cloud (Amazon VPC) interface endpoint. The instances then can’t correctly configure their application software, and they respond to requests with errors. Alternatively, the single-AZ impairment could cause the instances to fail their health checks after provisioning. This causes EC2 Auto Scaling to constantly recycle instances in the impaired AZ, leading to the application running with less capacity than desired.

Although you might choose to perform a zonal shift at your load balancer to mitigate the impact caused by the event, new instances can still be launched in the impacted AZ and don’t receive incoming requests. Even if your application architecture doesn’t use load balancers, zonal shift for EC2 Auto Scaling can help you recover from single-AZ impairments by allowing you to prevent instance launches in the impaired AZ.

Using EC2 Auto Scaling zonal shift to recover

To use zonal shift on your ASG, you need to configure it with an AvailabilityZoneImpairmentPolicy parameter, either when you create a new ASG or when you update an existing one. This parameter has two options: ZonalShiftEnabled, which enables or disables the ability to perform zonal shifts, and ImpairedZoneHealthCheckBehavior, which lets you choose between ignoring or replacing instances in the impaired AZ that EC2 Auto Scaling identifies as unhealthy. First, we look at how you can use zonal shift with a standalone ASG architecture.
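As a rough illustration of the policy described above, the following Python sketch builds the arguments for an UpdateAutoScalingGroup call through boto3. The helper names, the ASG name, and the default of IgnoreUnhealthy are assumptions made for this example. The parameter keys reflect our understanding of the EC2 Auto Scaling API and should be checked against the current SDK documentation.

```python
from typing import Any, Dict

# The two behaviors for instances that EC2 Auto Scaling finds unhealthy.
VALID_BEHAVIORS = ("IgnoreUnhealthy", "ReplaceUnhealthy")


def build_impairment_policy_update(
    asg_name: str,
    health_check_behavior: str = "IgnoreUnhealthy",
    skip_validation: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments that enable zonal shift on an existing ASG.

    Hypothetical helper: it only assembles a dictionary and calls no AWS API.
    """
    if health_check_behavior not in VALID_BEHAVIORS:
        raise ValueError(f"health_check_behavior must be one of {VALID_BEHAVIORS}")
    params: Dict[str, Any] = {
        "AutoScalingGroupName": asg_name,
        "AvailabilityZoneImpairmentPolicy": {
            "ZonalShiftEnabled": True,
            "ImpairedZoneHealthCheckBehavior": health_check_behavior,
        },
    }
    if skip_validation:
        # Acknowledges the overload risk when cross-zone load balancing is disabled.
        params["SkipZonalShiftValidation"] = True
    return params


def enable_zonal_shift(asg_name: str, **options: Any) -> None:
    """Apply the policy with boto3 (requires AWS credentials and a real ASG)."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("autoscaling")
    client.update_auto_scaling_group(
        **build_impairment_policy_update(asg_name, **options)
    )
```

Calling enable_zonal_shift("my-asg") with a real group name would apply the policy. Keeping the builder separate makes it possible to review the request before anything is sent.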

Standalone ASG zonal shift

This architecture uses a standalone ASG without being integrated with an ELB load balancer. Workloads with a standalone ASG commonly perform event driven work such as generating load against a target based on a schedule or processing messages from a queue. The architecture in the following figure uses an ASG that reads messages from an Amazon Simple Queue Service (Amazon SQS) queue, performs some processing on the message data, and writes the results into an Amazon Aurora database. The instances communicate with Amazon SQS using a VPC endpoint in each AZ. Each message varies in size, thus the instances use a heartbeat pattern to update the message visibility timeout until they finish processing it. EC2 Auto Scaling scales instances based on the queue depth, which helps make sure that messages are processed in a timely manner.

Figure 1: EC2 instances deployed across three AZs that process messages from an SQS queue
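The heartbeat pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the implementation used by the workload in the figure. The function name, the timing values, and the use of a background thread are assumptions. The client is expected to behave like a boto3 SQS client, which provides change_message_visibility and delete_message.

```python
import threading
from typing import Any, Callable


def process_with_heartbeat(
    sqs_client: Any,
    queue_url: str,
    receipt_handle: str,
    work: Callable[[], Any],
    heartbeat_seconds: float = 20.0,
    visibility_timeout: int = 60,
) -> Any:
    """Run work() while periodically extending the message visibility timeout."""
    done = threading.Event()

    def heartbeat() -> None:
        # Event.wait returns False on timeout, which means work() is still running.
        while not done.wait(heartbeat_seconds):
            sqs_client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=visibility_timeout,
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        result = work()
    finally:
        # Stop the heartbeat whether the work succeeded or raised.
        done.set()
        thread.join()
    # Reached only on success; a failed message becomes visible again after the timeout.
    sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
    return result
```

Note that as long as work() keeps running, even slowly, the message stays hidden from other consumers. That is the behavior that makes a gray failure in one AZ hard to recover from without intervention.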

Say that a networking degradation causes instances in AZ 1 to experience elevated error rates when attempting to write to the Aurora database, resulting in a 2x increase in the p50 processing latency. The instances in AZ 1 continue to heartbeat until they time out, keeping the message hidden and preventing other healthy instances from taking over the work. As a result, the queue depth grows and EC2 Auto Scaling deploys a new instance, as shown in the following figure.

Figure 2: EC2 Auto Scaling launches a new instance in AZ 1 in response to the queue depth growing

The new instance lands in AZ 1 and experiences the same problem as the other instance, thus it can’t decrease the queue depth and processing latency. Instead, it exacerbates the issue by consuming additional messages that aren’t successfully processed. The instances in AZ 1 never appeared unhealthy, thus EC2 Auto Scaling didn’t take any actions to replace them. To mitigate this problem, you can start a zonal shift for your ASG. This makes sure that any future instance launches only happen in AZ 2 or AZ 3, as shown in the following figure.

Figure 3: After the zonal shift, EC2 Auto Scaling launches new instances only in AZ 2 and AZ 3
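Starting the zonal shift for the ASG could look something like the following Python sketch, which uses the boto3 arc-zonal-shift client. The helper names, the one-hour default expiry, and the comment text are assumptions for this example. The expiry format check reflects our understanding that the API accepts a whole number followed by m or h.

```python
import re
from typing import Any, Dict

# Assumed format for expiresIn: a positive whole number followed by m or h.
_EXPIRES_IN = re.compile(r"^[1-9][0-9]*[mh]$")


def build_start_zonal_shift_request(
    asg_arn: str,
    away_from_az_id: str,
    expires_in: str = "1h",
    comment: str = "Shifting ASG launches away from an impaired AZ",
) -> Dict[str, str]:
    """Build keyword arguments for StartZonalShift (hypothetical helper)."""
    if not _EXPIRES_IN.match(expires_in):
        raise ValueError("expires_in must look like '30m' or '2h'")
    return {
        "resourceIdentifier": asg_arn,   # ARN of the Auto Scaling group
        "awayFrom": away_from_az_id,     # AZ ID, for example "use1-az1"
        "expiresIn": expires_in,
        "comment": comment,
    }


def start_asg_zonal_shift(asg_arn: str, away_from_az_id: str, **options: Any) -> str:
    """Start the shift with boto3 and return the zonal shift ID."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("arc-zonal-shift")
    response = client.start_zonal_shift(
        **build_start_zonal_shift_request(asg_arn, away_from_az_id, **options)
    )
    return response["zonalShiftId"]
```

The returned ID is worth recording, because it is what you use later to cancel the shift.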

You can optionally mark the instances in AZ 1 as unhealthy using the SetInstanceHealth API. This forces EC2 Auto Scaling to replace them, which prevents them from contributing further latency and errors. Changing the instance health state is considered a mutating change and relies on the EC2 Auto Scaling control plane. Therefore, you should avoid making this a critical step in your recovery plan. When you are confident that the impairment has abated, you can cancel the zonal shift, which causes EC2 Auto Scaling to automatically rebalance capacity across your AZs.
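The optional clean-up step and the final cancellation might be sketched as follows. The function names are hypothetical, and the clients are passed in so that the calls can be exercised with stand-ins. The autoscaling client is expected to provide set_instance_health, and the ARC zonal shift client is expected to provide cancel_zonal_shift, as the boto3 clients do. Failures are collected rather than raised, in keeping with the advice not to make this step critical to recovery.

```python
from typing import Any, Dict, Iterable


def mark_instances_unhealthy(
    autoscaling_client: Any, instance_ids: Iterable[str]
) -> Dict[str, str]:
    """Best-effort SetInstanceHealth calls; returns failures keyed by instance ID."""
    failures: Dict[str, str] = {}
    for instance_id in instance_ids:
        try:
            autoscaling_client.set_instance_health(
                InstanceId=instance_id,
                HealthStatus="Unhealthy",
                ShouldRespectGracePeriod=False,
            )
        except Exception as error:  # the control plane may itself be impaired
            failures[instance_id] = str(error)
    return failures


def cancel_asg_zonal_shift(arc_client: Any, zonal_shift_id: str) -> None:
    """Cancel the shift once the impairment has abated."""
    arc_client.cancel_zonal_shift(zonalShiftId=zonal_shift_id)
```

With real clients, you would pass boto3.client("autoscaling") and boto3.client("arc-zonal-shift") respectively.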

ASG with ELB zonal shift

In this section, we look at how to use zonal shift with an ASG that serves traffic from an ELB load balancer. We also examine how the ImpairedZoneHealthCheckBehavior option affects recovery in this situation. In this architecture, the instances in the ASG read data from the database when they receive HTTP requests from the load balancer, as shown in the following figure.

Figure 4: A three-tier application deployed in three AZs using an ALB, ASG, and Aurora database

In this scenario, the instances in AZ 1 start experiencing increased latency with their Amazon Elastic Block Store (Amazon EBS) volumes, causing them to respond to requests with errors and fail their EC2 instance status checks. Initially, to mitigate the impact, you can start a zonal shift at your load balancer to prevent your users from receiving errors. Then, you can initiate a zonal shift for your ASG to prevent new capacity from being launched into the AZ that isn’t receiving traffic.

If the ASG’s ImpairedZoneHealthCheckBehavior is set to IgnoreUnhealthy, then the instances in AZ 1 that are failing their health checks aren’t terminated by EC2 Auto Scaling, as shown in the following figure. This can be helpful if you’re pre-scaled to handle the loss of an AZ’s worth of capacity, because EC2 Auto Scaling doesn’t attempt to launch additional instances that you don’t need. It can also make recovery safer by leaving capacity in the AZ. When you end your load balancer zonal shift after the impairment ends, the AZ can immediately start receiving traffic again.

Figure 5: Performing a zonal shift on the ALB and ASG, choosing to ignore unhealthy instances in the ASG

Alternatively, you can set the option to ReplaceUnhealthy. Now, instances that are found to be unhealthy by EC2 Auto Scaling are replaced. This option can be helpful if you aren’t pre-scaled to handle the loss of capacity. EC2 Auto Scaling launches new instances into the remaining AZs to bring the ASG back to its desired capacity, as shown in the following figure. However, this approach also has a tradeoff: launching new instances isn’t guaranteed to be successful, thus you might experience delays in acquiring new capacity.

Figure 6: Performing a zonal shift on the ALB and ASG, this time replacing unhealthy instances in the remaining AZs
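The choice between the two options can be reasoned about with some simple arithmetic. The following Python sketch is only an illustration: the function names and the required_capacity input (the number of instances needed to serve your load) are hypothetical, and it assumes that instances are spread as evenly as possible across AZs and that the largest AZ is the one lost.

```python
import math


def capacity_after_az_loss(current_capacity: int, az_count: int) -> int:
    """Instances left if the AZ holding the most instances is lost."""
    if az_count < 2:
        raise ValueError("at least two AZs are needed to survive the loss of one")
    largest_az = math.ceil(current_capacity / az_count)
    return current_capacity - largest_az


def suggest_health_check_behavior(
    current_capacity: int, az_count: int, required_capacity: int
) -> str:
    """Suggest IgnoreUnhealthy when pre-scaled, otherwise ReplaceUnhealthy."""
    if capacity_after_az_loss(current_capacity, az_count) >= required_capacity:
        return "IgnoreUnhealthy"   # remaining AZs can already carry the load
    return "ReplaceUnhealthy"      # new instances are needed in the remaining AZs
```

For example, with 12 instances across three AZs, eight instances remain after losing one AZ. If eight instances are enough to serve your load, then you are pre-scaled and ignoring unhealthy instances is a reasonable choice.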

In both situations you must consider whether you have cross-zone load balancing enabled or disabled. When cross-zone load balancing is enabled, each instance, regardless of its AZ, receives an approximately equal share of the traffic. This means that you can end your zonal shift for both your load balancer and ASG at the same time safely. As EC2 Auto Scaling rebalances your instances across each enabled AZ, they receive the same percentage of traffic.

If cross-zone load balancing is disabled, then each AZ receives an equal percentage of the traffic, regardless of how many instances are in the AZ. If you’ve chosen to replace unhealthy instances, or if your ASG has scaled during the event, then the capacity across your AZs could have become imbalanced. When you end your load balancer zonal shift and EC2 Auto Scaling begins to rebalance your capacity, you could end up in a situation shown in the following figure, where a single or small number of instances gets an overwhelming portion of the load.

Figure 7: A three-tier architecture with an imbalance of capacity among its three AZs
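To make the imbalance concrete, the following Python sketch computes the fraction of total traffic that each instance receives under both load balancing modes. It is a simplified model rather than a description of how ELB routes requests: the function name and the instance counts are made up for illustration, and AZs with no instances are assumed to receive no traffic.

```python
from fractions import Fraction
from typing import Dict


def per_instance_traffic_share(
    instances_per_az: Dict[str, int], cross_zone_enabled: bool
) -> Dict[str, Fraction]:
    """Return the share of total traffic handled by one instance in each AZ."""
    serving = {az: count for az, count in instances_per_az.items() if count > 0}
    if not serving:
        raise ValueError("no instances available to serve traffic")
    if cross_zone_enabled:
        # Every instance gets an equal share, regardless of its AZ.
        total = sum(serving.values())
        return {az: Fraction(1, total) for az in serving}
    # Every AZ gets an equal share, split among the instances in that AZ.
    return {az: Fraction(1, len(serving)) / count for az, count in serving.items()}
```

With one instance in AZ 1, six in AZ 2, and five in AZ 3, disabling cross-zone load balancing leaves the single instance in AZ 1 handling one third of all traffic. With cross-zone load balancing enabled, it would handle one twelfth, the same as every other instance.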

This imbalance can present an overload risk. Thus, when your ASG uses a load balancer with cross-zone load balancing disabled, you must specify the --skip-zonal-shift-validation parameter when you enable zonal shift to acknowledge that you understand the risk. However, you can help prevent overload from occurring due to imbalance by using the load balancer’s target_group_health.dns_failover.minimum_healthy_targets.count option and specifying the number of instances that should be present in the AZ. If you’re using three AZs and your desired capacity is 12, then you should set the value to four (which represents one third of the ASG’s total capacity). This prevents traffic from being routed to the AZ until there is enough healthy capacity there to handle the load. You may need to dynamically adjust this number as the ASG scales over time, because the minimum count you set in the past may not be the right minimum count today.
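The minimum healthy target count can be derived from the ASG’s desired capacity, as in the following Python sketch. The helper names are hypothetical, and rounding up when the capacity doesn’t divide evenly is an assumption made for this example. The resulting dictionary is shaped for the Elastic Load Balancing ModifyTargetGroupAttributes API, which expects attribute values as strings.

```python
import math
from typing import Any, Dict

MIN_HEALTHY_KEY = "target_group_health.dns_failover.minimum_healthy_targets.count"


def minimum_healthy_targets(desired_capacity: int, az_count: int) -> int:
    """One AZ's share of the desired capacity, rounded up."""
    if desired_capacity < 1 or az_count < 1:
        raise ValueError("desired_capacity and az_count must be positive")
    return math.ceil(desired_capacity / az_count)


def build_target_group_attributes(
    target_group_arn: str, desired_capacity: int, az_count: int
) -> Dict[str, Any]:
    """Build keyword arguments for ModifyTargetGroupAttributes (hypothetical helper)."""
    count = minimum_healthy_targets(desired_capacity, az_count)
    return {
        "TargetGroupArn": target_group_arn,
        "Attributes": [{"Key": MIN_HEALTHY_KEY, "Value": str(count)}],
    }
```

You could pass the result to modify_target_group_attributes on a boto3 elbv2 client, and rerun the calculation whenever the desired capacity changes.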

Zonal shift best practices

As a set of best practices, we recommend that you:

Pre-scale to handle the loss of an AZ’s worth of capacity
Configure your impairment policy to ignore unhealthy hosts
Enable cross-zone load balancing

With this configuration, you can also safely use zonal autoshift. When zonal autoshift is enabled, AWS automatically starts and ends the zonal shift on your behalf whenever the AWS telemetry indicates there is an impairment affecting a single AZ. This can be used in conjunction with zonal autoshift for your ELB load balancer. If you are not using zonal autoshift, then you can still use the EventBridge observer notifications to inform your zonal shift decisions or start automated processes. Refer to the EC2 Auto Scaling zonal shift documentation for more details on the full set of best practices when using zonal shift.

Conclusion

In this post we showed you the benefits of using zonal shift with your Amazon EC2 Auto Scaling Groups as part of enhancing your resilience in multi-AZ architectures. We explored several scenarios where zonal shift can be used, and reviewed best practices for using zonal shift safely and effectively. To get started using zonal shift with your ASGs, refer to the documentation.

AWS Compute Blog