Networking & Content Delivery
Using cross-zone load balancing with zonal shift
Today, we’re announcing Amazon Application Recovery Controller (ARC) zonal shift support for Application Load Balancers (ALB) with cross-zone load balancing enabled. This complements the support for Network Load Balancers (NLB) using cross-zone load balancing we announced previously. Now you can use zonal shift with both NLBs and ALBs, with or without cross-zone load balancing configured, as well as with other resources such as Amazon EC2 Auto Scaling groups (ASG) and Amazon Elastic Kubernetes Service (EKS). The blog post Rapidly recover from application failures in a single AZ provided an overview of how zonal shift works and associated best practices when cross-zone load balancing is disabled. This post will provide operational best practices for using zonal shift with cross-zone load balancing enabled.
Overview
To start using zonal shift for ALB or NLB, you must set the load balancer attribute zonal_shift.config.enabled to true. For NLBs using cross-zone load balancing, you must also ensure that target_health_state.unhealthy.connection_termination.enabled is set to false. With the feature enabled, you can start a zonal shift to mitigate impact when you identify an impairment in a single Availability Zone (AZ).
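As a sketch of that setup, the attribute payloads might look like the following; a boto3 elbv2 client is assumed and the ARNs in the comments are placeholders:

```python
# Sketch: attribute payloads for enabling zonal shift. The boto3 calls
# and ARNs in the comments are illustrative placeholders.
lb_attributes = [
    # Allow ARC zonal shift to act on this load balancer.
    {"Key": "zonal_shift.config.enabled", "Value": "true"},
]

nlb_target_group_attributes = [
    # Required for NLB target groups with cross-zone load balancing:
    # do not terminate connections to unhealthy targets.
    {"Key": "target_health_state.unhealthy.connection_termination.enabled",
     "Value": "false"},
]

# With credentials configured, the calls would look like:
# elbv2 = boto3.client("elbv2")
# elbv2.modify_load_balancer_attributes(
#     LoadBalancerArn="arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/app/my-alb/...",
#     Attributes=lb_attributes)
# elbv2.modify_target_group_attributes(
#     TargetGroupArn="arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-tg/...",
#     Attributes=nlb_target_group_attributes)
```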
Zonal shift takes two actions when cross-zone load balancing is enabled. First, it removes the IP address of the load balancer node in the specified AZ from DNS, so new queries won’t resolve to that endpoint. This stops future client requests from being sent to that node. Second, it instructs the load balancer nodes in the other AZs not to route requests to targets in the impaired AZ. Cross-zone load balancing is still utilized in the remaining AZs during the zonal shift, as shown in Figure 1.
You may choose to also perform a zonal shift on your ASG behind the load balancer during an AZ impairment. If you’ve configured the ASG to replace unhealthy instances during a zonal shift, this may result in instances being terminated in the impaired AZ and new instances being launched in the other AZs. It’s also possible that EC2 Auto Scaling will scale out your application during the zonal shift and launch those new instances in the unaffected AZs. This can create a capacity imbalance among your AZs.
When you determine the AZ impairment has ended, you can cancel the shift and rebalance traffic into the AZ. Cross-zone load balancing helps make rebalancing safer when you have a capacity imbalance because the overall traffic percentage received per target will decrease when you end the load balancer zonal shift. This happens because the load balancer distributes traffic evenly across each target in your target group, as shown in Figure 2.
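Starting and later canceling a shift through the ARC zonal shift API might look like the following sketch; a boto3 arc-zonal-shift client is assumed, and the resource ARN, zone ID, duration, and comment are placeholders:

```python
# Sketch: request payload for starting a zonal shift, then canceling it
# once the AZ has recovered. All values are illustrative placeholders.
start_params = {
    "resourceIdentifier": "arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/app/my-alb/...",
    "awayFrom": "use1-az1",   # zone ID of the impaired AZ
    "expiresIn": "3h",        # shifts expire so traffic is never shifted away indefinitely
    "comment": "Elevated 5xx rate isolated to use1-az1",
}

# With credentials configured, the calls would look like:
# arc = boto3.client("arc-zonal-shift")
# shift = arc.start_zonal_shift(**start_params)
# ...after confirming the AZ impairment has ended:
# arc.cancel_zonal_shift(zonalShiftId=shift["zonalShiftId"])
```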
In contrast, with cross-zone load balancing disabled, the load balancer distributes traffic evenly across AZs; each load balancer node then distributes requests across the available targets in its own zone. A capacity imbalance among AZs can cause certain instances to receive more load than others after you end the load balancer zonal shift. This could lead to overload and impact to your application. For example, Figure 3 shows how the instance in AZ 2 receives approximately twice as much traffic as the targets in AZ 1 and AZ 3. In this configuration, it’s important to use target_group_health.dns_failover.minimum_healthy_targets.count to prevent the AZ from accepting traffic until enough healthy hosts are available.
Cross-zone load balancing is enabled by default for ALBs and can optionally be enabled for NLBs. This allows you to take advantage of zonal shift without having to make large-scale changes to the configuration of your ALB target groups. You can also opt in to zonal autoshift for your ALBs in their default configuration. AWS starts an autoshift when internal telemetry indicates that there is an AZ impairment that could potentially impact customers. You can use zonal autoshift in conjunction with the weighted random routing algorithm. This helps you minimize recovery time during an event, and reduces the additional observability you need to build to take advantage of zonal shift.
While zonal autoshift and Automatic Target Weights (ATW) anomaly mitigation are the preferred ways to react to single-AZ impacts, these tools may not detect certain infrastructure gray failures or single-AZ application impairments. For example, an application deployment containing a bug that was deployed to a single AZ, or a small amount of packet loss impacting a handful of instances that starts causing application errors. You may need to develop additional observability to detect these situations. In the next section, I examine how to detect single-AZ impairments with cross-zone load balancing enabled.
AZ observability for zonal shift with cross-zone load balancing enabled
Monitoring metrics such as request count, fault rate, and latency per AZ is a prerequisite for determining when an AZ may be experiencing an impairment, and allows you to safely mitigate potential impact. The following three signals can help you know when to use zonal shift.
- AZ health metrics showing availability or latency impact.
- The AZ is an outlier for fault rate or latency compared to the other AZs.
- The fault rate or high latency is caused by more than a single instance.
Let’s review how you can start collecting metrics about the health of your application in each AZ.
Creating AZ health metrics
One of the observability best practices for resilience is to monitor your customer experience with synthetic canaries. These act as an early-warning indicator so you can notify yourself of a problem before your customers do. In the post Rapidly recover from application failures in a single AZ, we used Amazon CloudWatch Synthetics to monitor the zonal endpoints of your ALBs and NLBs to produce per-AZ metrics, as shown in Figure 4.
Synthetics are still a best practice with cross-zone load balancing enabled. However, it’s not as useful to test each zonal endpoint for an ALB or NLB because the response could come from a target in any AZ. Instead, for ALBs, you can use the ALB load balancer Amazon CloudWatch metrics to identify when targets in a specific AZ show elevated fault rates or latency. ALB target metrics provide 2XX, 3XX, 4XX, and 5XX counts as well as a metric for TargetResponseTime. All of these metrics have AvailabilityZone as a metric dimension, which represents the AZ of the target that produced the response.
For NLBs it can be more difficult to determine changes in application health because their target metrics are mostly layer 4 information. You could monitor the TCP_Target_Reset_Count metric as a possible proxy for application health, but this may still be insufficient. When cross-zone load balancing is enabled on your NLB or its target groups, you should utilize custom server-side metrics that provide the target’s AZ as a metric dimension. Refer to Publishing custom metrics and the CloudWatch embedded metric format for more details on how to achieve this.
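As a minimal sketch, an embedded metric format (EMF) log line carrying the target's AZ as a dimension could be built like this; the MyApp namespace and metric names are illustrative assumptions:

```python
import json
import time

def emf_record(az_id: str, fault: int, latency_ms: float) -> str:
    """Build a CloudWatch embedded metric format (EMF) log line that
    publishes per-AZ application metrics. The namespace and metric
    names are illustrative choices, not fixed by the service."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                # Every metric below is emitted with the AZ as a dimension.
                "Dimensions": [["AvailabilityZone"]],
                "Metrics": [
                    {"Name": "Fault", "Unit": "Count"},
                    {"Name": "Latency", "Unit": "Milliseconds"},
                ],
            }],
        },
        "AvailabilityZone": az_id,  # the target's AZ
        "Fault": fault,
        "Latency": latency_ms,
    })

# Writing this line to a log agent or stdout that flows into CloudWatch
# Logs causes the metrics to be extracted automatically:
print(emf_record("use1-az2", fault=1, latency_ms=187.5))
```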
You can also monitor the UnHealthyHostCount target metric for your load balancers. If the AZ impairment is causing targets to fail their health checks, this is a direct signal of that impact. To automatically respond to this metric, you can use the target_group_health.dns_failover.minimum_healthy_targets.count attribute for your NLB or ALB target groups. This ensures the load balancer automatically shifts away from an AZ when there are too few healthy hosts.
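A sketch of setting that attribute follows; the threshold of 2 is an illustrative value, not a recommendation, and the boto3 call in the comment assumes a placeholder target group ARN:

```python
# Sketch: target group attribute that lets the load balancer fail away
# from an AZ on its own when healthy capacity in that AZ drops too low.
dns_failover_attributes = [
    {"Key": "target_group_health.dns_failover.minimum_healthy_targets.count",
     "Value": "2"},  # illustrative threshold
]

# elbv2 = boto3.client("elbv2")
# elbv2.modify_target_group_attributes(
#     TargetGroupArn="arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-tg/...",
#     Attributes=dns_failover_attributes)
```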
Using either ALB metrics or custom server-side metrics, you can create CloudWatch alarms to alert you to impacts in each AZ. In this example, I am using ALB metrics with cross-zone load balancing enabled. I configure the alarms to be triggered when latency from targets exceeds a certain threshold or availability drops below a specified value.
The latency alarm uses the following metric (Figure 5):
And the availability alarm uses metric math to determine the fault rate for the AZ (Figure 6):
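The availability alarm can be sketched as a put_metric_alarm payload that uses metric math over the ALB metrics carrying the AvailabilityZone dimension; the alarm name, periods, and threshold below are illustrative choices:

```python
# Sketch: per-AZ fault-rate alarm built from ALB CloudWatch metrics.
# The evaluation periods and threshold are illustrative, not prescriptive.
def az_fault_rate_alarm(lb: str, az: str, threshold: float) -> dict:
    dims = [{"Name": "LoadBalancer", "Value": lb},
            {"Name": "AvailabilityZone", "Value": az}]
    return {
        "AlarmName": f"{az}-fault-rate",
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {"Id": "m1", "ReturnData": False, "MetricStat": {
                "Metric": {"Namespace": "AWS/ApplicationELB",
                           "MetricName": "HTTPCode_Target_5XX_Count",
                           "Dimensions": dims},
                "Period": 60, "Stat": "Sum"}},
            {"Id": "m2", "ReturnData": False, "MetricStat": {
                "Metric": {"Namespace": "AWS/ApplicationELB",
                           "MetricName": "RequestCount",
                           "Dimensions": dims},
                "Period": 60, "Stat": "Sum"}},
            # Fault rate for the AZ as a percentage of its requests.
            {"Id": "e1", "Expression": "(m1/m2)*100",
             "Label": "AZ fault rate", "ReturnData": True},
        ],
    }

# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**az_fault_rate_alarm(
#     "app/my-alb/1234567890abcdef", "us-east-1a", threshold=1.0))
```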
Finally, I configure a CloudWatch composite alarm to identify either availability or latency impact in a single AZ, as shown in Figure 7.
Next, I will use the same ALB metrics to compare fault rate and latency among each AZ to know when a single AZ is an outlier.
Performing outlier detection
When one AZ is an outlier for a health metric, this can be a good indication that there is a problem localized to that fault isolation boundary. There are a number of different outlier detection algorithms you can use to compare health metrics like chi-squared, z-score, interquartile range (IQR), and median absolute deviation (MAD). A simpler way to get started is to use a static value like 66%, meaning that if one AZ is responsible for 66% of the total faults, it is considered an outlier.
Figure 8 shows a CloudWatch metric, e1, calculated using metric math. It determines the percentage of overall faults attributable to a single AZ, us-east-1b in this case. I can set an alarm on this metric when the value is greater than 0.66.
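The same static-threshold check can be expressed locally as a short function; the fault counts below are illustrative:

```python
# Minimal sketch of the static-threshold outlier check: an AZ is treated
# as an outlier when it contributes more than 66% of all faults.
def is_fault_outlier(faults_by_az: dict, az: str, threshold: float = 0.66) -> bool:
    total = sum(faults_by_az.values())
    if total == 0:
        return False  # no faults anywhere, so nothing is an outlier
    return faults_by_az[az] / total > threshold

faults = {"us-east-1a": 3, "us-east-1b": 40, "us-east-1c": 5}
print(is_fault_outlier(faults, "us-east-1b"))  # us-east-1b has ~83% of faults
```

In CloudWatch this is the same shape as the metric math expression in Figure 8: one AZ's fault sum divided by the sum across all AZs, alarmed above 0.66.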
For latency, I use z-score, which determines how many standard deviations away from the average a data point is. 99.7% of normally distributed data falls within 3 standard deviations, so exceeding a value of 3 would indicate the value is an outlier. This calculation looks at p99 latency and uses the averages from the 2 other AZs I’m comparing this AZ against (using the Metrics() math function) to ensure the outlier latency doesn’t skew the standard deviation. Figure 9 shows the calculation using CloudWatch metric math. I can set an alarm on this metric when it exceeds a value of 3.
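The z-score calculation can be sketched in a few lines; the p99 latency values are illustrative:

```python
import statistics

# Sketch of the z-score check: compare one AZ's p99 latency against the
# mean and standard deviation of the *other* AZs only, so the suspect AZ
# cannot skew its own baseline.
def latency_z_score(candidate_p99: float, other_p99s: list) -> float:
    baseline = statistics.mean(other_p99s)
    spread = statistics.stdev(other_p99s)  # sample std dev of the other AZs
    return (candidate_p99 - baseline) / spread

# AZ 2's p99 is far above AZ 1 and AZ 3:
z = latency_z_score(900.0, [100.0, 120.0])
print(z > 3)  # exceeding 3 standard deviations flags an outlier
```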
Identifying multi-instance impact
If your targets are failing their health checks, the UnHealthyHostCount target metric can help identify whether the impact is being caused by more than one instance. If you are producing structured CloudWatch logs, you can also use CloudWatch Contributor Insights. This service helps determine the number of contributors to faults or latency in your application using the UniqueContributors metric for your insights rule. Figure 10 shows an example of a CloudWatch metric using Contributor Insights metric math:
You can set an alarm on this metric when the value exceeds 1 (you may want to use a larger number depending on the size of your fleet) to indicate more than one instance is experiencing errors.
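A hedged sketch of such a Contributor Insights rule, with the matching metric math expression, follows; the log group name and the JSON field names ($.instanceId, $.status) are assumptions about your log format:

```python
import json

# Sketch: a Contributor Insights rule over structured (JSON) access logs
# that counts distinct instances producing 5xx responses. The log group
# and field names are illustrative assumptions.
rule_body = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/myapp/access-logs"],
    "LogFormat": "JSON",
    "Contribution": {
        "Keys": ["$.instanceId"],                          # one contributor per instance
        "Filters": [{"Match": "$.status", "GreaterThan": 499}],  # 5xx only
    },
    "AggregateOn": "Count",
}
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_insight_rule(RuleName="myapp-5xx-contributors",
#                             RuleDefinition=json.dumps(rule_body))

# The alarm then uses metric math over the rule's UniqueContributors metric:
unique_contributors_expr = \
    'INSIGHT_RULE_METRIC("myapp-5xx-contributors", UniqueContributors)'
```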
Putting it all together
You now have alarms for the three conditions that help identify single-AZ impact:
- Availability or latency impact in the AZ
- The AZ is an outlier for faults or latency
- The impact is being experienced by multiple instances
A final CloudWatch composite alarm, shown in Figure 11, will combine the signal from each of these to tell you when there is single-AZ impact that you can use zonal shift to respond to.
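The composite alarm rule combining the three signals might be sketched as follows; the alarm names are placeholders standing in for the per-AZ alarms created earlier:

```python
# Sketch: composite alarm rule that fires only when all three per-AZ
# signals agree. Alarm names are illustrative placeholders.
az = "us-east-1a"
alarm_rule = (
    f'ALARM("{az}-impact") '            # availability or latency impact in the AZ
    f'AND ALARM("{az}-outlier") '       # the AZ is an outlier vs. its peers
    f'AND ALARM("{az}-multi-instance")' # more than one instance is affected
)
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_composite_alarm(AlarmName=f"{az}-single-az-impact",
#                                AlarmRule=alarm_rule)
print(alarm_rule)
```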
These per-AZ alarms can also be added to your dashboards to help operators quickly identify single-AZ impairments (Figure 12).
Conclusion
In this post I reviewed how zonal shift works with cross-zone load balancing enabled. I also shared operational best practices for monitoring impact to your application’s health in a single AZ. To get started with zonal shift or zonal autoshift, check out Amazon Application Recovery Controller’s documentation.