Networking & Content Delivery
Using cross-zone load balancing with zonal shift
Today, we’re announcing Amazon Application Recovery Controller (ARC) zonal shift support for Application Load Balancers (ALB) with cross-zone load balancing enabled. This complements the support for Network Load Balancers (NLB) using cross-zone load balancing we announced previously. Now you can use zonal shift with both NLBs and ALBs, with or without cross-zone load balancing configured, as well as with other resources such as Amazon EC2 Auto Scaling groups (ASG) and Amazon Elastic Kubernetes Service (EKS). The blog post Rapidly recover from application failures in a single AZ provided an overview of how zonal shift works and associated best practices when cross-zone load balancing is disabled. This post will provide operational best practices for using zonal shift with cross-zone load balancing enabled.
Overview
To start using zonal shift for ALB or NLB, you must set the load balancer attribute zonal_shift.config.enabled to true. For NLBs using cross-zone load balancing, you must also ensure that target_health_state.unhealthy.connection_termination.enabled is set to false. With the feature enabled, you can start a zonal shift to mitigate impact when you identify an impairment in a single Availability Zone (AZ).
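As a sketch of that setup, the attribute payloads might look like the following; a boto3 elbv2 client is assumed and the ARNs in the comments are placeholders:

```python
# Sketch: attribute payloads for enabling zonal shift. The boto3 calls
# and ARNs in the comments are illustrative placeholders.
lb_attributes = [
    # Allow ARC zonal shift to act on this load balancer.
    {"Key": "zonal_shift.config.enabled", "Value": "true"},
]

nlb_target_group_attributes = [
    # Required for NLB target groups with cross-zone load balancing:
    # do not terminate connections to unhealthy targets.
    {"Key": "target_health_state.unhealthy.connection_termination.enabled",
     "Value": "false"},
]

# With credentials configured, the calls would look like:
# elbv2 = boto3.client("elbv2")
# elbv2.modify_load_balancer_attributes(
#     LoadBalancerArn="arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/app/my-alb/...",
#     Attributes=lb_attributes)
# elbv2.modify_target_group_attributes(
#     TargetGroupArn="arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-tg/...",
#     Attributes=nlb_target_group_attributes)
```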
Zonal shift takes two actions when cross-zone load balancing is enabled. First, it removes the IP address of the load balancer node in the specified AZ from DNS, so new queries won’t resolve to that endpoint. This stops future client requests from being sent to that node. Second, it instructs the load balancer nodes in the other AZs not to route requests to targets in the impaired AZ. Cross-zone load balancing is still utilized in the remaining AZs during the zonal shift, as shown in Figure 1.
You may choose to also perform a zonal shift on your ASG behind the load balancer during an AZ impairment. If you’ve configured the ASG to replace unhealthy instances during a zonal shift, this may result in instances being terminated in the impaired AZ and new instances being launched in the other AZs. It’s also possible that EC2 Auto Scaling will scale out your application during the zonal shift and launch those new instances in the unaffected AZs. This can create a capacity imbalance among your AZs.
When you determine the AZ impairment has ended, you can cancel the shift and rebalance traffic into the AZ. Cross-zone load balancing helps make rebalancing safer when you have a capacity imbalance because the overall traffic percentage received per target will decrease when you end the load balancer zonal shift. This happens because the load balancer distributes traffic evenly across each target in your target group, as shown in Figure 2.
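Starting and later canceling a shift through the ARC zonal shift API might look like the following sketch; a boto3 arc-zonal-shift client is assumed, and the resource ARN, zone ID, duration, and comment are placeholders:

```python
# Sketch: request payload for starting a zonal shift, then canceling it
# once the AZ has recovered. All values are illustrative placeholders.
start_params = {
    "resourceIdentifier": "arn:aws:elasticloadbalancing:<region>:<account>:loadbalancer/app/my-alb/...",
    "awayFrom": "use1-az1",   # zone ID of the impaired AZ
    "expiresIn": "3h",        # shifts expire so traffic is never shifted away indefinitely
    "comment": "Elevated 5xx rate isolated to use1-az1",
}

# With credentials configured, the calls would look like:
# arc = boto3.client("arc-zonal-shift")
# shift = arc.start_zonal_shift(**start_params)
# ...after confirming the AZ impairment has ended:
# arc.cancel_zonal_shift(zonalShiftId=shift["zonalShiftId"])
```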
In contrast, with cross-zone load balancing disabled, the load balancer distributes traffic evenly across AZs; each load balancer node then distributes requests across the available targets in its own zone. A capacity imbalance among AZs can cause certain instances to receive more load than others after you end the load balancer zonal shift. This could lead to overload and impact to your application. For example, Figure 3 shows how the instance in AZ 2 receives approximately twice as much traffic as the targets in AZ 1 and AZ 3. In this configuration, it’s important to use target_group_health.dns_failover.minimum_healthy_targets.count to prevent the AZ from accepting traffic until enough healthy hosts are available.
Cross-zone load balancing is enabled by default for ALBs and can optionally be enabled for NLBs. This allows you to take advantage of zonal shift without having to make large-scale changes to the configuration of your ALB target groups. You can also opt in to zonal autoshift for your ALBs in their default configuration. AWS starts an autoshift when internal telemetry indicates that there is an AZ impairment that could potentially impact customers. You can use zonal autoshift in conjunction with the weighted random routing algorithm. This helps you minimize recovery time during an event, and reduces the additional observability you need to build to take advantage of zonal shift.
While zonal autoshift and Automatic Target Weights (ATW) anomaly mitigation are the preferred ways to react to single-AZ impacts, these tools may not detect certain infrastructure gray failures or single-AZ application impairments. For example, an application deployment containing a bug that was deployed to a single AZ, or a small amount of packet loss impacting a handful of instances that starts causing application errors. You may need to develop additional observability to detect these situations. In the next section, I examine how to detect single-AZ impairments with cross-zone load balancing enabled.
AZ observability for zonal shift with cross-zone load balancing enabled
Monitoring metrics such as request count, fault rate, and latency per AZ is a prerequisite for determining when an AZ may be experiencing an impairment, and allows you to safely mitigate potential impact. The following three signals can help you know when to use zonal shift.
- AZ health metrics showing availability or latency impact.
- The AZ is an outlier for fault rate or latency compared to the other AZs.
- The fault rate or high latency is caused by more than a single instance.
Let’s review how you can start collecting metrics about the health of your application in each AZ.
Creating AZ health metrics
One of the observability best practices for resilience is to monitor your customer experience with synthetic canaries. These act as an early-warning indicator so you can notify yourself of a problem before your customers do. In the post Rapidly recover from application failures in a single AZ, we used Amazon CloudWatch Synthetics to monitor the zonal endpoints of your ALBs and NLBs to produce per-AZ metrics, as shown in Figure 4.
Synthetics are still a best practice with cross-zone load balancing enabled. However, it’s not as useful to test each zonal endpoint for an ALB or NLB because the response could come from a target in any AZ. Instead, for ALBs, you can use the ALB load balancer Amazon CloudWatch metrics to identify when targets in a specific AZ show elevated fault rates or latency. ALB target metrics provide 2XX, 3XX, 4XX, and 5XX counts as well as a metric for TargetResponseTime. All of these metrics have AvailabilityZone as a metric dimension, which represents the AZ of the target that produced the response.
For NLBs it can be more difficult to determine changes in application health because their target metrics are mostly layer 4 information. You could monitor the TCP_Target_Reset_Count metric as a possible proxy for application health, but this may still be insufficient. When cross-zone load balancing is enabled on your NLB or its target groups, you should utilize custom server-side metrics that provide the target’s AZ as a metric dimension. Refer to Publishing custom metrics and the CloudWatch embedded metric format for more details on how to achieve this.
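As a minimal sketch, an embedded metric format (EMF) log line carrying the target's AZ as a dimension could be built like this; the MyApp namespace and metric names are illustrative assumptions:

```python
import json
import time

def emf_record(az_id: str, fault: int, latency_ms: float) -> str:
    """Build a CloudWatch embedded metric format (EMF) log line that
    publishes per-AZ application metrics. The namespace and metric
    names are illustrative choices, not fixed by the service."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",
                # Every metric below is emitted with the AZ as a dimension.
                "Dimensions": [["AvailabilityZone"]],
                "Metrics": [
                    {"Name": "Fault", "Unit": "Count"},
                    {"Name": "Latency", "Unit": "Milliseconds"},
                ],
            }],
        },
        "AvailabilityZone": az_id,  # the target's AZ
        "Fault": fault,
        "Latency": latency_ms,
    })

# Writing this line to a log agent or stdout that flows into CloudWatch
# Logs causes the metrics to be extracted automatically:
print(emf_record("use1-az2", fault=1, latency_ms=187.5))
```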
You can also monitor the UnHealthyHostCount target metric for your load balancers. If the AZ impairment is causing targets to fail their health checks, this is a direct signal of that impact. To automatically respond to this metric, you can use the target_group_health.dns_failover.minimum_healthy_targets.count attribute for your NLB or ALB target groups. This ensures the load balancer automatically shifts away from an AZ when there are too few healthy hosts.
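A sketch of setting that attribute follows; the threshold of 2 is an illustrative value, not a recommendation, and the boto3 call in the comment assumes a placeholder target group ARN:

```python
# Sketch: target group attribute that lets the load balancer fail away
# from an AZ on its own when healthy capacity in that AZ drops too low.
dns_failover_attributes = [
    {"Key": "target_group_health.dns_failover.minimum_healthy_targets.count",
     "Value": "2"},  # illustrative threshold
]

# elbv2 = boto3.client("elbv2")
# elbv2.modify_target_group_attributes(
#     TargetGroupArn="arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/my-tg/...",
#     Attributes=dns_failover_attributes)
```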
Using either ALB metrics or custom server-side metrics, you can create CloudWatch alarms to alert you to impacts in each AZ. In this example, I am using ALB metrics with cross-zone load balancing enabled. I configure the alarms to be triggered when latency from targets exceeds a certain threshold or availability drops below a specified value.
The latency alarm uses the following metric (Figure 5):
And the availability alarm uses metric math to determine the fault rate for the AZ (Figure 6):
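The availability alarm can be sketched as a put_metric_alarm payload that uses metric math over the ALB metrics carrying the AvailabilityZone dimension; the alarm name, periods, and threshold below are illustrative choices:

```python
# Sketch: per-AZ fault-rate alarm built from ALB CloudWatch metrics.
# The evaluation periods and threshold are illustrative, not prescriptive.
def az_fault_rate_alarm(lb: str, az: str, threshold: float) -> dict:
    dims = [{"Name": "LoadBalancer", "Value": lb},
            {"Name": "AvailabilityZone", "Value": az}]
    return {
        "AlarmName": f"{az}-fault-rate",
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {"Id": "m1", "ReturnData": False, "MetricStat": {
                "Metric": {"Namespace": "AWS/ApplicationELB",
                           "MetricName": "HTTPCode_Target_5XX_Count",
                           "Dimensions": dims},
                "Period": 60, "Stat": "Sum"}},
            {"Id": "m2", "ReturnData": False, "MetricStat": {
                "Metric": {"Namespace": "AWS/ApplicationELB",
                           "MetricName": "RequestCount",
                           "Dimensions": dims},
                "Period": 60, "Stat": "Sum"}},
            # Fault rate for the AZ as a percentage of its requests.
            {"Id": "e1", "Expression": "(m1/m2)*100",
             "Label": "AZ fault rate", "ReturnData": True},
        ],
    }

# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**az_fault_rate_alarm(
#     "app/my-alb/1234567890abcdef", "us-east-1a", threshold=1.0))
```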
Finally, I configure a CloudWatch composite alarm to identify either availability or latency impact in a single AZ, as shown in Figure 7.
Next, I will use the same ALB metrics to compare fault rate and latency among each AZ to know when a single AZ is an outlier.
Performing outlier detection
When one AZ is an outlier for a health metric, this can be a good indication that there is a problem localized to that fault isolation boundary. There are a number of different outlier detection algorithms you can use to compare health metrics like chi-squared, z-score, interquartile range (IQR), and median absolute deviation (MAD). A simpler way to get started is to use a static value like 66%, meaning that if one AZ is responsible for 66% of the total faults, it is considered an outlier.
Figure 8 shows a CloudWatch metric, e1, calculated using metric math. It determines the percentage of overall faults attributable to a single AZ, us-east-1b in this case. I can set an alarm on this metric when the value is greater than 0.66.
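The same static-threshold check can be expressed locally as a short function; the fault counts below are illustrative:

```python
# Minimal sketch of the static-threshold outlier check: an AZ is treated
# as an outlier when it contributes more than 66% of all faults.
def is_fault_outlier(faults_by_az: dict, az: str, threshold: float = 0.66) -> bool:
    total = sum(faults_by_az.values())
    if total == 0:
        return False  # no faults anywhere, so nothing is an outlier
    return faults_by_az[az] / total > threshold

faults = {"us-east-1a": 3, "us-east-1b": 40, "us-east-1c": 5}
print(is_fault_outlier(faults, "us-east-1b"))  # us-east-1b has ~83% of faults
```

In CloudWatch this is the same shape as the metric math expression in Figure 8: one AZ's fault sum divided by the sum across all AZs, alarmed above 0.66.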
For latency, I use z-score, which determines how many standard deviations away from the average a data point is. 99.7% of normally distributed data falls within 3 standard deviations, so exceeding a value of 3 would indicate the value is an outlier. This calculation looks at p99 latency and uses the averages from the 2 other AZs I’m comparing this AZ against (using the Metrics() math function) to ensure the outlier latency doesn’t skew the standard deviation. Figure 9 shows the calculation using CloudWatch metric math. I can set an alarm on this metric when it exceeds a value of 3.
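The z-score calculation can be sketched in a few lines; the p99 latency values are illustrative:

```python
import statistics

# Sketch of the z-score check: compare one AZ's p99 latency against the
# mean and standard deviation of the *other* AZs only, so the suspect AZ
# cannot skew its own baseline.
def latency_z_score(candidate_p99: float, other_p99s: list) -> float:
    baseline = statistics.mean(other_p99s)
    spread = statistics.stdev(other_p99s)  # sample std dev of the other AZs
    return (candidate_p99 - baseline) / spread

# AZ 2's p99 is far above AZ 1 and AZ 3:
z = latency_z_score(900.0, [100.0, 120.0])
print(z > 3)  # exceeding 3 standard deviations flags an outlier
```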
Identifying multi-instance impact
If your targets are failing their health checks, the UnHealthyHostCount target metric can help identify whether the impact is being caused by more than one instance. If you are producing structured CloudWatch logs, you can also use CloudWatch Contributor Insights. This service helps determine the number of contributors to faults or latency in your application using the UniqueContributors metric for your insights rule. Figure 10 shows an example of a CloudWatch metric using Contributor Insights metric math:
You can set an alarm on this metric when the value exceeds 1 (you may want to use a larger number depending on the size of your fleet) to indicate more than one instance is experiencing errors.
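A hedged sketch of such a Contributor Insights rule, with the matching metric math expression, follows; the log group name and the JSON field names ($.instanceId, $.status) are assumptions about your log format:

```python
import json

# Sketch: a Contributor Insights rule over structured (JSON) access logs
# that counts distinct instances producing 5xx responses. The log group
# and field names are illustrative assumptions.
rule_body = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/myapp/access-logs"],
    "LogFormat": "JSON",
    "Contribution": {
        "Keys": ["$.instanceId"],                          # one contributor per instance
        "Filters": [{"Match": "$.status", "GreaterThan": 499}],  # 5xx only
    },
    "AggregateOn": "Count",
}
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_insight_rule(RuleName="myapp-5xx-contributors",
#                             RuleDefinition=json.dumps(rule_body))

# The alarm then uses metric math over the rule's UniqueContributors metric:
unique_contributors_expr = \
    'INSIGHT_RULE_METRIC("myapp-5xx-contributors", UniqueContributors)'
```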
Putting it all together
You now have alarms for the three conditions that help identify single-AZ impact:
- Availability or latency impact in the AZ
- The AZ is an outlier for faults or latency
- The impact is being experienced by multiple instances
A final CloudWatch composite alarm, shown in Figure 11, will combine the signal from each of these to tell you when there is single-AZ impact that you can use zonal shift to respond to.
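The composite alarm rule combining the three signals might be sketched as follows; the alarm names are placeholders standing in for the per-AZ alarms created earlier:

```python
# Sketch: composite alarm rule that fires only when all three per-AZ
# signals agree. Alarm names are illustrative placeholders.
az = "us-east-1a"
alarm_rule = (
    f'ALARM("{az}-impact") '            # availability or latency impact in the AZ
    f'AND ALARM("{az}-outlier") '       # the AZ is an outlier vs. its peers
    f'AND ALARM("{az}-multi-instance")' # more than one instance is affected
)
# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_composite_alarm(AlarmName=f"{az}-single-az-impact",
#                                AlarmRule=alarm_rule)
print(alarm_rule)
```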
These per-AZ alarms can also be added to your dashboards to help operators quickly identify single-AZ impairments (Figure 12).
Conclusion
In this post I reviewed how zonal shift works with cross-zone load balancing enabled. I also shared operational best practices for monitoring impact to your application’s health in a single AZ. To get started with zonal shift or zonal autoshift, check out Amazon Application Recovery Controller’s documentation.