
Accelerating feature delivery and improving reliability for a semi-stateful, memory-bound workload

This blog post was co-written by William Ho, Software Engineer, Airtable.

Introduction

Airtable is a connected applications platform that lets teams and enterprises build flexible interfaces and compose automations on top of their key data. That flexibility means customers rely on Airtable for some of the most critical workflows across their organizations. Today, half of the Fortune 500 use Airtable, and as the user base continues to grow, Airtable is doubling down on product investments to make sure the platform keeps meeting the needs of its enterprise customers.

The Service Orchestration team at Airtable was tasked with architecting the deployment infrastructure to ensure seamless scalability as the product's user base grew. As part of this, the team's mandate was clear. They had to:

  1. Automate the process of deploying new code while ensuring data and user security
  2. Build reliable, easy-to-use frameworks to allow software development teams to deploy and operate their services in a self-service manner, with guardrails in place

In 2021, the Airtable Service Orchestration team decided to migrate their code from binaries on self-managed Amazon Elastic Compute Cloud (Amazon EC2) instances to containerized deployments using Kubernetes on Amazon Elastic Kubernetes Service (Amazon EKS). As part of this, the team ramped up on Amazon EKS, with support from AWS Solutions Architects and Specialists, to create, operate, and upgrade Kubernetes clusters so that their service owners (i.e., software development teams) don't have to worry about the infrastructure (in this case, Kubernetes) their code runs on.

A semi-stateful workload service

The Airtable application is partitioned into user bases. A user base is like a workbook in a traditional spreadsheet and can contain multiple tables of content or data. Airtable built a worker service to serve its user bases, and this service is their largest workload. Inside every container of this service, a parent process spawns multiple child processes. Each child process runs the business logic for one base and hosts an in-memory store of all data pertaining to that base, while the parent routes incoming requests to the appropriate child and maintains a pool of idle children for faster base loads.
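To make the parent/child pattern concrete, here is a minimal, hypothetical sketch (not Airtable's code) of a parent process that routes each base's requests to the single child process owning that base and keeps a small pool of idle children ready for newly requested bases. The pool size, message shapes, and the in-memory "load" are all stand-ins.

```python
# Hypothetical sketch of the parent/child pattern described above: the parent
# routes requests for a base to the one child that owns it, and keeps a pool
# of pre-spawned idle children so a new base can be assigned without paying
# process-startup cost.
from multiprocessing import Pipe, Process

IDLE_POOL_SIZE = 2  # hypothetical pool size


def child_main(conn):
    """Child: loads one base into memory on assignment, then serves requests."""
    base_data = None
    while True:
        msg = conn.recv()
        if msg["type"] == "assign":
            # Stand-in for the real multi-minute load from Amazon RDS.
            base_data = {"base_id": msg["base_id"], "rows": []}
        elif msg["type"] == "request":
            conn.send({"base_id": base_data["base_id"], "result": "ok"})


def spawn_child():
    parent_conn, child_conn = Pipe()
    Process(target=child_main, args=(child_conn,), daemon=True).start()
    return parent_conn


class Parent:
    def __init__(self):
        self.base_to_child = {}  # base_id -> pipe to the owning child
        self.idle = [spawn_child() for _ in range(IDLE_POOL_SIZE)]

    def route(self, base_id, request):
        conn = self.base_to_child.get(base_id)
        if conn is None:
            # First request for this base: take an idle child and assign it.
            conn = self.idle.pop() if self.idle else spawn_child()
            conn.send({"type": "assign", "base_id": base_id})
            self.base_to_child[base_id] = conn
            self.idle.append(spawn_child())  # refill the idle pool
        conn.send({"type": "request", "payload": request})
        return conn.recv()


if __name__ == "__main__":
    parent = Parent()
    print(parent.route("baseA", {"op": "list_rows"}))
```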

When a base is loaded, its data is read from Amazon Relational Database Service (Amazon RDS) instances, a process that can take over five minutes for complex bases, during which no requests can be served. It was therefore important to minimize the frequency of base reloads to ensure a good, consistent user experience. Because Amazon RDS is the persistent store for base data, a worker pod can be gracefully terminated without data loss. In that sense, the base workload service is not strictly a stateful workload (i.e., the data isn't persisted to an Amazon Elastic Block Store (Amazon EBS) volume or any other store owned by the pod). However, it does pull all data for a customer's base into memory, which may take several minutes, and each base is served by at most one worker process at a time, so the workload is semi-stateful in nature.

To ensure the application met the performance and customer experience requirements, the service was deployed as 10 identical containers per pod, with one pod per node. This is a bit of an antipattern for Kubernetes-based deployments, but it was a necessary control to minimize product changes as the team moved off bare-bones, self-managed Amazon EC2 deployments. In addition, reading all data into memory greatly simplified the business logic, reduced network latency and transient failures, and removed the need to maintain data consistency across multiple writers. It also sped up operations such as sorting and filtering data. All of this translated directly into a better user experience.

How Amazon EKS addresses some of our current challenges

With the Amazon EC2-based deployments, the worker service code (and all other services) was baked into an Amazon Machine Image (AMI) that was then launched by Amazon EC2 Auto Scaling groups. The build system took 50–60 minutes to create an AMI, creating the Auto Scaling groups via AWS CloudFormation took another 15–20 minutes, and starting the actual Amazon EC2 instances took another 10 minutes. This made patching applications a slow process. In addition, there was no concept of liveness checks when running directly on Amazon EC2 instances, so the team had built and maintained custom code and watched dashboards carefully for availability and latency regressions. This was undifferentiated heavy lifting that took time away from value-added activities the team could otherwise focus on, and it prompted their decision to move to Kubernetes.

By using an industry-standard container orchestration framework, the Airtable Service Orchestration team was able to apply established best practices and hire industry experts. Kubernetes also provided features out of the box, such as liveness checks, rolling updates, easy autoscaling, node draining, and placement preferences, that were key requirements for the new platform. After a short evaluation period, the team decided to move to Amazon's managed Kubernetes service, Amazon EKS, to avoid the overhead of deploying and managing the Kubernetes control plane. The team appreciated how easy it is to create and operate Kubernetes clusters with Amazon EKS, thanks to the managed control plane and integrations with other AWS services such as AWS Identity and Access Management (AWS IAM) and Amazon Virtual Private Cloud (Amazon VPC) security groups.

In addition, they were looking to make the deployment of new code versions safer and easier by using capabilities such as rolling updates and Kubernetes liveness checks to replace their home-grown code. The team also wanted the ability to run canary analysis automatically and to roll out faster, with the ability to rapidly roll back a deployment when bugs are discovered in production, thereby improving the safety of deployments and reducing site downtime.

In the long term, they aspired to have individual software development teams update their services on their own schedules, instead of having the Service Orchestration team roll out a monolithic deployment that updates all services.

Some of the metrics the team was targeting for improvement included:

  1. Reduce false positives and false negatives in canary analysis
  2. Reduce the time to roll out a new code version
  3. Reduce the number of new code deployments that need human intervention
  4. Increase the number of single-service hotfixes performed by service owners themselves, rather than by the Service Orchestration team

Technical challenges

The team ran into a few technical challenges with Amazon EKS early on, arising from the semi-stateful nature of their workload. One of the first was that, by default, Amazon EKS and Kubernetes assume pods are fungible, so terminating pods for bin-packing and inter-Availability Zone (AZ) balancing is standard, acceptable behavior. This didn't work for their application: each base is served by a specific pod, so pod churn had to be minimized. To get around this, Airtable disabled the AZRebalance process on the autoscaling groups underlying their Amazon EKS managed node groups. They also created mega-pods with 10 identical containers per pod and tuned their CPU/memory requests so that only one pod fits on a node; as a result, every node is either full or empty, and the Cluster Autoscaler can reclaim underutilized nodes without causing pod churn. Airtable uses {r6a, r6i, r5}.4xlarge instances for their memory capacity and CPU-to-memory ratio. Furthermore, their base-to-pod routing system was modified to prefer pods running new code versions during rolling updates, thereby minimizing the number of times a base is reloaded.
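The AZRebalance part of this is straightforward to express in code. The sketch below shows one way to suspend that process on the autoscaling groups behind an EKS managed node group using boto3; the cluster and node group names are hypothetical, and this is an illustration of the approach rather than Airtable's actual tooling.

```python
# Hypothetical sketch: suspend the AZRebalance process on the autoscaling
# groups backing an EKS managed node group, so the group never terminates
# instances (and churns pods) just to rebalance capacity across AZs.
import boto3

CLUSTER_NAME = "prod-cluster"        # hypothetical
NODEGROUP_NAME = "worker-nodegroup"  # hypothetical

eks = boto3.client("eks")
autoscaling = boto3.client("autoscaling")

# A managed node group is backed by one or more EC2 Auto Scaling groups.
nodegroup = eks.describe_nodegroup(
    clusterName=CLUSTER_NAME, nodegroupName=NODEGROUP_NAME
)["nodegroup"]

for asg in nodegroup["resources"]["autoScalingGroups"]:
    autoscaling.suspend_processes(
        AutoScalingGroupName=asg["name"],
        ScalingProcesses=["AZRebalance"],
    )
    print(f"Suspended AZRebalance on {asg['name']}")
```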

From an upgrade standpoint, Airtable needed to update their AMI every two weeks to apply key security patches, and they also deployed new code on a bi-weekly cadence. The automated rolling node upgrade process that comes with Amazon EKS managed node groups caused base reloads (a full read back into memory from Amazon RDS) outside the regularly scheduled deployments, which hurt the user experience. To minimize the number of disruptive updates, the team decided to roll out the deployment and AMI updates concurrently. They included minor Kubernetes updates in these bi-weekly updates, and they adopted a blue/green node group update process that automatically creates and scales new node groups ahead of time. The old node groups are then tainted to prevent new pods from landing on them. Similarly, the team adopted blue/green updates at the cluster level when upgrading their Kubernetes version.
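The "taint the old node group" step could look roughly like the following sketch, which uses the Kubernetes Python client to apply a NoSchedule taint to every node in the outgoing node group. The node group name and taint key are hypothetical; the eks.amazonaws.com/nodegroup label is the standard label carried by managed node group nodes.

```python
# Hypothetical sketch: taint every node in the old ("blue") node group so new
# pods land only on the new ("green") node group during a blue/green update.
from kubernetes import client, config

OLD_NODEGROUP = "worker-nodegroup-blue"  # hypothetical

config.load_kube_config()
v1 = client.CoreV1Api()

nodes = v1.list_node(
    label_selector=f"eks.amazonaws.com/nodegroup={OLD_NODEGROUP}"
)
for node in nodes.items:
    existing = node.spec.taints or []
    new_taint = client.V1Taint(
        key="airtable.com/retiring",  # hypothetical taint key
        value="true",
        effect="NoSchedule",
    )
    # Re-send the full taint list so existing taints are preserved.
    v1.patch_node(node.metadata.name, {"spec": {"taints": existing + [new_taint]}})
    print(f"Tainted {node.metadata.name}")
```

Pods from the new rollout then schedule only onto the untainted green nodes, while existing pods on the blue nodes keep serving their bases until they are replaced on the normal schedule.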

With the new architecture, the team anticipated a large increase in the number of required IP addresses, because the Amazon VPC CNI networking model used by Amazon EKS assigns each pod its own routable IP address. To get around this, /20 subnets were allocated in each AZ, and the number of IPs cached by each node's IP address management daemon (ipamd) was reduced to two, avoiding the need to create a new VPC just to get larger Classless Inter-Domain Routing (CIDR) blocks.
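The per-node IP cache is controlled by environment variables on the aws-node DaemonSet. A minimal sketch of shrinking the cache, assuming the WARM_IP_TARGET knob is the one being tuned (the post only says the cache was reduced to two), could look like this:

```python
# Hypothetical sketch: reduce the VPC CNI's per-node warm IP cache by setting
# WARM_IP_TARGET on the aws-node DaemonSet, so ipamd keeps only a couple of
# spare IPs instead of a whole spare ENI's worth.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "aws-node",
                        "env": [
                            # Keep roughly 2 unattached IPs warm per node.
                            {"name": "WARM_IP_TARGET", "value": "2"},
                        ],
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_daemon_set(name="aws-node", namespace="kube-system", body=patch)
```

If the VPC CNI is managed as an EKS add-on, the same setting would need to be carried in the add-on configuration so it isn't overwritten on upgrades.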

Note: The default ipamd configuration caches a full Elastic Network Interface (ENI) worth of IPs plus whatever is unused on the partially full ENI. Assuming roughly 30 + 15 cached IPs per node, across 1,500 nodes at peak that is about 67,500 addresses, so not even a full /16 subnet (65,536 addresses) would have been sufficient.

Another set of technical challenges came from AWS service limits. One of these was the rate limit on AWS Secrets Manager, which limited how quickly new pods could load secrets during startup and therefore slowed down rollouts and rollbacks. The team worked around this with a shared secrets cache for pods from the same deployment on a host.
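A host-local shared cache of this kind might look roughly like the sketch below (not Airtable's implementation): the first pod on a node fetches a secret from AWS Secrets Manager and writes it to a hostPath-backed directory, and later pods from the same deployment read the cached copy instead of calling the API again. The cache path and secret name are hypothetical, and a production version would also need file locking, a TTL, and protection of the cached material.

```python
# Hypothetical sketch of a per-host shared secrets cache.
import json
import os

import boto3

CACHE_DIR = "/var/run/secrets-cache"  # hypothetical hostPath mount
secretsmanager = boto3.client("secretsmanager")


def get_secret(secret_id: str) -> dict:
    cache_file = os.path.join(CACHE_DIR, secret_id.replace("/", "_") + ".json")

    # Fast path: another pod on this host already fetched the secret.
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)

    # Slow path: one call to Secrets Manager, then cache for the other pods.
    value = secretsmanager.get_secret_value(SecretId=secret_id)["SecretString"]
    secret = json.loads(value)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_file, "w") as f:
        json.dump(secret, f)
    return secret


if __name__ == "__main__":
    db_config = get_secret("prod/worker/db-credentials")  # hypothetical secret
    print(sorted(db_config.keys()))
```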

Another issue Airtable encountered during rolling updates was VPC rate limits: throttling of ENI (de)allocation calls meant they could not create or terminate pods as quickly as they wanted. Airtable originally had a single-container-per-pod architecture, which greatly increased the number of pods. During a rolling update, a CreateNetworkInterface or DeleteNetworkInterface call is required per pod to create or delete the branch ENI assigned to that pod. With numerous pods (before the mega-pod re-architecture), they hit the rate limits and were throttled. While Airtable requested and received a rate limit increase from 5 to 15 requests per second, they realized this wasn't a scalable solution and decided to re-architect into mega-pods, with each pod containing 10 containers instead of one. This drastically reduced the number of API calls during a rolling update and left ample headroom for future scalability.

They also experienced throttling of DescribeCluster operations when multiple node groups were scaled up at the same time. New nodes would try and fail to join the cluster, leaving pods stuck in a Pending state while the Cluster Autoscaler repeatedly added and removed nodes in an attempt to obtain enough capacity. The Airtable team addressed this by staggering node group scale-ups ahead of updates, reducing the request rate so these operations were no longer throttled.
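A staggered scale-up ahead of an update can be as simple as growing one node group at a time with a pause in between, as in this hypothetical boto3 sketch (the node group names, sizes, and pause duration are all assumptions):

```python
# Hypothetical sketch: scale node groups one at a time, pausing between them,
# so the burst of DescribeCluster calls from bootstrapping nodes stays under
# the API rate limit.
import time

import boto3

CLUSTER_NAME = "prod-cluster"   # hypothetical
NODEGROUPS_TO_SCALE = {         # hypothetical node groups and target sizes
    "worker-nodegroup-green-a": 500,
    "worker-nodegroup-green-b": 500,
    "worker-nodegroup-green-c": 500,
}
PAUSE_BETWEEN_GROUPS = 300      # seconds; tuned to the observed node join rate

eks = boto3.client("eks")

for nodegroup, desired in NODEGROUPS_TO_SCALE.items():
    eks.update_nodegroup_config(
        clusterName=CLUSTER_NAME,
        nodegroupName=nodegroup,
        scalingConfig={"minSize": 0, "maxSize": desired, "desiredSize": desired},
    )
    print(f"Scaling {nodegroup} to {desired} nodes")
    # Let this batch of nodes bootstrap and join before starting the next one.
    time.sleep(PAUSE_BETWEEN_GROUPS)
```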

Solution overview

Amazon EKS architecture

The Amazon EKS Cluster Architecture is as described in the following diagram.

The deployment started out in us-east-1, with subnets in each AZ for each of the following workload types:

  1. Private: Pods cannot communicate with the wider internet. They can only talk to AWS services (e.g., Amazon RDS, Amazon ElastiCache, Amazon Simple Storage Service [Amazon S3]) and other Airtable-created instances.
  2. Private open outbound: Similar to private, but pods can make outbound connections to the public internet.
  3. Public: Pods can communicate bidirectionally with the public internet. These are cordoned off from the other workload types except for a few allowed targets.

IP addresses: Before the migration to Amazon EKS, raw Amazon EC2 instances obtained IP addresses from one of the /22 subnets in each AZ in us-east-1. Each instance lived for 2–5 days, until a deployment occurred by creating new instances with the new code baked into the AMI. After the migration, worker pods obtain IP addresses from one of the /20 subnets in each AZ, and security groups are attached to each pod's branch ENI instead of the host's main ENI. Worker pods still live for 2–5 days, while the Amazon EKS hosts usually live for 1–2 weeks before being replaced to pick up new security patches.

CoreDNS challenge: When an Amazon EKS cluster is created, there are by default two CoreDNS pods, each requesting 0.1 CPU cores and with no CPU limit. The sheer number of worker pods (up to 15,000 at peak before the mega-pod re-architecture) making Domain Name System (DNS) queries meant that both pods routinely used far more than their requested CPU. This became a problem during rolling updates, when worker pods consumed large amounts of CPU during startup and the headroom CoreDNS relied on no longer existed, which showed up as DNS timeouts and spurious NXDOMAIN responses. Airtable addressed this by scaling CoreDNS up to 16 pods, each requesting a full core, while also capping CPU usage at one core per pod to provide predictable behavior even when the host is busy.
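The CoreDNS change itself is a small Deployment patch. A minimal sketch, using the Kubernetes Python client and the numbers quoted above:

```python
# Hypothetical sketch: scale the kube-system coredns Deployment to 16 replicas
# and give each pod a guaranteed (and capped) full CPU core. Memory settings
# are left untouched here.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "replicas": 16,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "coredns",
                        "resources": {
                            "requests": {"cpu": "1"},  # full core reserved per pod
                            "limits": {"cpu": "1"},    # capped for predictable behavior
                        },
                    }
                ]
            }
        },
    }
}

apps.patch_namespaced_deployment(name="coredns", namespace="kube-system", body=patch)
```

If CoreDNS is managed as an EKS add-on, the same replica count and resource settings would need to be reflected in the add-on configuration so they aren't reverted during upgrades.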

Node diagram:

Benefits with Amazon EKS

Faster software upgrades: Since the move to Amazon EKS, Airtable has observed a speed-up in rollouts and rollbacks and has reduced deployment time by more than 50%. It now takes 20 minutes to create a new Docker image and upload it to Amazon Elastic Container Registry (Amazon ECR), and another 20 minutes to perform a rolling update across the Amazon EKS clusters. Rollbacks take only 20 minutes because the desired image already exists, and Kubernetes is smart enough to reuse old pods if a rollback is issued before all old pods have been terminated.

Improved platform reliability: Amazon EKS deployments include guardrails and monitoring, which makes rolling out new code safer because rolling updates automatically stall if new pods cannot start up. Combined with canary analysis, 67% of potential incidents are caught before they reach production. Canary deployments are now easy to create, and with automated canary analysis, humans get involved only when anomalies are detected, saving about six hours per week of manually inspecting dashboards.

Improved developer productivity: With Amazon EKS, individual service teams can now deploy their services independently, whereas previously only the Service Orchestration team could deploy services, removing a bottleneck in the release process. Letting service owners perform their own hotfixes with a few clicks removes friction and shortens the time from checking in a bug fix to it going live in production, a capability that has accelerated bug fixes several times in the past few quarters.

“The new platform on EKS has greatly improved the efficiency of the team. We were able to move to a faster cadence of releases, identify and resolve issues before they cause significant customer pain, and thereby improve our customer and developer experience.”

– Alexander Sorokin, Lead Architect, Airtable

What’s next

Looking ahead, Airtable wants to make deployments more agile by breaking up the current monolithic deployment and delegating each service's deployment schedule to the individual teams. This would allow development teams to push code to production independently, without involvement from the Service Orchestration team, and would further reduce the lead time from code check-in to production from days to hours.

Furthermore, Airtable is planning to fully automate the code deployment process and to tune canary deployments so that bugs are detected reliably and releases are automatically rolled back when overall site availability metrics degrade after an update. These improvements would eliminate the delays that come from waiting on a human response after an upgrade, delays that can deteriorate the user experience.

"Headshot

William Ho, Airtable

William Ho is a Software Engineer at Airtable working on Service Orchestration. Using past experience on infrastructure projects, William helps teams across the company migrate their services to Kubernetes and finds ways to make the service operation experience efficient and safe. Outside of work, he loves badminton, hiking with friends, and reading about new technology.