Containers

How Snap Inc. secures its services with Amazon EKS

Introduction

Snapchat is an app that hundreds of millions of people around the world use to communicate with their close friends. The app is powered by microservice architectures deployed in Amazon Elastic Kubernetes Service (Amazon EKS) and datastores such as Amazon CloudFront, Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Amazon ElastiCache. This post explains how Snap builds its microservices, leveraging Amazon EKS with AWS Identity and Access Management. It also discusses how Snap protects its K8s resources with intelligent threat detection offered by GuardDuty that is augmented with Falco and in-house tooling to secure Snap’s cloud-native service mesh platform.

The following figure (Figure 1) shows the main data flow when users send and receive snaps. To send a snap, the mobile app calls Snap API Gateway that routes the call to a Media service that persists the sender message media in S3 (Steps 1-3). Next, the Friend microservice validates the sender’s permission to Snap the recipient user by querying the messaging core service (MCS) that checks whether the recipient is a friend (Steps 4-6), and the conversation is stored in Snap DB powered by DynamoDB (Steps 7-8). To receive a snap, the mobile app calls MCS to get the message metadata such as the pointer to the media file. It calls the Media microservice to load the media file from the content system that persists user data, powered by CloudFront and Amazon S3 (Steps 9-11)

naps end-to-end user data flow main data flow when users send and receive Snaps

Figure 1: Snaps end-to-end user data flow main data flow when users send and receive Snaps

The original Snap service mesh design included single tenant microservice per EKS cluster. Snap discovered, however, that managing thousands of clusters added an operational burden as microservices grew. Additionally, they discovered that many environments were underused and unnecessarily consuming AWS account resources such as IAM roles and policies. This required enabling microservices to share clusters and redefining tenant isolation to meet security requirements. Finally, Snap wanted to limit access to microservice data and storage while keeping them centralized in a network account meshed with Google’s cloud.

The following figure illustrates a Kubernetes-based microservice, Friends and Users, deployed in Amazon EKS or Google Kubernetes Engine (GKE). Snap users reach Snap’s API Gateway through Envoy. Switchboard, Snap’s mesh service configuration panel, updates Edge Envoy endpoints with available micro-services resources after deploying them.

Snap’s high mesh design

Figure 2 – Snap’s high level data flow

Bootstrap

The purpose of this stage is the preparation and implementation of a secure multi-cloud compute provisioning system. Snap uses Kubernetes clusters as a construct that defines an environment that hosts one or more microservices, such as Friends, and Users in the first figure. Snap’s security bootstrap includes three layers that include authentication, authorization, and admission control when designing a Kubernetes-based multi-cloud.

Snap uses IAM roles for Kubernetes service accounts (IRSA) to provision fine-grained service identities for microservices running in shared EKS clusters that allow access to AWS services such as Amazon S3, DynamoDB, etc. For operator access scoped to the K8s namespace, Snap built a tool to manage K8s RBAC that maps K8s roles to IAM, allowing developers to perform service operations following the principle of least privileges.

Beyond RBAC and IRSA, Snap wanted to impose deployment validations such as making sure containers are instantiated by approved image registries (such as Amazon Elastic Container Registry (Amazon ECR) or images built and signed by approved CI systems) as well as preventing containers from running with elevated permissions. To accomplish this, Snap built admission controller webhooks.

Build-time

Snap believes in empowering its engineers to architect microservices autonomously within K8s constructs. Snap’s goal was to maximize Amazon EKS security benefits while abstracting K8s semantics. Specifically, 1/ The security of the Cloud safeguarding the infrastructure that runs Amazon EKS, object-stores (Amazon S3), data-stores (KeyDB, ElastiCache, and DynamoDB), and the network that interconnects them. The security in the Cloud includes 2/ protecting the K8s cluster API server and etcd from malicious access, and finally, 3/ protecting Snap’s applications’ RBAC, network policies, data encryption, and containers.

Switchboard – Snap’s mesh service configuration panel

Snap built a configuration hub called Switchboard to provide a single control panel across AWS and GCP to create K8s clusters. Microservices owners can define environments in regions with specific compute types offered by cloud providers. Switchboard also enables service owners to follow approval workflows to establish trust for service-to-service authentication and specify public routes and other service configurations.

It allows service owners to manage service dependencies, traffic routes between K8s clusters. Switchboard presents a simplified configuration model based on the environments. It manages metadata, such as the service owner, team email, and on-call paging information.

Snap wanted to empower tenants to control access to microservice data objects (images, audio, and video) and metadata stores such as databases and cache stores so they deployed the data stores in separate data accounts controlled by IAM roles and policies in those accounts. Snap needed to centralize microservices network paths to mesh with GCP resources. Therefore, Snap deployed the microservices in a centralized account using IAM roles for service accounts that assume roles the tenants’ data AWS accounts.

The following figure shows how multiple environments (Kubernetes cluster) host three tenants using two different IRSA. Friends’ Service A can read and write to a dedicated DynamoDB table deployed in a separate AWS account. Similarly, MCS’ Service B can get and cache sessions or friends in ElastiCache.

Snap’s high mesh design

   Figure 3 – Snap’s low level architecture

One of Snap’s design principles was to maximize autonomy while maintaining their desired level of isolation between environments, all while minimizing operational overhead.

Snap chose Kubernetes service accounts as the minimal isolation level. Amazon EKS support for IRSA allowed Snap to leverage OIDC to simplify the process of granting IAM permissions to application pods. Snap also uses RBAC to limit access to K8s cluster resources and secure cluster users’ authentication. Snap considers adopting Amazon EKS Pod Identities to reuse associations when running the same application in multiple clusters. This is done by applying identical associations to each cluster without modifying the role trust policy.

Deployment-time

 Cluster access by human operators

AWS IAM users and roles are currently managed by Snap which generates policies based on business requirements. Operators use Switchboard to request access to their microservice. Switchboard map an IAM user to a cluster RBAC policy that grants access to Kuberentes objects. Snap is evaluating AWS Identity Center to allow Switchboard federate AWS Single Sign-On (SSO) with a central identity provider (IdP) for enabling cluster operators to have least-privilege access using cluster RBAC policies, enforced through AWS IAM.

Isolation strategy

Snap chose to isolate K8s cluster resources by namespaces to achieve isolation by limiting container permissions with IAM roles for Service Accounts, and CNI network policies. In addition, Snap provision separate pod identity for add-ons such as CNI, Cluster-AutoScaler, and FluentD. Add-ons uses separate IAM policies using IRSA and not overly permissive EC2 instance IAM roles.

Network partitioning

Snap’s mesh defines rules that restrict or permit network traffic between microservices pods with Amazon VPC CNI network policies. Snap wanted to minimize IP exhaustion caused by IPv4 address space limitations due to its massive scale. Instead of working around IPv4 limitations using Kubernetes IPv4/IPv6 dual-stack, Snap wanted to migrate gracefully to IPv6. Snap can connect IPv4-based Amazon EKS clusters to IPv6 clusters using Amazon EKS IPv6 support and Amazon VPC CNI.

Container hardening

Snap built admission controller webhook to audit and enforce pod security context to prevent containers from running with elevated permissions (RunAs) or accessing volumes at the cluster or namespace level. Snap validate that workloads don’t use configurations that break container isolation such as hostIPC, HostNetwork, and HostPort.

Snap’s admission controller service

Figure 4 – Snap’s admission controller service

Network policies

Kubernetes Network Policies enable you to define and enforce rules for traffic flow between pods. Policies act as a virtual firewall, which allows you to segment and secure your cluster by specifying network traffic rules for pods, namespaces, IP addresses, and ports. Amazon EKS extends and simplifies native support for network policies in Amazon VPC CNI and Amazon EC2, security groups, and network access control lists (NACLs) through the upstream Kubernetes Network Policy API.

Run-time

 Audit logs

Snap needs auditing system activities to enhance compliance, intrusion detection, and policy validation. This is to track unauthorized access, policy violations, suspicious activities, and incident responses. Snap uses Amazon EKS control plane logging that ingests API server, audit, authenticator, controller manager, and scheduler logs into CloudWatch. It also uses Amazon CloudTrail for cross-AWS services access and fluentd to ingest application logging to CloudWatch and Google’s operations suite.

Runtime security monitoring

Snap has begun using GuardDuty EKS Protection. This helps Snap monitor EKS cluster control plane activity by analyzing Amazon EKS audit logs to identify unauthorized and malicious access patterns. This functionality, combined with their admission controller events provides coverage of cluster changes.

For runtime monitoring, Snap uses the open source Falco agent to monitor EKS workloads in the Snap service mesh. GuardDuty findings are contextualized by Falco rules based on container running processes. This context helps to identify cluster tenants with whom to triage the findings. Falco agents support Snap’s runtime monitoring goals and deliver consistent reporting. Snap compliments GuardDuty with Falco to ensure changes are not made to a running container by monitoring and analyzing container syscalls (container drift detection rule).

Conclusion

Snap’s cloud infrastructure has evolved from running a monolith inside Google App Engine to microservices deployed in Kubernetes across AWS and GCP. This streamlined architecture helped improve Snapchat’s reliability. Snap’s Kuberentes multi-tenant vision needed abstraction of cloud provider security semantics such as AWS security features to comply with strict security and privacy standards. This blog reviewed the methods and systems used to implement a secure compute and data platform on Amazon EKS and Amazon data-stores. This included bootstrapping, building, deploying, and running Snap’s workloads. Snap is not stopping here. Learn more about Snap and our collaboration with Snap.