Three things to consider when implementing Mutual TLS with AWS App Mesh

NOTICE: October 04, 2024 – This post no longer reflects the best guidance for configuring a service mesh with Amazon ECS and Amazon EKS, and its examples no longer work as shown. For workloads running on Amazon ECS, please refer to newer content on Amazon ECS Service Connect, and for workloads running on Amazon EKS, please refer to Amazon VPC Lattice.

——–

Mutual Transport Layer Security (mTLS) is an extension of TLS in which the client and server both use X.509 digital certificates to authenticate each other before communication begins. Each party presents its certificate and validates the other’s. The key difference from one-way TLS is that every client must hold its own certificate for the TLS handshake. The client sends its certificate to the server, the server verifies it, and only then does the server grant access, as it would in a one-way TLS setup.
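To make the difference from one-way TLS concrete, here is a minimal sketch using Python’s standard ssl module rather than Envoy or App Mesh. The function name and the optional file paths are our own illustrative assumptions; the only point is that a mutual TLS server additionally requires and verifies a client certificate.

```python
import ssl


def build_server_context(require_client_cert, cert_file=None, key_file=None,
                         client_ca_file=None):
    """Return a server-side TLS context. All file paths are hypothetical and optional."""
    # Purpose.CLIENT_AUTH yields a context intended for the server side of a connection.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    if cert_file:
        # The server's own certificate, presented in both one-way and mutual TLS.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if require_client_cert:
        # Mutual TLS: the handshake fails unless the client presents a certificate
        # that chains to a trusted CA.
        context.verify_mode = ssl.CERT_REQUIRED
        if client_ca_file:
            context.load_verify_locations(cafile=client_ca_file)
    return context


one_way = build_server_context(require_client_cert=False)
mutual = build_server_context(require_client_cert=True)
print(one_way.verify_mode.name, mutual.verify_mode.name)
```

In a real deployment, you would pass the server certificate, private key, and the CA bundle used to validate clients; they are left out here so the sketch runs without any files.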

Mutual TLS is used when the server needs to verify the identity and validity of each client, often as part of a zero trust network. Common use cases include communication between IoT sensors and central servers (such as AWS IoT Core) and communication between microservices in highly regulated environments. Implementing mutual TLS involves several design decisions that need to be made upfront. You must decide how to manage certificates for the clients (because you tend to have more clients than servers), how long certificates should remain valid, and how to store the sensitive certificate material. In this blog post, we focus on the microservice use case within a container platform, such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). We discuss each major design decision when implementing mutual TLS, with AWS App Mesh providing a mechanism to secure and authenticate the workloads.

AWS App Mesh

When implementing mutual authentication in a container platform, it is common to initiate and terminate the connection with a network proxy running in a sidecar container. This is to provide standardization in an organization as well as to remove the burden of certificate management from the application team. However, maintaining the configuration and certificates within a large number of network proxies could become an operational burden. Therefore, service meshes are often used to manage the configuration of network proxies in a container platform.

AWS App Mesh is a service mesh that provides application-level networking to make it easy for your services to communicate with each other across multiple types of AWS compute infrastructure. To figure out how to build an mTLS-enabled service mesh with AWS App Mesh, we first need to understand its core concepts. The first thing you create is a mesh, which is a logical boundary for network traffic between the services that reside within it. After creating a mesh, you can create virtual gateways, virtual services, virtual routers, routes, and virtual nodes. For more information on these concepts and on getting started with App Mesh, please see the Getting Started Guide in the AWS App Mesh documentation.

The following diagram shows how these AWS App Mesh concepts can be used together to route external traffic to a container workload.

  • A virtual gateway allows resources that are outside of your mesh to communicate with resources that are inside of your mesh.
  • A virtual service is an abstraction of a real service that is provided by a virtual node directly or indirectly by means of a virtual router. Dependent services call your virtual service by its virtualServiceName, and those requests are routed to the virtual node or virtual router that is specified as the provider for the virtual service.
  • A virtual node acts as a logical pointer to a particular task group, such as an Amazon ECS service or a Kubernetes deployment. Any inbound traffic that your virtual node expects is specified as a listener.

You can enable mutual TLS authentication for all of the protocols supported by AWS App Mesh: TCP, HTTP/1.1, HTTP/2, and gRPC. Mesh endpoints such as VirtualNode and VirtualGateway are where you specify which certificate to use for session negotiation and which trusted authorities to use to validate the client’s certificate. So, where should we start in designing an mTLS-enabled App Mesh?
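As an illustration of where those two settings live, the sketch below builds a virtual node listener as a plain Python dictionary. The field names follow our understanding of the App Mesh VirtualNode spec (tls.certificate for the certificate the listener presents, tls.validation.trust for the authorities used to validate clients); the helper name and the file paths are hypothetical, and you should confirm the exact shape against the App Mesh API reference before relying on it.

```python
import json

# Protocol identifiers as we understand them from the App Mesh API
# (HTTP/1.1 is expressed as "http").
SUPPORTED_PROTOCOLS = {"tcp", "http", "http2", "grpc"}


def build_mtls_listener(port, protocol, cert_chain_path, private_key_path,
                        ca_bundle_path):
    """Build a listener that presents a file-based certificate and validates clients."""
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    return {
        "portMapping": {"port": port, "protocol": protocol},
        "tls": {
            # STRICT rejects connections that do not use TLS.
            "mode": "STRICT",
            # Certificate the Envoy proxy presents during session negotiation.
            "certificate": {
                "file": {
                    "certificateChain": cert_chain_path,
                    "privateKey": private_key_path,
                }
            },
            # Trusted authorities used to validate the client's certificate.
            "validation": {
                "trust": {"file": {"certificateChain": ca_bundle_path}}
            },
        },
    }


listener = build_mtls_listener(
    8080, "http",
    "/keys/server_cert_chain.pem",  # hypothetical path inside the Envoy container
    "/keys/server_key.pem",         # hypothetical path
    "/keys/ca_cert.pem",            # hypothetical path
)
print(json.dumps(listener, indent=2))
```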

Design decisions for an mTLS-enabled AWS App Mesh

There are three things that you should define before you start a mutual TLS deployment: 1) how you will generate the certificates, 2) how you will deliver those certificates to the network proxy container, and 3) how you will ensure that all the mTLS-required services are onboarded to the mesh.

First, you should define an organization-wide certificate authority. It can be your own CA management system or a managed service such as AWS Certificate Manager (ACM) Private CA. There are a couple of things that people miss when they choose which way to go. The first is how long you want the certificates to last. Certificates for internal “service-to-service” communication often have a shorter validity period than external “end user-to-service” certificates. To limit the impact of a leak, you might not want internal certificates to be valid for a full year. The second is ensuring that the certificates are only accessible to the right parties. If you decide to manage all of the certificate infrastructure yourself, ensure there is a level of access control to the certificates.
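The lifetime question can be turned into a simple check. The following sketch is our own illustration, not part of any AWS tooling; the 90-day policy and 30-day renewal window are arbitrary example values. It compares a certificate’s validity period against a maximum and decides when renewal is due.

```python
from datetime import datetime, timedelta, timezone


def exceeds_policy(not_before, not_after, max_validity):
    """True if the certificate is valid for longer than the policy allows."""
    return (not_after - not_before) > max_validity


def renewal_due(not_after, now, renew_window):
    """True once 'now' falls inside the renewal window before expiry."""
    return now >= not_after - renew_window


# Example values only: a certificate issued for 395 days, checked against a
# hypothetical 90-day internal policy and a 30-day renewal window.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=395)

print(exceeds_policy(issued, expires, max_validity=timedelta(days=90)))
print(renewal_due(expires, now=expires - timedelta(days=10),
                  renew_window=timedelta(days=30)))
```

In practice, the notBefore and notAfter timestamps would be read from the certificate itself rather than constructed by hand.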

Second, you should understand how the certificate management system delivers the certificates to the network proxy container, how you would rotate the certificate and how you would revoke the certificate if you had to. In some scenarios, you may have to manually handle the lifecycle of the certificates depending upon the certificate management system you choose. Manual intervention could dramatically increase the operational overhead of implementing mutual TLS. Therefore, it is critical to consider all scenarios during the decision phase.

Lastly, one of the most challenging parts of implementing mutual TLS is ensuring that nothing breaks when you enable it for a service. Identifying all of your internal and external clients can be difficult, so as part of migrating to mTLS, we advise running your network proxies in permissive mode for a while. This allows your services to communicate with and without client certificates during the migration. With the correct monitoring and policies in place, you can inspect: 1) whether all the mTLS-required services in the mesh have mTLS enabled and 2) whether the clients abide by the mutual TLS rules. We will dive deeper into these questions for both Amazon ECS and Amazon EKS.
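One way to track such a migration is to summarize the TLS mode of every listener in the mesh. The sketch below assumes you have already collected virtual node specs into a dictionary keyed by node name (for example, from the App Mesh API); the function name, the sample node names, and the choice to count a listener without a tls block as DISABLED are our own assumptions.

```python
from collections import Counter


def summarize_tls_modes(virtual_nodes):
    """Count listeners per TLS mode and list those that are not yet STRICT.

    virtual_nodes maps a node name to its spec dictionary.
    """
    summary = Counter()
    not_strict = []
    for name, spec in virtual_nodes.items():
        for listener in spec.get("listeners", []):
            # Assumption: a listener without a tls block is treated as DISABLED.
            mode = listener.get("tls", {}).get("mode", "DISABLED")
            summary[mode] += 1
            if mode != "STRICT":
                not_strict.append((name, mode))
    return summary, not_strict


# Hypothetical mesh partway through a migration.
nodes = {
    "frontend": {"listeners": [{"tls": {"mode": "STRICT"}}]},
    "orders": {"listeners": [{"tls": {"mode": "PERMISSIVE"}}]},
    "legacy": {"listeners": [{"portMapping": {"port": 80, "protocol": "http"}}]},
}
summary, not_strict = summarize_tls_modes(nodes)
print(dict(summary))
print(not_strict)
```

When the second list is empty, every listener enforces TLS, and the permissive phase of the migration can end.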

Enabling mTLS with AWS App Mesh in Amazon ECS

Let’s take a closer look at answering these questions for a workload running on Amazon ECS. This part of our blog post does not provide a walkthrough. Rather, we focus on the design considerations for AWS App Mesh when used with Amazon ECS. You can find a detailed walkthrough regarding how to set up mutual TLS with files or with ACM Private CA in this aws-samples GitHub repository.

Source of certificates and managing certificate lifecycle

AWS App Mesh uses the Envoy network proxy, running as a sidecar container, for its data plane. When using Amazon ECS, there are two ways to supply a certificate to the Envoy sidecar container. The first is file-based, meaning you mount your certificates as files within the Envoy container file system through volume mounts or shared storage. The second is AWS Certificate Manager Private Certificate Authority, where Envoy communicates with the ACM APIs to retrieve the server-side certificate. At this time, Envoy network proxies can retrieve only server certificates from ACM directly, not client certificates. Therefore, a custom process is required for Envoy to obtain client certificates from ACM. See an example in this walkthrough.
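The asymmetry between server and client certificates shows up in the virtual node configuration. The sketch below reflects our understanding of the App Mesh spec: a listener can reference an ACM certificate by ARN, while a backend client policy can trust an ACM PCA but must take its own client certificate from a file (or SDS). The helper names, ARNs, and paths are placeholders, and the field names should be checked against the App Mesh API reference.

```python
def build_acm_listener_tls(certificate_arn, client_ca_bundle_path, mode="STRICT"):
    """Listener TLS: server certificate from ACM, client validation from a file."""
    return {
        "mode": mode,
        "certificate": {"acm": {"certificateArn": certificate_arn}},
        # Trust bundle used to validate client certificates (file-based here).
        "validation": {
            "trust": {"file": {"certificateChain": client_ca_bundle_path}}
        },
    }


def build_client_policy(ca_arns, client_cert_chain_path, client_key_path):
    """Backend client policy: trust an ACM PCA, present a file-based client certificate."""
    return {
        "tls": {
            "enforce": True,
            # The client certificate cannot come from ACM directly, so it is
            # read from files placed in the Envoy container by a custom process.
            "certificate": {
                "file": {
                    "certificateChain": client_cert_chain_path,
                    "privateKey": client_key_path,
                }
            },
            "validation": {
                "trust": {"acm": {"certificateAuthorityArns": list(ca_arns)}}
            },
        }
    }


# Placeholder ARNs and paths for illustration only.
listener_tls = build_acm_listener_tls(
    "arn:aws:acm:us-west-2:111122223333:certificate/EXAMPLE",
    "/keys/client_ca.pem",
)
client_policy = build_client_policy(
    ["arn:aws:acm-pca:us-west-2:111122223333:certificate-authority/EXAMPLE"],
    "/keys/client_cert_chain.pem",
    "/keys/client_key.pem",
)
```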

If you choose AWS Certificate Manager Private Certificate Authority as your CA for mutual TLS, there are two additional things to consider. At the time of writing, AWS App Mesh does not support certificate rotation with ACM, and the shortest validity period we can set for ACM PCA certificates is 13 months, so you need to renew certificates manually if you want a shorter lifetime. Because Envoy cannot retrieve a client certificate from ACM directly, you may also need to restart the ECS task to pick up a new certificate, depending on the custom process you are using. You can find more information on using AWS Certificate Manager Private Certificate Authority with AWS App Mesh in this blog post, which uses AWS Secrets Manager to store the certificates and AWS Lambda to run the management jobs.
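One small building block of such a custom process is deciding whether a running task still holds the current certificate. The sketch below is a hypothetical helper, not taken from the referenced blog post: it compares SHA-256 fingerprints of the certificate a task loaded and the certificate currently in the store, and reports whether a restart is needed.

```python
import hashlib
import ssl


def fingerprint(pem_certificate):
    """SHA-256 fingerprint of a PEM-encoded certificate, computed over its DER bytes."""
    der = ssl.PEM_cert_to_DER_cert(pem_certificate)
    return hashlib.sha256(der).hexdigest()


def restart_needed(running_pem, stored_pem):
    """True if the task is running with a certificate other than the stored one."""
    return fingerprint(running_pem) != fingerprint(stored_pem)


# Stand-in PEM strings built from dummy bytes so the sketch runs without real
# certificates; real input would come from the task and from the secret store.
old_pem = ssl.DER_cert_to_PEM_cert(b"old certificate bytes")
new_pem = ssl.DER_cert_to_PEM_cert(b"new certificate bytes")

print(restart_needed(old_pem, new_pem))
print(restart_needed(new_pem, new_pem))
```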

Policy controls

Ensuring the required services within a container platform are enforcing mutual TLS requires strict policies and governance. This is because there are a few layers to enforce. The first is ensuring all required services are on the mesh. The second is ensuring all clients inside and outside the mesh follow your mutual TLS policies. In Amazon ECS, you can leverage automation as part of your CI/CD pipeline to ensure the mesh configuration has been applied. For further details, see this blog on injecting Envoy proxies with CodePipeline.
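As a sketch of what such a pipeline check might look like, the function below inspects an ECS task definition (as a dictionary) and reports whether it includes an Envoy sidecar and an App Mesh proxy configuration. The containerDefinitions and proxyConfiguration fields are part of the ECS task definition format; the function name, the default container name "envoy", and the sample image names are our own assumptions.

```python
def mesh_violations(task_definition, envoy_container_name="envoy"):
    """Return a list of reasons why a task definition is not onboarded to the mesh."""
    violations = []
    container_names = {
        c.get("name") for c in task_definition.get("containerDefinitions", [])
    }
    if envoy_container_name not in container_names:
        violations.append("no Envoy sidecar container")
    proxy = task_definition.get("proxyConfiguration") or {}
    if proxy.get("type") != "APPMESH":
        violations.append("proxyConfiguration type is not APPMESH")
    elif proxy.get("containerName") != envoy_container_name:
        violations.append("proxyConfiguration does not point at the Envoy container")
    return violations


# Hypothetical task definitions: one onboarded, one not.
onboarded = {
    "containerDefinitions": [
        {"name": "app", "image": "example/app:1.0"},
        {"name": "envoy", "image": "example/envoy-placeholder"},
    ],
    "proxyConfiguration": {"type": "APPMESH", "containerName": "envoy"},
}
not_onboarded = {"containerDefinitions": [{"name": "app", "image": "example/app:1.0"}]}

print(mesh_violations(onboarded))
print(mesh_violations(not_onboarded))
```

A pipeline stage could fail the deployment whenever the returned list is not empty.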

Enabling mTLS with AWS App Mesh in Amazon EKS

When implementing mTLS with AWS App Mesh for a workload running on Amazon EKS, the key considerations stay the same. However, there are some additional tools from the Kubernetes ecosystem that you can use.

Source of certificates and managing the lifecycle

When running AWS App Mesh on Kubernetes, there are two ways to provide a certificate to the Envoy sidecar container. The first is file-based, where a certificate is mounted into the file system of the Envoy container through Kubernetes Secrets, volume mounts, or shared storage. The second is Envoy’s Secret Discovery Service (SDS) API through SPIRE, an open-source framework capable of bootstrapping and issuing identity to services.

When using file-based certificates, an automated process would have to be created to distribute the certificates. After generating the certificates, you could store the certificates as Kubernetes Secrets. Kubernetes Secrets can be mounted into the Envoy container by the App Mesh Controller for Kubernetes. It is also worth designing the process to update the Kubernetes Secrets with new certificates when the existing certificates expire or need to be revoked.
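The storage step of that automated process could look like the following sketch, which builds a Kubernetes Secret manifest of type kubernetes.io/tls as a Python dictionary. The function name, Secret name, and namespace are hypothetical, and the PEM contents are dummy bytes; applying the manifest and mounting it into the Envoy container are left to your tooling and the App Mesh Controller.

```python
import base64


def build_tls_secret(name, namespace, cert_pem, key_pem, ca_pem=None):
    """Build a kubernetes.io/tls Secret manifest from PEM-encoded bytes."""
    def b64(raw):
        # Secret data values must be base64-encoded strings.
        return base64.b64encode(raw).decode("ascii")

    data = {"tls.crt": b64(cert_pem), "tls.key": b64(key_pem)}
    if ca_pem is not None:
        # Optional CA bundle so the proxy can validate its peers.
        data["ca.crt"] = b64(ca_pem)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/tls",
        "data": data,
    }


# Dummy material for illustration; real values come from your CA.
secret = build_tls_secret(
    "orders-tls", "prod",
    cert_pem=b"dummy certificate",
    key_pem=b"dummy private key",
    ca_pem=b"dummy ca bundle",
)
print(sorted(secret["data"]))
```

Rotation then becomes a matter of regenerating this manifest with new material and updating the Secret before the old certificate expires.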

Envoy Secret Discovery Service (SDS) is a built-in mechanism that enables Envoy to fetch TLS certificates or secrets from remote sources. SPIRE, which is an implementation of the Secure Production Identity Framework for Everyone (SPIFFE) project and one of the most popular options for dynamic workload identity management, can be configured as an SDS provider for Envoy. Envoy dynamically fetches certificates, allowing SPIRE to handle the lifecycle management and a lot of the heavy lifting for you. A SPIRE server can act as a root certificate authority, or you can use an upstream CA, such as ACM PCA, by integrating the SPIRE server through an upstream CA plugin. For more information on leveraging SPIRE for mTLS with AWS App Mesh, see this blog post.
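With SDS, the listener configuration refers to secrets by name instead of by file path, and with SPIRE those names are SPIFFE IDs. The sketch below shows our understanding of how that looks in a virtual node listener: the workload’s own SPIFFE ID names its certificate, the trust domain names the trust bundle, and subject alternative names restrict which clients are accepted. The trust domain, workload names, and helper names are made up for illustration.

```python
def spiffe_id(trust_domain, *path_segments):
    """Compose a SPIFFE ID such as spiffe://example.org/orders."""
    return "spiffe://" + trust_domain + "".join("/" + s for s in path_segments)


def build_sds_listener_tls(trust_domain, workload, allowed_clients, mode="STRICT"):
    """Listener TLS settings that fetch the certificate and trust bundle over SDS."""
    return {
        "mode": mode,
        # Certificate for this workload, identified by its SPIFFE ID.
        "certificate": {"sds": {"secretName": spiffe_id(trust_domain, workload)}},
        "validation": {
            # Trust bundle for the whole trust domain.
            "trust": {"sds": {"secretName": spiffe_id(trust_domain)}},
            # Only clients presenting one of these identities are accepted.
            "subjectAlternativeNames": {
                "match": {
                    "exact": [spiffe_id(trust_domain, c) for c in allowed_clients]
                }
            },
        },
    }


# Hypothetical trust domain and workloads.
tls = build_sds_listener_tls("example.org", "orders", ["frontend"])
print(tls["certificate"]["sds"]["secretName"])
```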

Policy controls

Open Policy Agent (OPA) provides policy-based control for cloud-native environments, and Gatekeeper is an OPA subproject for Kubernetes. You can deploy OPA Gatekeeper to enforce the mutual TLS policies for all the required services. Gatekeeper acts as a doorkeeper: whenever a new resource is created, it checks that the resource carries the required certificates and client policies. If you are interested, this blog post gives a good overview of using Gatekeeper in Amazon EKS. As an alternative, you can use Kyverno, another policy engine whose policies are written as native Kubernetes resources rather than in a separate policy language. If you are interested in learning more about using Kyverno in Amazon EKS, this blog will be helpful.
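To show the kind of rule such a policy engine would evaluate, the sketch below expresses the check in Python rather than in Gatekeeper’s Rego or a Kyverno policy. It inspects a VirtualNode manifest and reports listeners that are not in STRICT mode or do not validate clients, as well as a missing or unenforced backend client policy. The field paths follow our understanding of the App Mesh VirtualNode custom resource, and the function name and sample manifests are hypothetical.

```python
def virtual_node_violations(manifest):
    """Return mutual TLS policy violations for a VirtualNode manifest dictionary."""
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    spec = manifest.get("spec", {})
    violations = []
    for listener in spec.get("listeners", []):
        port = listener.get("portMapping", {}).get("port")
        tls = listener.get("tls") or {}
        if tls.get("mode") != "STRICT":
            violations.append(f"{name}: listener {port} is not in STRICT TLS mode")
        elif "validation" not in tls:
            violations.append(f"{name}: listener {port} does not validate client certificates")
    client_tls = spec.get("backendDefaults", {}).get("clientPolicy", {}).get("tls")
    # Assumption: an omitted 'enforce' field means TLS is enforced.
    if not client_tls or client_tls.get("enforce", True) is False:
        violations.append(f"{name}: backend client policy does not enforce TLS")
    elif "certificate" not in client_tls:
        violations.append(f"{name}: backend client policy presents no client certificate")
    return violations


# Hypothetical manifests: one compliant, one missing all TLS settings.
compliant = {
    "metadata": {"name": "orders"},
    "spec": {
        "listeners": [{
            "portMapping": {"port": 8080, "protocol": "http"},
            "tls": {"mode": "STRICT", "certificate": {}, "validation": {}},
        }],
        "backendDefaults": {"clientPolicy": {"tls": {"certificate": {}, "validation": {}}}},
    },
}
bare = {
    "metadata": {"name": "legacy"},
    "spec": {"listeners": [{"portMapping": {"port": 80, "protocol": "http"}}]},
}
print(virtual_node_violations(compliant))
print(virtual_node_violations(bare))
```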

Summary

In this blog post, we walked you through the questions you should ask before implementing mutual TLS with AWS App Mesh for workloads running on both Amazon ECS and Amazon EKS. First, to generate and manage certificates, you have to decide upon a certificate source. This could be a self-managed certificate authority or a managed service such as AWS Certificate Manager. To make this decision, you should consider the operational overhead, the ability to automate, and certificate requirements (such as the validity period of the certificates). Second, to reduce the operational overhead, you need to understand how certificates are distributed and how you would renew or revoke them if necessary. AWS Lambda and AWS Secrets Manager can help with this. Finally, you should come up with a way to ensure that all the required containers follow your mutual TLS policy, which could be accomplished with your own mechanism in Amazon ECS and with Gatekeeper in Amazon EKS.

For additional information, we recommend reading through the following blog posts: