Amazon ECR in Multi-Account and Multi-Region Architectures
Introduction
Amazon Elastic Container Registry (Amazon ECR) is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere. It stores container images and artifacts that deploy application workloads across AWS services as well as non-AWS environments. Amazon ECR is a regional service, where each Region in each account is provisioned with a managed container registry instance. Amazon ECR integrates with AWS Identity and Access Management (AWS IAM) to enable multiple accounts to access a registry instance. However, this might not be the best approach for your AWS environment, because separating services and workloads across AWS accounts can help mitigate risk and provide boundaries between teams in your organization. Also, because Amazon ECR is a regional service, you may want to utilize Amazon ECR registries in multiple Regions to be closer to customers, or for disaster recovery scenarios. In this post, we cover key considerations for Amazon ECR architectures that span across AWS accounts and AWS Regions, and share some architectures related to hypothetical customer use cases.
Considerations for architecting across AWS accounts
Multi-account architectures are becoming increasingly common. AWS accounts and Organizational Units (OUs) in AWS Organizations provide security, access, and billing boundaries for your AWS resources. The Organizing Your AWS Environment Using Multiple AWS Accounts whitepaper goes into detail on the benefits of separating your environment into different accounts and OUs, and on best practices for doing so. For this post, we focus on two concepts that can help guide you in determining the right AWS account structure, and relate them to your container registries: grouping workloads based on ownership and business purpose, and separating workloads to limit the scope of impact from adverse events.
The goal of grouping workloads based on ownership and business purpose is to align your AWS environment with the responsibilities of the teams in your organization and the relationships among them. Grouping a team’s services together into an AWS account where they have decision making power removes unnecessary dependencies on other teams. For example, you can set up each application development team or developer with their own isolated sandbox environment to give them the freedom to test services without having to evaluate the impact to others or seek approval, as long as they are using services that are within overall guardrails for the OU. It’s also recommended to group workloads based on their business purpose, which means to group workloads that perform similar or related functions, so you can apply overarching policies with the appropriate controls. Your developers’ sandbox accounts have very different requirements than a platform security account or an account that automates delivery of application updates with a Continuous Integration/Continuous Delivery (CI/CD) pipeline, but the same requirements as other development accounts. Grouping them together in a Sandbox OU allows you to place controls across these similar accounts, rather than managing permissions at the account level.
From a risk management perspective, there are also workloads that you want to separate to limit the impact of adverse events. The primary example (and best practice) is to place your production workloads in their own account or OU. These should ideally be without any dependencies on non-production workloads, so that issues in non-production environments don’t impact your production workloads. You want stringent access policies on your production workloads that require that all changes to the environment be automated, to avoid human error from manual writes (e.g., deployments to production only happen through automated CI/CD pipelines).
Considerations for architecting across AWS Regions
Operating across multiple Availability Zones (AZs) in a Region is a best practice, but some customers with very high availability requirements can use multiple AWS Regions to provide further resiliency. Other customers may decide on a multi-Region approach to serve customers or development teams that are geographically dispersed. If you want to use a multi-Region approach, you need a deep understanding of your applications’ dependencies, volumes, and the nature of the work they perform, and you should also regularly test failover capabilities. Because Amazon ECR is a regional service, you have different endpoints for your registries in each Region (and each account). For that reason, many customers that use Amazon ECR across Regions build tooling (e.g., a mutating webhook that rewrites the registry endpoint in a Kubernetes deployment manifest) to automate pulling from different Amazon ECR endpoints. This is just one example of the complexities of multi-Region builds, and doesn’t consider other aspects like Domain Name System (DNS), networking configurations, or compute provisioning. However, if you have highly critical business applications or geographically dispersed teams and customers, then effectively building multi-Region applications will maximize resiliency and uptime.
How Different Stakeholders Use Amazon ECR
Amazon ECR is central to containerized applications, playing a role throughout the build and deployment processes. For that reason, different stakeholders interact with Amazon ECR to achieve their respective objectives. Some common stakeholders, and their respective uses for Amazon ECR, are described in the following sections. These stakeholders are not necessarily different people or teams. A startup may have a small team of developers that owns the entire stack, whereas a very large organization may have a different team performing each stakeholder’s role in support of hundreds of consumer teams. Note that we are defining the objectives for each stakeholder in this post, but these are not definitive rules, and you may choose to group these objectives differently in your organization. How these stakeholders’ responsibilities are divided among teams within your organization can help you decide how Amazon ECR should be configured in your AWS account architecture.
Security and Compliance
This stakeholder’s objective is to determine the security requirements for an environment, and implement capabilities to maintain those requirements. Core responsibilities include: establishing a permission structure for your Amazon ECR registry (ideally following the principle of least privilege), determining what type of encryption to use (e.g., AWS Key Management Service (AWS KMS)), selecting the vulnerability scanning tools to be used, and defining how vulnerability findings are remediated.
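For instance, the encryption decision can be made when a repository is created. The following is a sketch of a CreateRepository request body that enables KMS encryption; the repository name, account ID, and key ID are hypothetical placeholders:

```json
{
  "repositoryName": "library/amazon-linux",
  "encryptionConfiguration": {
    "encryptionType": "KMS",
    "kmsKey": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
  }
}
```

If `encryptionConfiguration` is omitted, Amazon ECR encrypts images at rest with server-side encryption using Amazon S3-managed keys by default.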
Many customers establish a repository of approved base images, or images that all your application images are built on top of (e.g., Ubuntu, Amazon Linux 2). Vetting and approving base images is a best practice for container security, and a recommendation from NIST Special Publication 800-190, because base images are underlying components in every container deployed in your environment. We refer to approved base images as library images in the rest of this post and provide example architectures.
Platform Infrastructure
The objective of a platform infrastructure team is to build and maintain shared services utilized by other development teams in order to improve system reliability (e.g., a shared Kubernetes cluster, or a shared container registry). With Amazon ECR, much of the undifferentiated heavy lifting of maintaining a container registry (i.e., patching and provisioning the underlying infrastructure) is offloaded to AWS. Therefore, platform teams can focus on other challenges, like how Amazon ECR’s replication is used as a mechanism to distribute container images to different Regions to support disaster recovery with backup Regions, or to decrease data transfer costs associated with pulling images across Regions. Platform teams could also oversee the lifecycle policies that are established in Amazon ECR repositories to automatically cull images, which mitigates storage costs, improves security by reducing the chances of building on old images that haven’t been patched, and simplifies operations for developers by reducing image clutter in their repositories.
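As a concrete illustration of culling images, the following is a minimal Amazon ECR lifecycle policy that expires untagged images 14 days after they are pushed (the description and day count are example values you would tune to your own retention needs):

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images older than 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
```

Lifecycle policies are evaluated per repository, so a platform team can apply stricter rules to development repositories than to production ones.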
Developer Tools
The objective of a Developer Tools team is to build and maintain tools that automate the software development lifecycle (SDLC): accelerating delivery, implementing quality standards, and minimizing manual errors. Typically, this is automated through an organization’s CI/CD pipelines. Container registries may not be considered a CI/CD tool, but they are imperative to building and deploying containers. These teams often work with Amazon ECR to build services that abstract certain elements away from developers, like the mutating webhook we referred to for multi-Region deployments. If the CI/CD pipeline has to deploy across Regions, the Developer Tools team could use Amazon ECR’s cross-Region replication feature to store images locally for lower latency pulls, to optimize for costs by reducing the data transfer costs of cross-Region pulls, or to replicate images into different accounts for those tenants to use in their builds.
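Replication is configured at the registry level. The following is a sketch of a replication configuration that copies images to another Region and to a second account; both destination values are hypothetical:

```json
{
  "rules": [
    {
      "destinations": [
        {
          "region": "us-west-2",
          "registryId": "111122223333"
        },
        {
          "region": "us-east-1",
          "registryId": "444455556666"
        }
      ]
    }
  ]
}
```

Note that cross-account replication also requires a registry permissions policy in the destination account that allows the source account to replicate into it.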
Developer Tools stakeholders also build the mechanisms that enforce the policies determined by Security stakeholders. For example, building a workflow with Amazon Inspector and Amazon EventBridge to scan images before promoting them to the next repository (e.g., from Development to Quality Assurance [QA], from QA to Production): promoting them if there are no critical Common Vulnerabilities and Exposures (CVEs), or flagging the image, moving it to a quarantine repository, and notifying the owning Application team if there are CVEs that need to be remediated.
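A workflow like this is typically triggered by an EventBridge rule matching Amazon Inspector findings. The following is a sketch of an event pattern that matches active critical findings (the exact `detail` fields to match on depend on the Inspector event schema and your promotion criteria):

```json
{
  "source": ["aws.inspector2"],
  "detail-type": ["Inspector2 Finding"],
  "detail": {
    "severity": ["CRITICAL"],
    "status": ["ACTIVE"]
  }
}
```

The rule’s target (e.g., an AWS Lambda function) would then move the affected image to the quarantine repository and notify the owning team.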
Application Teams
The Application stakeholders are the individuals building the applications. All of the stakeholders described in the previous sections exist to remove barriers and non-value-added work so these teams can focus on building their business logic. These stakeholders are often less involved with registry configuration, but still need to interact with the registry frequently. For example, to build the organization’s business logic, the Application teams must have the necessary images (e.g., library images) available to build on top of in their development environment, and have the right access to pull these images and push updated images to the registry (or to push to the CI/CD pipeline that then pushes to the registry) for testing and deployment. They have to be able to pull container images to local development environments to build and test images before pushing them into central registries or CI/CD pipelines.
Solution overview
Example Amazon ECR Architectures
So, we’ve covered some key concepts for deciding on the right AWS account structure for your organization. Now let’s look at some different examples for hypothetical companies that put these concepts together.
Small Organization with Full Stack Engineers
For the first example, let’s consider a small organization that has developers working across the entire stack. We’ll call this organization Scrappy LLC. Because the development team at Scrappy works across the entire stack, there are fewer organizational boundaries and fewer AWS accounts, all supported by a single, central Amazon ECR registry.
Walkthrough
To build its container images, the Scrappy team starts by using the pull through cache (PTC) feature in Amazon ECR to cache public images from ECR Public and Quay.io. Once cached in their registry, Scrappy’s CI/CD pipeline builds development images on top of these public images, which are tested in a development Amazon Elastic Container Service (Amazon ECS) cluster. Once tested, the images are promoted to a production repository from which its production Amazon ECS cluster can pull and run the images. Scrappy has also enabled enhanced image scanning with Amazon Inspector to perform vulnerability scanning on images in their registry, and these scan findings are aggregated in AWS Security Hub.
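A pull through cache rule maps a repository prefix in the private registry to an upstream public registry. The following is a sketch of a CreatePullThroughCacheRule request body for caching Quay.io images under a hypothetical `quay` prefix:

```json
{
  "ecrRepositoryPrefix": "quay",
  "upstreamRegistryUrl": "quay.io"
}
```

With this rule in place, pulling `<account-id>.dkr.ecr.<region>.amazonaws.com/quay/<image>` fetches the image from Quay.io on first use and serves it from the private registry cache thereafter.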
In the Amazon ECR portion at the top of the diagram, you can see that the Scrappy team uses repositories to separate its PTC, development, and production container images (the next example shows separating them with AWS accounts). Amazon ECR allows you to set permissions at the repository level with repository policies, so Scrappy can set different permissions for its PTC, development, and production repositories. Taking the PTC repository as an example, Scrappy needs to allow its developers to pull the cached images to build on top of, but doesn’t want anyone to be able to push images to this repository. Here is an example repository policy they could use to allow only the developer_role_name IAM role to pull images from the PTC repository.
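A sketch of such a repository policy is shown below. The role name comes from the text; the account ID is a hypothetical placeholder, and the actions listed are the standard Amazon ECR pull permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowDeveloperPullOnly",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/developer_role_name"
      },
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ]
    }
  ]
}
```

Because the policy grants only pull-related actions and names a single principal, no one can push images directly to the PTC repository through this policy.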
To mitigate risk to their production workloads, Scrappy placed its production Amazon ECS cluster in a separate account with more stringent access controls than its development account. Because the production workload is in another account, Scrappy is using Virtual Private Cloud (VPC) endpoints with Amazon ECR to securely pull images across accounts. Scrappy also set a repository policy on its production repository to allow its production workloads to pull images from the other account. This post shows the process to allow a secondary account to push or pull images from an Amazon ECR repository.
Large Organization with Separate Platform, Security, Developer Tools, and Developer Teams
For the next example, let’s consider an organization that has separate teams for platform infrastructure, security, developer tools, and application development. We call this hypothetical company Resourced, Inc. To organize its AWS environment according to ownership, we separated each Resourced team into its own AWS account.
In the Resourced Platform account, the Platform team set up PTC repositories to cache public images from Amazon ECR Public and Quay.io, and then scans these images for vulnerabilities with Amazon Inspector before pushing them into a repository configured to replicate them into the Developer Tools account. In the bottom left, we have a Security account where Amazon Inspector scan findings from all accounts are aggregated.
In the Developer Tools account, we have a CI/CD pipeline pulling from and pushing to Amazon ECR to build containers, and then scanning these containers before deploying to Amazon Elastic Kubernetes Service (Amazon EKS) clusters in Development accounts for Team A and Team B (Note: this structure would likely support dozens of development teams, as opposed to two; however, that makes for an unwieldy diagram). The Resourced Developer Tools team also built a process to quarantine any images that fail scan results to a separate Amazon ECR repository. Notice that all the Resourced accounts are utilizing VPC endpoints to securely transfer images across accounts.
On the right side of the architecture, we have our production workloads and their supporting CI/CD pipelines, which the Developer Tools team created to minimize the production workloads’ dependencies on the pipelines in their development environments. In the Production accounts, we have a local Amazon ECR registry to minimize the traffic going over our VPC endpoints, which are used throughout the architecture to securely transfer images from one account to another. Also, notice that this organization has a disaster recovery environment in the us-west-2 Region, with a replica CI/CD pipeline and production application environments to maximize availability. The Developer Tools team would utilize cross-Region replication to replicate production images from us-east-1 to the backup Region in us-west-2, to ensure that it has the necessary container images to run its applications in a disaster scenario.
Resourced has also placed its production workloads in a production OU and development accounts in a development OU in AWS Organizations to facilitate managing these accounts. Let’s say that Resourced is adding 30 development teams to be supported by this infrastructure, each with its own development and production account. With OUs, Resourced’s security team can set the security policy at the OU level once, and add the new accounts to the production or development OUs accordingly, rather than configuring the access controls on each new account individually.
In the Scrappy example, we referenced applying a repository policy to allow pushing to and pulling from Amazon ECR across AWS accounts. Amazon ECR repository policies also allow you to grant access based on OUs, rather than individual accounts, which removes the burden of having to add or remove individual accounts as they are created or decommissioned. Below is an example of setting an image pull policy, like the one we would need to allow the production workloads to pull from the Developer Tools Production registry, based on the OU ID (this post goes into detail on repository policies using OUs).
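The following is a sketch of such a policy, using the `aws:PrincipalOrgPaths` condition key to grant pull access to every account under a given OU. The organization ID, root ID, and OU ID shown are hypothetical placeholders you would replace with your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowPullFromProductionOU",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ],
      "Condition": {
        "ForAnyValue:StringLike": {
          "aws:PrincipalOrgPaths": [
            "o-a1b2c3d4e5/r-exampleroot/ou-exampleouid/*"
          ]
        }
      }
    }
  ]
}
```

The wildcard `Principal` combined with the organization-path condition means the policy matches any principal whose account sits under the specified OU, so new production accounts gain access simply by being moved into that OU.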
Conclusion
In this post, we showed you some key decision factors in deciding to split an AWS environment across AWS accounts and AWS Regions, and how different stakeholders use Amazon ECR. We then looked at two example Amazon ECR architectures: a simpler, single-registry example, and a more complex example from a large, distributed organization. We also provided some example repository policies for each organization to show how different sizes of organizations can use Amazon ECR’s repository policy feature to manage permissions according to their specific needs. As you work on deciding the right architecture for your environment, you should review the Organizing Your AWS Environment Using Multiple Accounts whitepaper, and you can refer to the Amazon ECR documentation as you start to configure your container registry according to your AWS account and Region structure.