AWS HPC Blog

How Amazon optimizes their supply chain with help from AWS Batch

This post was contributed by Michael Stalcup, Software Development Engineer at Amazon, and Angel Pizarro, Principal Developer Advocate at AWS.

Amazon’s retail business, with its millions of products coming from hundreds of thousands of sellers, is in a constant state of optimization. It’s no surprise, then, that there is a team whose core mission is to provide optimal inventory distribution recommendations – balancing trade-offs given demand, transportation costs, and delivery speed. Recently, they undertook an effort to move from a bespoke orchestration application for building and running containerized ML models to AWS Batch.

We talked with Michael Stalcup, Software Development Engineer on the Amazon fulfillment team, about the challenges his group faced that led to their migration to AWS Batch.

Let’s start at the beginning: What business challenge does your team solve for, and how do they do that?

Our team provides outbound transportation network planning recommendations, optimizing for total cost.

In particular, my team supports a web application where scientists can easily configure and run a sequence of models. Our application orchestrates data file transfers and the compute resources needed to execute those models.

Figure 1 – A subsection of the Step Functions workflow showing the multi-step process of scaling EC2 instances up and down to run the models for a single job request.

So in simple terms, your team solves a classic computer science problem – the “traveling salesman” – to optimize for both fast delivery times and lower costs, which is a pretty important part of e-commerce. Let’s dive into some technical details. You mentioned to me that your team recently moved your infrastructure for running these models to AWS Batch.

What did the system architecture look like before?

Before Batch, we were orchestrating the Amazon Elastic Compute Cloud (Amazon EC2) instance lifecycle ourselves using AWS Step Functions. After deploying a Docker image to our AWS account, users could start a model run using our home-grown web application. We would handle launching a dedicated EC2 instance for them, installing and running their code, terminating the instance, handing off output files, and recording metrics. Figure 1 shows an approximation of the steps involved, zooming in on a single model run’s EC2 management.

As the size of the science team grew, we went from running hundreds of models per day to thousands. When thousands of runs started within minutes, we hit request limits on AWS services, exhausted instance availability, and ran up against account-level quotas. We increased our account quotas to temporarily handle the extra load, but we knew we needed to reduce or throttle our AWS service requests (like the ones for launching or terminating instances). That’s where AWS Batch was extremely helpful.

Scaling from an initial architecture is a common challenge for all our customers, and I’m sure more than a few readers can sympathize with that situation. Can you talk about your new architecture and how the different components solve your scaling challenge?

Because Step Functions has a synchronous integration with Batch, and because users’ models were already packaged as Docker images, we were able to replace the old orchestration with a single step: Submit Batch Job.

This gave us a few immediate scalability benefits:

  1. We no longer call EC2 directly to launch and terminate instances – instead, Batch manages an Auto Scaling group for us. That means no more request limit issues on those EC2 calls.
  2. We no longer have to launch one instance per model run – multiple jobs can now run on the same Batch instance, which reduced our peak instances in use (measured during load testing) by up to 50%.
  3. The job queue gives us a higher quality of service – if instances are unavailable for a compute environment, jobs remain in the RUNNABLE state until capacity becomes available, rather than failing immediately.
  4. We were able to reduce our system’s complexity – because Step Functions can synchronously wait on a Batch job, we no longer need a callback to Step Functions.
Figure 2 – The revised Step Functions workflow replaced the multi-step process in Figure 1 with a single step, leveraging AWS Batch to handle all aspects of scaling EC2 resources for running models.

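The single step in Figure 2 relies on Step Functions’ synchronous (.sync) integration with Batch, which submits a job and pauses the workflow until that job finishes. As a rough sketch, a task state in the state machine definition could look like the following; the state name, queue, job definition, and input fields are placeholders rather than our actual configuration:

```json
"SubmitBatchJob": {
  "Comment": "Illustrative only: queue, job definition, and input fields are placeholders",
  "Type": "Task",
  "Resource": "arn:aws:states:::batch:submitJob.sync",
  "Parameters": {
    "JobName.$": "$.modelRunId",
    "JobQueue": "model-job-queue",
    "JobDefinition": "model-job-definition",
    "ContainerOverrides": {
      "Command.$": "$.command"
    }
  },
  "End": true
}
```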

To quote Andy Jassy, “Nice.” How is the new system performing?

Before Batch, we observed load tests failing at 2,000 concurrently launched model executions through our system. After migrating to Batch, our application consistently handles our 2,000-concurrency load test, and we’re working toward our goal of 5,000 concurrent model runs without failure by the middle of this year. We’re already making progress toward that goal, and thanks to Batch, there are no longer any compute bottlenecks in our system.

That’s a great result! What are some of the lessons you all learned along the way that our readers can benefit from?

Design for scale from the start
It’s impossible to know the future, but any design should consider scaling for future demand. When we built our service in 2019, Batch was already available. We should have done our research and at least considered Batch or Amazon Elastic Container Service (Amazon ECS) at that time; it would have saved us a migration effort later.

Have a precise plan for how to reserve memory/vCPUs for different job types
Before Batch, we let users choose the dedicated instance type they wanted their model to run on. This was good for experimentation, but it’s not scalable. With Batch, we have more granular control over the memory and vCPUs required at the job definition level, which allows us to monitor resource utilization and optimize it. But there are still a few factors that made this complicated for us:

  1. Because we need to run a variety of ever-changing [but largely memory-intensive] models, it’s difficult to keep resource allocations efficient and up to date.
  2. Under-allocating resources is risky because it may lead to jobs failing during execution (or worse – not even starting).
  3. Over-allocating resources is costly.

We have good job-level resource monitoring thanks to CloudWatch Container Insights, but we still have work to do to optimize resources across all jobs.
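
To show what that job-definition-level control looks like, here is a minimal sketch in Python with boto3; the job definition name, container image, and resource values are illustrative assumptions, not our production settings:

```python
import boto3

batch = boto3.client("batch")

# Illustrative job definition: name, image, and values are placeholders.
batch.register_job_definition(
    jobDefinitionName="model-run-medium",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/model:latest",
        "command": ["python", "run_model.py"],
        # Resources are reserved per job: under-allocating risks the job failing,
        # while over-allocating wastes capacity on the shared instance.
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "30720"},  # MiB
        ],
    },
)
```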

Know what instance types you need, and keep things simple
Batch has a service quota of 50 compute environments and 50 job queues, so it’s best to keep these as simple as your needs allow. During migration, we initially created ~20 compute environments to match all the instance types our users had configured and make the change seamless for them. Later, we realized this wasn’t sustainable, and also wasn’t necessary: it’s actually more efficient to let multiple jobs run on each instance. That allowed us to make better use of resources and reduce how often we scale our clusters.

After that realization, we drastically simplified the compute environments our users’ jobs run on by putting them all into six “buckets”: R-type and Z-type instances in SMALL, MEDIUM, and LARGE sizes.

This configuration reduced the mental load for our users, but still gave them the flexibility of being able to request more capable instances if they think they need it. Additionally, it lets us group more jobs into similar compute environments so that instances are reused more often, lowering costs while increasing throughput.
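
To illustrate the idea of a “bucket”, a managed compute environment can be scoped to an instance family and a vCPU range, and Batch then chooses instance sizes within it and packs jobs onto them. The sketch below is hypothetical (names, limits, subnets, and roles are placeholders):

```python
import boto3

batch = boto3.client("batch")

# One hypothetical "bucket": a memory-optimized (R family) environment of medium size.
batch.create_compute_environment(
    computeEnvironmentName="r-family-medium",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 2048,                 # the "size" of the bucket
        "instanceTypes": ["r5"],          # a whole family, not a single instance size
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-bbbb2222"],
        "instanceRole": "ecsInstanceRole",
    },
)
```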

Consider the risks of running multiple jobs on the same instance
While moving to Batch certainly solved a lot of challenges, it also introduced some. Before Batch, each job ran on a dedicated instance with 100 GB of storage. With Batch, jobs can be placed together on shared instances if they’re assigned to the same queue. This caused disk space issues for some of our jobs: instead of having a dedicated 100 GB of storage, they were sharing it with any other job(s) assigned to the same instance. We solved this by simply increasing the size of the EBS volume attached to our instances to accommodate the maximum potential storage needed.
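
One common way to do that with managed compute environments is to point them at an EC2 launch template that requests a larger root volume. The sketch below is an assumption-laden illustration: the template name and volume size are placeholders, and the /dev/xvda device name depends on the AMI you use.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical launch template with a larger root volume; /dev/xvda assumes an
# Amazon Linux 2 ECS-optimized AMI.
ec2.create_launch_template(
    LaunchTemplateName="batch-large-root-volume",
    LaunchTemplateData={
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/xvda",
                "Ebs": {"VolumeSize": 500, "VolumeType": "gp3", "DeleteOnTermination": True},
            }
        ]
    },
)

# A Batch compute environment can then reference it through
# computeResources={"launchTemplate": {"launchTemplateName": "batch-large-root-volume"}, ...}
```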

Another fallout from shared tenancy is the potential for jobs to overwrite each other’s data. You should audit whether access to your data needs to be restricted. Then, either make sure you don’t mix jobs on the same instance that shouldn’t be able to access each other’s data, or set up restrictions to enforce this as needed.

Looking forward, we’re excited by the benefits that Batch has given us already, and recommend it to anyone looking to run Dockerized jobs at scale. Next, we’re planning to use Fargate compute via Batch, and to use job queue priority and fair share scheduling to let the most important jobs run first.
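
As a preview of the fair share piece, Batch exposes it through a scheduling policy that a job queue references; jobs then submit with a matching share identifier. A minimal, hypothetical sketch (the names and weights are not ours) might look like this:

```python
import boto3

batch = boto3.client("batch")

# Hypothetical scheduling policy: a lower weightFactor gives that share identifier
# a larger portion of the queue's capacity.
batch.create_scheduling_policy(
    name="model-run-priorities",
    fairsharePolicy={
        "shareDecaySeconds": 3600,
        "shareDistribution": [
            {"shareIdentifier": "critical", "weightFactor": 0.5},
            {"shareIdentifier": "standard", "weightFactor": 1.0},
        ],
    },
)

# Jobs submitted to a queue that uses this policy pass shareIdentifier (and optionally
# schedulingPriorityOverride) in their SubmitJob request.
```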

Conclusion

Our thanks to Michael for this conversation and his insights into how Amazon solves its own challenges.

If you want to learn more about how AWS Batch can help you scale your processes, you can read the AWS Batch User Guide, or you can dive right in, log into the AWS Management Console, and get started today!

Michael Stalcup

Michael is a Software Development Engineer on the Fulfillment Optimization team. He has been at Amazon for three years, focusing mainly on improving back-end operations and tech scaling. Besides coding, he enjoys writing and recording music, traveling, and hiking.

Angel Pizarro

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.