AWS for Industries
Reduce semiconductor design costs using Amazon EC2 Spot and Exostellar Infrastructure Optimizer
Semiconductor chip design is a compute-intensive, expensive process. Exostellar is an AWS partner that developed Exostellar Infrastructure Optimizer to allow chip designers to minimize the risk of using Amazon EC2 Spot Instances for long-running, stateful electronic design automation (EDA) jobs. Amazon EC2 Spot Instances can provide up to 90% discounts compared to On-Demand pricing. Spot Instances are unused EC2 capacity and can be reclaimed with a two-minute warning.
In this post, we discuss how Exostellar Infrastructure Optimizer enables chip designers to reduce costs by executing long-running, stateful EDA jobs on Amazon EC2 Spot Instances.
The challenges of semiconductor design in the cloud
Semiconductor chips power most of the electronic devices we use daily. Phones, computers, cars, appliances, and TVs use chips that contain devices measured in nanometers (billionths of a meter) and perform billions of calculations per second. Engineers design these chips using advanced electronic design automation (EDA) tools that ensure that the final products meet their requirements and perform reliably. The tools require vast amounts of compute and storage over extended periods of time. This creates two key challenges.
The first challenge is getting the capacity required to run all of the EDA tools. Each new generation of chips requires exponentially more compute. The scale of the AWS Cloud can help chip designers meet their constantly growing compute and storage requirements. Even when they have on-premises infrastructure, AWS allows them to scale with additional capacity when needed.
The second challenge is to manage the costs of running the EDA tools on Amazon EC2 instances. AWS provides different pricing models to help customers manage their costs. Amazon EC2 On-Demand Pricing lets you pay per second for workloads that are short-term, spiky, or unpredictable with no up-front costs or long-term commitments. Amazon EC2 Reserved Instances (RIs) and Savings Plans (SPs) can help you reduce your bill by up to 72% compared to On-Demand prices in exchange for a commitment to a consistent amount of usage (measured in $/hour) for a 1- or 3-year term. Amazon EC2 Spot Instances can provide up to 90% discounts compared to On-Demand pricing. Spot Instances are unused EC2 capacity and can be reclaimed with a two-minute warning. This makes Spot Instances cost-optimized for various stateless, fault-tolerant, or flexible applications that can run on heterogeneous hardware.
Many EDA workloads, such as block-level simulations, are short running and not mission critical, so they run cost-effectively on Spot Instances. Other EDA jobs, like place and route and timing analysis, are long-running, stateful, and mission critical. They are also not fault-tolerant, so On-Demand Instances have been the preferred choice for running these workloads. Engineers run these jobs mostly near the ends of projects, and even then only intermittently. Savings Plans require an hourly commitment over 1 or 3 years that you pay whether you use the instances or not. If you have sustained usage, they will save you money; if you don’t, On-Demand Instances are the more cost-effective option. You can use the HPC Cost Simulator to analyze historical job information and get a recommendation on how to minimize your EC2 costs using RIs and SPs. AWS Compute Optimizer can also recommend instance types that help reduce your On-Demand costs.
Exostellar Infrastructure Optimizer
Exostellar created Infrastructure Optimizer to help customers optimize their cloud infrastructure costs by using Spot Instances for jobs that would normally need to run on On-Demand Instances. It has two key features that enable this: live migration and its own artificial intelligence (AI) advisory service.
Infrastructure Optimizer runs applications in nested virtual machines (VMs) that can be live migrated across the network between instances. The processor state, memory contents, local EBS storage, IP address, EBS root volume, and network and storage connections are all preserved during the migration with minimal performance impact or disruption to the application. Applications don’t require any modifications to run in the Infrastructure Optimizer VM. The amount of time for a live migration depends on the application’s memory footprint and the available network bandwidth between the source VM and the destination VM. For example, an m7i.48xlarge has 50 Gbps network bandwidth and 768 GiB of memory. Copying that much memory over the network at the full 50 Gbps takes at least 132 seconds (about 2.2 minutes). However, if only 5 Gbps is usable through the network interface, this grows to about 1,319 seconds (about 22 minutes). The time will also increase if the application is also using network bandwidth.
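The dependence of migration time on memory footprint and usable bandwidth can be estimated with simple arithmetic. The following Python sketch is a back-of-the-envelope lower bound only: it assumes the whole memory is copied exactly once at the usable line rate, and the function name is illustrative rather than part of any Exostellar API. Real migrations take longer because of protocol overhead, pages that change during the copy, and competing application traffic.

```python
GIB = 2**30  # bytes per GiB


def memory_transfer_seconds(memory_gib: float, usable_gbps: float) -> float:
    """Lower bound on the time to copy a VM's memory once over the network.

    Counts only the raw transfer at the usable line rate. Re-sent dirty pages
    and protocol overhead are ignored, so real migrations take longer.
    """
    if usable_gbps <= 0:
        raise ValueError("usable_gbps must be positive")
    bits = memory_gib * GIB * 8
    return bits / (usable_gbps * 1e9)


if __name__ == "__main__":
    # 768 GiB is the memory size of an m7i.48xlarge; 50 and 5 Gbps are the
    # full and a degraded usable bandwidth, respectively.
    for gbps in (50, 5):
        seconds = memory_transfer_seconds(768, gbps)
        print(f"{gbps:>2} Gbps: {seconds:,.0f} s ({seconds / 60:.1f} min)")
```

Because the estimate scales linearly with memory and inversely with bandwidth, jobs with smaller memory footprints leave much more margin for migrating ahead of a Spot interruption.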
Exostellar Infrastructure Optimizer’s AI advisory service uses signals from AWS such as rebalance recommendations to predict Spot terminations far enough in advance to provision a new instance and live migrate the application to the new instance. The new instance can be a different instance type, in a different AZ, or even an On-Demand instance. This improves the application reliability of even very long (up to 30 days), stateful EDA jobs on Spot Instances. Exostellar’s innovative approach allows semiconductor companies to run cost-optimized EDA workloads on EC2 Spot Instances and take advantage of the operating scale of AWS.
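This post does not describe how the advisory service works internally, so the following Python sketch only shows how the public AWS signals can be read from inside a Spot Instance. It assumes the standard instance metadata paths for rebalance recommendations and interruption notices, accessed with IMDSv2. The `classify` helper and its status labels are hypothetical, and the logic is purely reactive, not the predictive approach Exostellar uses.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

IMDS = "http://169.254.169.254/latest"
REBALANCE_PATH = "events/recommendations/rebalance"
INTERRUPTION_PATH = "spot/instance-action"


def imds_get(path: str, timeout: float = 1.0) -> Optional[str]:
    """Fetch a metadata path using IMDSv2; return None if it does not exist."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no notice has been issued for this instance
            return None
        raise


def classify(rebalance_json: Optional[str], action_json: Optional[str]) -> str:
    """Map the two signals to a hypothetical status label."""
    if action_json is not None:
        json.loads(action_json)  # validate the two-minute interruption notice
        return "interruption-notice"
    if rebalance_json is not None:
        json.loads(rebalance_json)  # validate the rebalance recommendation
        return "rebalance-recommended"
    return "ok"
```

On a running Spot Instance, `classify(imds_get(REBALANCE_PATH), imds_get(INTERRUPTION_PATH))` would return one of the three labels. A rebalance recommendation can arrive earlier than the two-minute notice, which is what makes it useful as an early trigger for provisioning a replacement instance.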
Maximizing Application Reliability and Best Practices
Infrastructure Optimizer enables customers to achieve over 99.5% reliability when running applications on Spot Instances. However, it cannot guarantee 100% reliability. It must predict a Spot termination far enough in advance to provision a new instance and migrate the application. If the Exostellar advisory service predicts a Spot termination too late, the application will fail when the Spot Instance terminates. If Infrastructure Optimizer cannot provision a new instance for the migration because of insufficient capacity, the application will also fail.
Note that EC2 terminates Spot Instances (with a two-minute warning) when On-Demand usage increases and shrinks excess capacity. These risks can be mitigated, and the reliability of the application improved, by following best practices for Amazon EC2 Spot. Spot refers to the instances of a specific instance type in an AZ of a Region as a Spot capacity pool. You need to be as flexible as possible in selecting the capacity pools for your application. For example, for an application that requires memory optimized instances, you may prefer to run on r7i instances. You can reduce costs and increase reliability by also configuring r5, r5d, r6i, r6id, r7iz, x2idn, x2iedn, and x2iezn instance types. Spot placement scores can identify the best capacity pools for availability of those instance types. By increasing the number of Spot capacity pools your application can use, you increase the likelihood of successful workload completion, and decrease costs by enabling more Spot usage.
Currently, Infrastructure Optimizer only supports the x86 architecture and not the Graviton family of instances. Additionally, it cannot migrate applications between Intel and AMD processors. You can configure separate Infrastructure Optimizer queues for Intel and AMD instance types, and you must choose which you will use when you submit a job. You can use Spot placement scores to help you choose the configuration with the highest available Spot capacity.
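As a minimal sketch of the Spot placement score lookup mentioned above, the following Python code builds a request for the EC2 `GetSpotPlacementScores` API through boto3 and ranks the returned pools. The instance sizes, target capacity, Region names, and helper names are assumptions chosen for illustration. The instance list is kept Intel-only to match the constraint that jobs cannot migrate between Intel and AMD processors. Pagination of the response is omitted for brevity.

```python
from typing import Any, Dict, List


def placement_score_request(instance_types: List[str], target_vcpus: int,
                            regions: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for ec2.get_spot_placement_scores."""
    return {
        "InstanceTypes": instance_types,
        "TargetCapacity": target_vcpus,
        "TargetCapacityUnitType": "vcpu",
        "SingleAvailabilityZone": True,  # score individual AZs, not whole Regions
        "RegionNames": regions,
    }


def best_pools(scores: List[Dict[str, Any]], top: int = 3) -> List[Dict[str, Any]]:
    """Return the highest-scoring entries (scores range from 1 to 10)."""
    return sorted(scores, key=lambda s: s["Score"], reverse=True)[:top]


def fetch_scores(params: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Call the API; requires boto3, AWS credentials, and a default Region."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2")
    return ec2.get_spot_placement_scores(**params)["SpotPlacementScores"]


# Hypothetical Intel-only queue: sizes and capacity are placeholders.
INTEL_QUEUE = ["r5.4xlarge", "r6i.4xlarge", "r7i.4xlarge"]
REQUEST = placement_score_request(INTEL_QUEUE, 512, ["us-east-1", "us-west-2"])
```

With credentials configured, `best_pools(fetch_scores(REQUEST))` would list the Availability Zones most likely to fulfill the request. Running the same query for a separate AMD list gives a basis for choosing between the two queues at job submission time.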
Cost simulations
We analyzed the cost of running EDA workloads on a variety of cloud service configurations, including On-Demand Instances, Spot Instances, and Savings Plans, both with and without the integration of Infrastructure Optimizer, as illustrated in figure 1. Our aim was to determine the most cost-effective cloud solution for running semiconductor workflows. Our cost simulation covered a period of over 21 months and compared various configurations of On-Demand Instances, Spot Instances, and Savings Plans options. The results showed that Exostellar’s optimization, with 80 percent utilization of Spot Instances, could save hundreds of thousands of dollars without the necessity of upfront commitments or long-term contracts, as depicted in figure 1.
Graph Analysis
The graph presents the total cost for four configurations:
- OD + Spot (Short Job) – Running short jobs on Spot and longer jobs on On-Demand (OD) is the most expensive option, totaling $1,366,000. This combination offers flexibility but comes at a premium due to the higher costs associated with On-Demand Instances.
- OD + Spot + SP (Savings Plan) – Costs can be reduced to $1,014,000 by using a Savings Plan with all costs paid up front. The hourly commitment for the Savings Plan accounts for sustained instance usage with the remaining instances using On-Demand or Spot pricing. While reducing costs, it still involves significant expenses due to the upfront financial commitment.
- 100% in Spot (Theoretical) – A theoretical setup using only Spot Instances achieves a lower total cost of $816,000. However, lack of Spot availability and Spot terminations may cause lost work and productivity.
- Exostellar – 80% in Spot – With Exostellar’s Infrastructure Optimizer utilizing 80% Spot Instances and no long-term commitment, costs amount to $926,000. This approach offers a significant reduction in costs compared to traditional On-Demand and Savings Plan configurations, providing a flexible, cost-effective solution for EDA workloads without the need for upfront commitments.
Figure 1. Cost simulation outcomes from extensive testing of an enterprise-grade EDA tool on AWS, with and without Exostellar
The results demonstrate Exostellar’s ability to achieve substantial cost savings while avoiding upfront or long-term commitments. By optimizing workloads with 80% utilization of Spot Instances, Exostellar provides a flexible, cost-effective solution for semiconductor workflows.
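To put the four totals side by side, the following Python sketch computes the savings of each configuration relative to the most expensive one. The dollar figures are the totals listed above; the dictionary keys and the function name are just labels chosen for this example.

```python
from typing import Dict, Tuple

# Total simulated cost per configuration, in US dollars (from the list above).
TOTALS_USD: Dict[str, int] = {
    "OD + Spot (short jobs)": 1_366_000,
    "OD + Spot + Savings Plan": 1_014_000,
    "100% Spot (theoretical)": 816_000,
    "Exostellar, 80% Spot": 926_000,
}


def savings_vs(baseline: str, totals: Dict[str, int]) -> Dict[str, Tuple[int, float]]:
    """Return (absolute savings, fractional savings) against the baseline."""
    base = totals[baseline]
    return {
        name: (base - cost, (base - cost) / base)
        for name, cost in totals.items()
        if name != baseline
    }


if __name__ == "__main__":
    results = savings_vs("OD + Spot (short jobs)", TOTALS_USD)
    for name, (saved, fraction) in results.items():
        print(f"{name}: saves ${saved:,} ({fraction:.1%})")
```

Against the OD + Spot baseline, the Exostellar configuration saves $440,000 (about 32%), compared with $352,000 (about 26%) for the Savings Plan configuration and $550,000 (about 40%) for the theoretical all-Spot case.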
Testing an EDA workload on Infrastructure Optimizer
AWS and Exostellar set up a test to evaluate running an enterprise-grade, compute-intensive EDA tool in Infrastructure Optimizer VMs. The test focused on verifying that the live migration process does not change the application behavior or results. The EDA tool was a licensed tool, so part of the testing was also to validate that live migrations do not affect license checkouts. They selected a tool and workload that consisted of 427 jobs that each required eight cores and 32 GB of RAM. They used Slurm to manage the jobs and the AWS compute instances.
They first ran the workload on On-Demand Instances to set the performance baseline. Then they ran the workload on Infrastructure Optimizer VMs and forced live migrations every 110 seconds to create a worst-case scenario to evaluate if the Infrastructure Optimizer VMs or the live migration process would introduce errors. This resulted in 1,990 migrations across 427 jobs. No jobs failed and no workload behavior changes were observed. There were also no issues with the license server. Despite the high frequency of migrations, the impact on runtime was minimal, with an average increase of only 40 seconds per job, translating to about 8.6 seconds per migration, and only about a 7% overall slowdown. This demonstrated the efficiency and robustness of the live migration process.
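The per-migration figure follows directly from the job and migration counts. The short Python sketch below reproduces that arithmetic. The implied baseline runtime at the end is an inference, not a number reported in the test, and holds only if the roughly 7% slowdown applies to the average job.

```python
from typing import Tuple


def migration_overhead(jobs: int, migrations: int,
                       added_seconds_per_job: float) -> Tuple[float, float]:
    """Return (average migrations per job, added seconds per migration)."""
    per_job = migrations / jobs
    per_migration = added_seconds_per_job / per_job
    return per_job, per_migration


if __name__ == "__main__":
    per_job, per_migration = migration_overhead(427, 1990, 40)
    print(f"{per_job:.2f} migrations per job")
    print(f"{per_migration:.1f} s added per migration")

    # Inference only: a 40 s increase that amounts to about 7% implies an
    # average baseline runtime of roughly 40 / 0.07 seconds per job.
    print(f"~{40 / 0.07:.0f} s implied average baseline runtime")
```

At about 4.7 forced migrations per job, each migration adds roughly 8.6 seconds, which is small compared with the two-minute Spot interruption warning.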
The process of integrating Slurm with Exostellar’s solution involved the following steps:
Figure 2. The process of integrating and optimizing Exostellar with Slurm
- Configure Exostellar cloud compute nodes in the Slurm configuration.
- Configure the Slurm power saving API to use an Exostellar plugin to power up and power down Exostellar VMs.
- The plugin calls the Exostellar control plane to start EC2 instances.
- Exostellar provisions nested virtual machines (VMs) that join the Slurm cluster and run jobs, just as an end user would expect from a regular Amazon EC2 VM.
- These nested VMs are visible in the job queue and support all existing Slurm functionality.
- The Exostellar advisory service predicts Spot terminations, starts new instances, and live migrates jobs to the new instances without network connection interruptions or job terminations.
The live migrations are transparent to Slurm, because the nested VMs maintain their network address and all network connections with only a momentary application pause as the migration is finalized. This allows users to have a familiar experience while Exostellar seamlessly handles the complexity of shifting their jobs between the most cost-effective instances that are able to run their jobs. For example, during the life of the job it could shift between c6i, m6i, and r6i instances with minimal impact to performance.
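The first two integration steps rely on Slurm's power saving support, in which Slurm runs a resume program to start cloud nodes and a suspend program to stop them. The Python sketch below renders a `slurm.conf` fragment using standard Slurm parameters (`ResumeProgram`, `SuspendProgram`, `SuspendTime`, `ResumeTimeout`, and `State=CLOUD` nodes). The script paths, node names, partition name, and timing values are hypothetical placeholders; the actual Exostellar plugin may be installed and configured differently.

```python
def render_power_saving_conf(
    node_prefix: str,
    node_count: int,
    cpus: int,
    real_memory_mb: int,
    resume_program: str,
    suspend_program: str,
    suspend_time_s: int = 300,
    resume_timeout_s: int = 600,
    partition: str = "exo-intel",
) -> str:
    """Render slurm.conf lines for cloud nodes managed via power saving."""
    nodes = f"{node_prefix}-[1-{node_count}]"
    lines = [
        f"ResumeProgram={resume_program}",    # run by Slurm to power up nodes
        f"SuspendProgram={suspend_program}",  # run by Slurm to power down nodes
        f"SuspendTime={suspend_time_s}",      # idle seconds before suspending
        f"ResumeTimeout={resume_timeout_s}",  # seconds allowed for a node to join
        f"NodeName={nodes} CPUs={cpus} RealMemory={real_memory_mb} State=CLOUD",
        f"PartitionName={partition} Nodes={nodes} MaxTime=INFINITE State=UP",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Node shape loosely based on the test jobs (eight cores, 32 GB of RAM).
    print(render_power_saving_conf(
        node_prefix="exo",
        node_count=64,
        cpus=8,
        real_memory_mb=32000,
        resume_program="/opt/example/bin/resume.sh",    # hypothetical path
        suspend_program="/opt/example/bin/suspend.sh",  # hypothetical path
    ))
```

Because the nodes are declared with `State=CLOUD`, Slurm treats them as powered down until a job needs them, then calls the resume program with the node names to start. Users submit jobs to the partition as they would to any other.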
Achieving cost savings and reliability on AWS for EDA
The results of the testing demonstrated the reliability of the live migration process. This enables EDA tools to reliably run on Spot Instances with significant cost savings compared to On-Demand.
- Cost efficiency: Spot allows up to 90% discounts compared to On-Demand pricing without up-front payments or long-term commitments.
- Robust stress testing: During the testing, live migrations were triggered every 110 seconds. This resulted in 1,990 migrations across 427 jobs and an increase in run time of 7%. No jobs failed and we didn’t observe any workload behavior changes. The average frequency of Spot interruption across all Regions and Instance types has historically been <5%; the actual interruption rate for your workloads will depend on point-in-time available capacity.
- Seamless integration: The workload ran without any changes to the application or the scheduling API.
Conclusion
Semiconductor companies like Arm already use the AWS Cloud and Amazon EC2 Spot Instances to cost-effectively run EDA workloads at massive scale. Exostellar Infrastructure Optimizer makes it possible to do this even with long-running, memory-intensive, mission-critical workloads that run for days or even weeks. Testing confirmed that we can run EDA workloads on Spot Instances using Infrastructure Optimizer VMs to take advantage of up to 90% savings compared to running the same workloads on On-Demand Instances. Contact your AWS account team and Exostellar to learn how you can optimize your EDA infrastructure to increase performance and throughput, increase engineering productivity, and reduce costs.