AWS Public Sector Blog

Amazon EC2 Spot Instances for scientific workflows: Using generative AI to assess availability


In recent years, public sector organizations have found success running their scientific data processing workloads on Amazon Web Services (AWS). As the number of workloads increases alongside massive data volumes and complex scientific simulations, organizations are looking for ways to optimize cost while maintaining research momentum. Amazon EC2 Spot Instances present a compelling option: they let you use spare Amazon Elastic Compute Cloud (Amazon EC2) capacity at discounts of up to 90 percent compared to On-Demand prices. However, the interruptible nature of Spot Instances requires careful consideration, especially when handling time-sensitive, mission-critical workloads.

In this post, we discuss how organizations can effectively identify opportunities to use Spot Instances and Amazon Q Business, a generative AI–powered assistant that can answer questions and provide summaries based on data and information in your enterprise systems, to develop an enhanced Spot Instance analysis. 

Identifying suitable workloads for Spot Instances

When evaluating workloads for Spot Instance adoption, it’s critical that organizations carefully assess their scientific computing tasks based on mission criticality, time sensitivity, and operational characteristics. Because Spot Instances can be interrupted with a two-minute notification window, they’re not recommended for workloads that can’t tolerate instance interruption. The following patterns describe scenarios where organizations might consider Spot Instances as part of their scientific data processing architecture, subject to thorough evaluation against specific organizational requirements and compliance needs.

Short-running workloads

Workloads with short execution times can be candidates for Spot Instances because they have a lower probability of experiencing an interruption during their execution cycle. These workloads are likely to complete before a Spot Instance interruption occurs, and their shorter runtimes make it inexpensive to restart the process on a Spot Instance in a different capacity pool. However, organizations need to make sure these workloads have robust retry mechanisms and the ability to track completion status in case of interruption, even with shorter runtimes.

Fault-tolerant architectures

Scientific applications built with comprehensive fault tolerance mechanisms might be suitable for Spot Instances. These architectures typically include distributed computing frameworks that can handle node failures to maintain workflow state and restart failed tasks. Implementing checkpointing mechanisms is also critical to success to allow workloads to resume from their last known good state, whether on new Spot Instances or by failing over to On-Demand Instances as needed. AWS Fault Injection Service is a fully managed service that enables you to perform fault injection experiments on your AWS workloads and can also be used to test your application’s resilience to Spot Instance interruptions.
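As one building block for this kind of resilience, an application running on a Spot Instance can poll the instance metadata service for the two-minute interruption notice and trigger a checkpoint when one appears. The following Python sketch assumes IMDSv2 and the documented spot/instance-action metadata path; the polling cadence and the checkpoint action you would take are illustrative, not part of any specific solution:

```python
import json
from datetime import datetime, timezone

IMDS_BASE = "http://169.254.169.254/latest"

def parse_interruption_notice(body: str):
    """Parse the spot/instance-action document, for example:
    {"action": "terminate", "time": "2025-06-01T12:00:00Z"}"""
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], deadline.replace(tzinfo=timezone.utc)

def check_for_interruption():
    """Poll IMDSv2 once; returns (action, deadline) or None when
    no interruption is scheduled (the endpoint returns HTTP 404)."""
    import urllib.request, urllib.error  # stdlib only; no AWS SDK needed
    token_req = urllib.request.Request(
        IMDS_BASE + "/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    token = urllib.request.urlopen(token_req, timeout=2).read().decode()
    try:
        resp = urllib.request.urlopen(urllib.request.Request(
            IMDS_BASE + "/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token}), timeout=2)
        return parse_interruption_notice(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice pending
            return None
        raise
```

A worker would call `check_for_interruption()` on a short interval (for example, every 5 seconds) and write a checkpoint to external storage when a notice appears, leaving the remaining window for a clean shutdown.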

Bursting workloads

Scientific computing workloads often have a predictable baseline processing requirement with periodic spikes in computational needs. For example, federal agencies performing satellite imagery processing might have consistent daily analysis requirements but require additional compute capacity during data reprocessing periods when analyzing historical datasets with new algorithms. Although the baseline computational needs can be cost optimized through EC2 Instance Savings Plans and Amazon EC2 Reserved Instances, Spot Instances can be considered for handling burst capacity during peak processing periods, provided the application can handle instance interruptions.

Stateless workloads

Applications designed with stateless components can be suitable for Spot Instances because they don’t maintain critical state information on the instance itself. The workloads should store state in external, highly available storage services, which makes them more resilient to instance termination. Organizations should validate that there is proper testing of state management and recovery procedures before implementing Spot Instances in production environments.

Time-flexible workloads

Scientific workloads without strict completion deadlines can be candidates for Spot Instances. These include data pipelines that aren’t time-critical, where processing can occur over extended periods and can accommodate interruptions while waiting for new capacity. The ability to schedule workloads during off-peak hours can also provide access to more stable Spot Instance capacity, though this should be verified through careful capacity planning.

Parallel data processing workloads

Scientific workflows that can be parallelized across multiple nodes present opportunities for Spot Instance usage, particularly when individual tasks can be processed independently. In the event of a Spot Instance interruption, only the specific parallel task needs to be reprocessed while other computations continue unaffected. Organizations should implement proper job tracking and task queue management to confirm that failed tasks are properly rescheduled.
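One common way to get this job tracking largely for free is a queue with a visibility timeout: a worker deletes a message only after its task succeeds, so a task cut short by a Spot interruption reappears on the queue and is picked up by another worker. The sketch below illustrates the pattern with Amazon SQS; the queue URL, batching scheme, and processing callback are placeholders for illustration, not part of the solution in this post:

```python
def split_into_tasks(items, batch_size):
    """Split independent work items (e.g. satellite scene IDs) into
    batches, each of which becomes one queue message."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def enqueue_tasks(queue_url, batches):
    import boto3, json  # imported here so the pure helper above has no SDK dependency
    sqs = boto3.client("sqs")
    for batch in batches:
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(batch))

def worker_loop(queue_url, process):
    """Run on each (Spot) worker node; `process` is your task function."""
    import boto3, json
    sqs = boto3.client("sqs")
    while True:
        msgs = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in msgs.get("Messages", []):
            process(json.loads(msg["Body"]))
            # Delete only after success: if a Spot interruption kills the
            # worker mid-task, the visibility timeout expires and the
            # message is redelivered to another worker automatically.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```

The queue's visibility timeout should be set longer than the expected runtime of a single task so that in-progress work is not redelivered prematurely.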

Best practices to use Spot Instances and generate Spot placement score analysis with Amazon Q Business

After identifying suitable workloads for Spot Instances, organizations should implement several best practices to maximize availability while optimizing costs, as detailed in the AWS Compute Blog post “Best practices to optimize your Amazon EC2 Spot Instances usage.” The post dives into areas including instance diversification, attribute-based instance type selection, allocation strategy, and Spot placement scores.

In this post, we focus on using the Spot placement score, a Spot Instances feature that indicates how likely it is that a Spot request will succeed in an AWS Region or Availability Zone, on a scale of 1–10. A score of 1 indicates a low likelihood of success, and 10 represents the highest likelihood. The Spot placement score fluctuates as capacity changes, but it is particularly valuable for:

  • Identifying optimal instance type combinations for capacity needs
  • Simulating future Spot capacity requirements
  • Selecting the most suitable Availability Zone for Single-AZ workloads
  • Planning cross-Region capacity relocation strategies

To obtain accurate Spot placement scores, configurations must include at least three different instance types, allowing for better capacity pool diversification.
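For illustration, a Spot placement score request can be issued programmatically with the EC2 GetSpotPlacementScores API. The sketch below passes three instance types, in line with the diversification guidance above; the specific instance types, Regions, and target capacity are examples only:

```python
def fetch_spot_placement_scores(region_names, instance_types, target_capacity):
    """Query the EC2 Spot placement score API for a diversified
    configuration. Pass at least three instance types for a
    meaningful score."""
    import boto3  # imported here so best_target below has no SDK dependency
    ec2 = boto3.client("ec2")
    resp = ec2.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        TargetCapacityUnitType="units",
        RegionNames=region_names,
    )
    return resp["SpotPlacementScores"]

def best_target(scores):
    """Pick the Region (or Availability Zone) entry with the
    highest placement score (1-10)."""
    return max(scores, key=lambda entry: entry["Score"])
```

For example, `best_target(fetch_spot_placement_scores(["us-east-1", "us-west-2"], ["r6i.8xlarge", "r5.8xlarge", "r6a.8xlarge"], 100))` would return the Region entry most likely to fulfill a 100-instance request for that memory-optimized configuration.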

Enhanced EC2 Spot placement score analysis assistant

The existing Spot placement score tracker solution allows organizations to automatically capture Spot placement scores at five-minute intervals and visualize them through Amazon CloudWatch dashboards. Although this provides valuable baseline monitoring, organizations often require more sophisticated analytical insights to make data-driven decisions about their Spot utilization strategy.

To address this need, we have enhanced the Spot placement score tracker solution by implementing a Spot analysis assistant using Amazon Q Business. You can use the assistant to perform comparative analysis for Spot capacity across different attributes and query Spot placement score trends across dimensions such as temporal patterns, AWS Regions, instance configurations, and capacity variations. After you’ve gained insights into the estimated Spot capacity available, you can ask questions about AWS general best practices for Spot Instances by sending queries directly to the large language model (LLM) powering the assistant.

Solution overview

The following figure shows the solution architecture. Using this solution, you can view the dashboard from the Spot placement score tracker and interact directly with the Spot analysis assistant for in-depth analysis.

Figure 1. Architectural workflow for the Spot analysis assistant showing the major components, including Amazon EventBridge, AWS Lambda, Amazon EC2, Amazon S3, Amazon CloudWatch, and Amazon Q Business.

Solution walkthrough

Prerequisites – install the Spot placement score tracker

The Spot placement score tracker solution works as described in the following steps. Although the tracker provides a default configuration file, you should modify it to include the Spot Instance configurations that match your requirements.

  1. The Amazon EventBridge cron functionality starts the execution of the spotPlacementScores AWS Lambda function every five minutes.
  2. The Lambda function retrieves dashboard configuration files in YAML from Amazon Simple Storage Service (Amazon S3).
  3. The Lambda function handles batches of metric requests. For each request, it queries the Amazon EC2 API Spot placement score feature to get a Spot placement score.
  4. The Lambda function retrieves the Spot placement score responses and creates metrics in CloudWatch based on the metrics configuration specified in the project metric configuration file.
  5. CloudWatch collects metrics for various workloads and populates the CloudWatch Spot placement score dashboards. You can access these dashboards to optimize your Amazon EC2 Spot Instance requests.
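The metric-publishing portion of the steps above (steps 3–5) can be sketched as follows. This is a simplified illustration rather than the solution's actual Lambda code, and the CloudWatch namespace and dimension names are hypothetical:

```python
def score_to_metric(config_name, entry):
    """Map one Spot placement score entry (as returned by
    GetSpotPlacementScores) to a CloudWatch metric datum."""
    return {
        "MetricName": "SpotPlacementScore",
        "Dimensions": [
            {"Name": "DiversifiedConfig", "Value": config_name},
            {"Name": "Region", "Value": entry["Region"]},
        ],
        "Value": entry["Score"],
        "Unit": "None",
    }

def publish_scores(config_name, entries):
    """Publish one datum per Region score under a hypothetical namespace."""
    import boto3
    cw = boto3.client("cloudwatch")
    cw.put_metric_data(
        Namespace="SpotPlacementScoreTracker",  # illustrative namespace
        MetricData=[score_to_metric(config_name, e) for e in entries],
    )
```

Publishing one datum every five minutes per configuration and Region is what allows the CloudWatch dashboards to plot score trends over time.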

Spot analysis assistant

The Spot analysis assistant is then implemented in the following steps:

  1. A daily scheduled Amazon EventBridge rule triggers a Lambda function to retrieve the metrics from CloudWatch.
  2. The Lambda extraction function retrieves the previous 24 hours of metrics data and writes it to Amazon S3 as a raw file.
  3. The new file in Amazon S3 triggers an S3 event that starts the transform-and-append Lambda function, which processes the raw metrics data, performs column transformations such as renaming configuration columns, and appends the result to a single consolidated file.
  4. The consolidated file in Amazon S3 is uploaded directly to the Amazon Q Business interface (50 MB maximum for a single document).
  5. You can then ask questions of the Spot analysis assistant.
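The daily extraction step (steps 1–2 above) can be sketched as follows. Again, this is an illustration rather than the solution's actual code; the bucket, object key, and CSV column names are hypothetical:

```python
import csv
import io
from datetime import datetime, timedelta, timezone

def to_csv(result, config_name):
    """Flatten one GetMetricData result (parallel Timestamps/Values
    lists) into CSV rows of (timestamp, configuration, score)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["timestamp", "configuration", "score"])
    for ts, value in zip(result["Timestamps"], result["Values"]):
        writer.writerow([ts.isoformat(), config_name, value])
    return buf.getvalue()

def extract_last_24h(metric, config_name, bucket, key):
    """Pull 24 hours of 5-minute score datapoints and write raw CSV to S3."""
    import boto3
    now = datetime.now(timezone.utc)
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_data(
        MetricDataQueries=[{
            "Id": "sps",
            "MetricStat": {"Metric": metric, "Period": 300, "Stat": "Average"},
        }],
        StartTime=now - timedelta(hours=24),
        EndTime=now,
    )
    body = to_csv(resp["MetricDataResults"][0], config_name)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode())
```

Keeping the consolidated file in a flat tabular shape like this makes it easier for Amazon Q Business to answer comparative questions across configurations, Regions, and time windows.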

The following section shows examples of responses generated from the Spot analysis assistant.

Analyzing Spot placement score trends

The following question asks about the Spot placement score trends observed in the dataset. The assistant generated a breakdown of the Spot trends across the different configuration types, describing a steady score of 9 for most configurations, except for the p24xlarge and above instances in the US East (N. Virginia) us-east-1 Region, which had a consistently low score of 3.

The following screenshot shows the question, “What are the Spot placement score trends that you’re seeing?” and the response from the Spot analysis assistant.

Figure 2. Response to the first question regarding Spot placement score trends, breaking it down by configuration types.

Region selection for instance configuration

The question in the following screenshot asks for a Region recommendation for a specific configuration group and the reasoning behind that recommendation. The assistant provided a score comparison and analysis of the stability of the Spot placement score between US East (N. Virginia) us-east-1 and US West (Oregon) us-west-2 Regions for the r 8xl and above configuration. The assistant recommends US East (N. Virginia) us-east-1 as the preferred Region because it provides a higher Spot placement score and more stability.

Figure 3. Response to the second question regarding Region selection for the r 8xl and above configuration.

Best practices for deploying Spot Instances

After a specific configuration and Region have been selected through the analysis process, the assistant also allows you to ask questions regarding best practices to deploy Spot Instances. The response from the assistant dives into areas such as capacity optimization, monitoring, logging and tracking, and backup planning, as shown in the following screenshot.

Figure 4. Response to the third question regarding best practices when deploying a specific configuration of Spot Instances on AWS.

Conclusion

For public sector organizations exploring Spot Instances for scientific data processing pipelines, this post outlines key workload patterns and evaluation criteria to effectively incorporate Spot Instances into their architecture. The introduction of the Spot analysis assistant, powered by Amazon Q Business, enhances the existing Spot placement score tracker by providing intuitive analysis of placement score trends across different configurations, Regions, and time periods so users can make data-driven decisions about their Spot utilization strategy. Although generative AI applications are nondeterministic in nature, the assistant serves as a valuable aid for customers to dive deeper into these best practices and trends for validation.

By combining the cost benefits of Spot Instances with intelligent analysis capabilities, organizations can now confidently identify optimal instance configurations, analyze Regional capacity trends, and implement best practices for Spot deployment while maintaining operational efficiency in their scientific computing workloads. This solution demonstrates the AWS commitment to making advanced cost optimization tools more accessible to public sector customers, helping them maximize their research impact while optimizing cloud spending.

To get support with evaluating Spot Instances for your existing workloads or deploying the Spot analysis assistant, reach out to your AWS account team.