Cross-account HPC cluster monitoring using Amazon EventBridge

With large automated workflows deploying fleets of HPC clusters, monitoring the status and resource consumption of the underlying instances can be challenging. The difficulty increases as these fleets grow and span multiple AWS accounts and Regions.

Many customers and partners have faced this issue and built their own solutions, often using SSH and agents for polling. These methods can introduce security and privacy risks.

In this post, we’ll show you how AWS built a serverless solution to help users monitor the status of Amazon Elastic Compute Cloud (Amazon EC2) instances deployed in an HPC environment. Using this approach, administrators can create a secure and lightweight status monitoring system, sending only relevant information to a separate monitoring account. We’ve designed this reference solution to be customizable, allowing you to monitor the specific metrics and data you need.

Challenge to solve

A major benefit of moving HPC workloads to the cloud is the flexibility to pay only for what you use. HPC workloads are typically large and temporary. They often run on clusters of many instances, where a fault in a single thread or hardware component can cause the entire workload to fail. Since these workloads can last from hours to weeks, it’s important to monitor these jobs not just for the cost of compute infrastructure and software licenses, but also for delivery times and deadlines. Failures can arise from a variety of sources, ranging from infrastructure-level problems such as running out of memory or disk, or over-parallelization leading to under-utilization of the CPU, to application-level issues such as solution divergence, missing input files, incorrect input parameters, or insufficient license features.

Because HPC workloads in industries such as manufacturing, financial services, and weather prediction rely on large ephemeral fleets of instances, end users often find it challenging to identify and monitor the instances they’ve deployed. Commonly, customers or partners build their own solutions using homegrown Bash or Python scripts to collect logs, or even monitor the scheduler or compute nodes manually. But if this custom solution fails, for example when a process stalls and stops returning logs, you may be left unaware of which instances are idle and consuming budget.

Use cases

Given those challenges, we designed our solution to address two main use cases:

Centralized monitoring for divisional accounts: An organization with many divisions using separate AWS accounts to deploy elastic EC2-based HPC clusters wants a central IT administration group to monitor these resources in real time. This provides telemetry and visibility from one centralized source.
Third-party HPC management: A third party managing HPC deployments in customer accounts wants to help customers track usage, create budgets, send notifications, and provide visibility into HPC use. Customers also prefer to share only relevant logs and activities with the partner.

Solution overview

The AWS cross-account HPC telemetry reference architecture helps you monitor HPC resources across accounts securely. It is designed for organizations that use multiple AWS accounts to support separate HPC user groups or cost centers, and it also suits HPC management and SaaS partners with solutions deployed in customer accounts. It offers a secure way to monitor resources without direct access to the compute resources or environment. This improves the HPC end-user experience and helps organizations understand their HPC usage, while partners can offer better management tools to their customers.

This solution demonstrates how different AWS services can be leveraged to monitor state changes for Amazon EC2-based HPC clusters. In Figure 1, the infrastructure is deployed across separate AWS accounts:

HPC cluster account(s) – These accounts are where the Amazon EC2 instance-based HPC compute clusters are deployed.
Centralized monitoring account – This is a centralized account to which one or more HPC cluster accounts send cluster status notifications.

Figure 1: HPC cluster monitoring architecture that uses Amazon EventBridge, AWS Lambda and Amazon CloudWatch cross-account observability to send logs across AWS accounts.

The solution monitors Amazon EC2 instance state changes (start, stop, terminate) in the HPC cluster accounts. The cluster instances are tagged with a custom label to identify the correct instances and associated cluster. These events are passed to Amazon EventBridge, a serverless service enabling event-driven architecture. EventBridge extracts, filters, and forwards state change events to the simple Lambda function we’ve developed for this architecture. This function filters on the instance tags and stores event information such as instance IDs, custom cluster instance tags (by default, it searches for tags with “HPC”), and instance state changes in Amazon CloudWatch Logs.
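The flow above can be sketched in Python. This is a minimal illustration, not the function shipped with the reference architecture. The event pattern uses the standard EventBridge fields for EC2 state-change notifications. The names `find_cluster_tag`, `build_record`, and `fetch_tags`, the output field names, and the choice to match “HPC” against either the tag key or the tag value are assumptions made for this example. EC2 state-change events do not carry tags, so the sketch takes a hypothetical `fetch_tags` callable that a real deployment would back with an EC2 API lookup.

```python
import json

# EventBridge event pattern matching EC2 instance state changes.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["running", "stopped", "terminated"]},
}

TAG_MARKER = "HPC"  # default marker named in the post


def find_cluster_tag(tags, marker=TAG_MARKER):
    """Return the first (key, value) tag containing the marker, or None.

    Assumption: the marker may appear in the tag key or the tag value.
    """
    for key, value in sorted(tags.items()):
        if marker in key or marker in value:
            return key, value
    return None


def build_record(event, tags):
    """Reduce a state-change event to the fields worth logging."""
    match = find_cluster_tag(tags)
    if match is None:
        return None  # not an HPC cluster instance, so log nothing
    key, value = match
    return {
        "account": event["account"],
        "instance_id": event["detail"]["instance-id"],
        "state": event["detail"]["state"],
        "tag": {key: value},
    }


def handler(event, context=None, fetch_tags=None):
    """Lambda-style entry point; fetch_tags is a hypothetical tag lookup."""
    instance_id = event["detail"]["instance-id"]
    tags = fetch_tags(instance_id) if fetch_tags else {}
    record = build_record(event, tags)
    if record is not None:
        # Anything a Lambda function prints is captured in CloudWatch Logs.
        print(json.dumps(record))
    return record
```

Keeping the tag lookup injectable makes the filtering logic easy to exercise locally with a plain dictionary before wiring it to a real account.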

The Lambda function deployed in the HPC cluster account allows you to inspect and choose only relevant CloudWatch data to send. Moreover, you can explicitly mask CloudWatch log data types with data protection policies to further safeguard sensitive data. By default, only the source AWS account ID, instance ID, instance state, and the filtering tag on the instance are passed. CloudWatch cross-account observability then securely shares this information across AWS account boundaries. The monitoring account requires a predefined CloudWatch service-linked role, called AWSServiceRoleForCloudWatchCrossAccount, while the source account requires a CloudWatch-CrossAccountSharingRole that explicitly allows the monitoring account ID or organization ID to assume it.
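To make the sharing-role requirement concrete, here is a hedged sketch of what a trust policy for the source account's role could look like, built as a Python dictionary. It uses the standard IAM trust policy shape (`sts:AssumeRole` with an account principal, or the `aws:PrincipalOrgID` condition key for an organization). The helper name `sharing_role_trust_policy` and the example IDs are hypothetical, and the templates in the reference architecture may differ.

```python
import json


def sharing_role_trust_policy(monitoring_account_id=None, organization_id=None):
    """Build an IAM trust policy for a sharing role in the source account.

    Exactly one of monitoring_account_id or organization_id must be given.
    Illustrative only: the reference architecture's own policy may differ.
    """
    if bool(monitoring_account_id) == bool(organization_id):
        raise ValueError(
            "pass exactly one of monitoring_account_id or organization_id"
        )

    statement = {"Effect": "Allow", "Action": "sts:AssumeRole"}
    if monitoring_account_id:
        # Trust a single monitoring account.
        statement["Principal"] = {
            "AWS": f"arn:aws:iam::{monitoring_account_id}:root"
        }
    else:
        # Trust any principal that belongs to the given organization.
        statement["Principal"] = {"AWS": "*"}
        statement["Condition"] = {
            "StringEquals": {"aws:PrincipalOrgID": organization_id}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    # Placeholder account ID, for illustration only.
    policy = sharing_role_trust_policy(monitoring_account_id="111122223333")
    print(json.dumps(policy, indent=2))
```

Scoping trust to a single account ID is the narrower option. The organization-wide form is convenient when the monitoring account may change, at the cost of trusting every account in the organization.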

You can then link multiple source accounts to a single monitoring account, and remove those links at any time. These logs can now be sorted, aggregated and consumed in the monitoring account and visualized with a CloudWatch Dashboard (Figure 2) or other tools.

Figure 2: Example CloudWatch dashboard of returned EC2 instance status metrics in the monitoring account. Panel sections include the current status, the number of stopped and running instances per cluster, and a timestamped record of instance state changes.
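The per-cluster counts shown on a dashboard like this can be derived from the logged state-change records. The sketch below shows one way to do that aggregation in plain Python, outside CloudWatch. The record fields (`time`, `instance_id`, `cluster`, `state`) and the function name `summarize` are assumptions for illustration, and timestamps are assumed to be ISO 8601 strings in a single consistent format so that they sort chronologically.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Count instances per cluster by their most recent state.

    Each record is a dict with the assumed keys
    'time', 'instance_id', 'cluster', and 'state'.
    """
    # Keep only the latest record for each instance.
    latest = {}
    for rec in sorted(records, key=lambda r: r["time"]):
        latest[rec["instance_id"]] = rec

    # Tally current states per cluster.
    summary = defaultdict(Counter)
    for rec in latest.values():
        summary[rec["cluster"]][rec["state"]] += 1
    return {cluster: dict(counts) for cluster, counts in summary.items()}


if __name__ == "__main__":
    sample = [
        {"time": "2024-05-01T12:00:00Z", "instance_id": "i-01",
         "cluster": "cfd-01", "state": "running"},
        {"time": "2024-05-01T12:05:00Z", "instance_id": "i-02",
         "cluster": "cfd-01", "state": "running"},
        {"time": "2024-05-01T13:00:00Z", "instance_id": "i-01",
         "cluster": "cfd-01", "state": "stopped"},
    ]
    print(summarize(sample))
```

Because only the latest record per instance is counted, an instance that started and later stopped contributes to the stopped total alone, which matches a "current status" view.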

Conclusion

This serverless architecture, using Amazon EventBridge, AWS Lambda, and Amazon CloudWatch Logs, provides a secure and effective way to monitor EC2 instances of HPC environments across multiple accounts. It’s designed as a flexible framework that you can customize with more advanced metrics, sorting, parsing, data visualization, notifications, and event-driven actions.

For more details on how to set up the solution, visit the AWS Architecture Center. You can also download sample code and scripts to deploy this reference architecture from our GitHub repository.

AWS HPC Blog
