AWS HPC Blog

The Convergent Evolution of Grid Computing in Financial Services

If you’re a regular reader of the AWS High Performance Computing (HPC) Blog, you’ll be familiar with various solutions for tightly coupled HPC workloads serving customers in industries such as life sciences or weather forecasting. The Financial Services industry also makes significant use of HPC, but it tends to be in the form of loosely coupled, embarrassingly parallel workloads to support risk modelling. Infrastructure tends to scale out to meet ever-increasing demand, but how long can this continue?

The optimal delivery of compute capacity to support these workloads is a topic which specialists at AWS have been exploring for a number of years, and through discussions with our customers across capital markets, banking, and insurance we’ve discovered some emergent themes. While no two organizations are the same, there is some convergence in the way they are approaching both the challenge of meeting the demands of their business lines today and the opportunities that AWS presents for the future. A colleague likes to point out that this is a form of convergent evolution: just as insects, bats, and birds each developed flight separately and in parallel, we’re seeing customers from across banking converge on some common patterns to address their HPC requirements.

The aim of this post is to highlight some of these themes, to challenge the way that HPC teams think about how they deliver compute capacity today, and to show how we see the solutions converging for the future.

What’s different?

Managing an HPC system on premises is fundamentally different from running in the cloud, and it changes the nature of the challenge. In a traditional on-premises environment, HPC teams direct the majority of their energy into maximizing the utilization of a fixed set of resources. Essentially, they exist to ensure that the large (and expensive!) compute infrastructure is utilized in a way which aligns with the priorities of the business. At times of constrained capacity, these teams rely heavily on a scheduler to make very rapid decisions about the relative priority of pending tasks to ensure they are optimally placed. Inevitably there will be winners and losers, and some tasks will just have to wait.

In the cloud, where there is a presumption of near-limitless capacity and cost is a function of capacity and time, the question of how to schedule tasks is different. If the queue of pending tasks grows above zero, why not just increase capacity and have the result sooner? The challenge therefore becomes less about scheduling and more about capacity orchestration; the decision is no longer ‘which task next?’ but ‘how fast and at what cost?’. This can be transformative for businesses, enabling them to make short-term decisions about how quickly to price a trade or which portfolios to run through additional scenarios without locking themselves into long-term commitments to infrastructure.
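
To make that trade-off concrete, here is a minimal sketch of the ‘how fast and at what cost?’ calculation for an embarrassingly parallel workload. All of the figures (task counts, runtimes, instance sizes, and prices) are purely illustrative assumptions, but they show why the decision shifts from prioritization to orchestration: because cost is roughly capacity multiplied by time, finishing much sooner by scaling wider often costs about the same.

```python
import math

# A minimal sketch of the 'how fast and at what cost?' decision.
# All figures below (task count, runtimes, prices) are illustrative assumptions.
pending_tasks = 40_000          # tasks waiting in the queue
task_runtime_secs = 12          # average runtime of one task on one vCPU
tasks_per_instance = 64         # vCPUs (and therefore parallel tasks) per instance
price_per_instance_hour = 2.50  # assumed On-Demand price in USD

def capacity_plan(deadline_mins: float) -> tuple[int, float]:
    """Return (instances needed, approximate cost) to clear the queue by the deadline."""
    total_core_secs = pending_tasks * task_runtime_secs
    instance_secs_needed = total_core_secs / tasks_per_instance
    instances = math.ceil(instance_secs_needed / (deadline_mins * 60))
    cost = instances * (deadline_mins / 60) * price_per_instance_hour
    return instances, cost

for deadline in (120, 30, 10):  # run the same workload faster by scaling wider
    n, cost = capacity_plan(deadline)
    print(f"{deadline:>3} min deadline -> {n:>3} instances, ~${cost:,.2f}")
```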

The following sections of this post broadly reflect the evolution of approaches to the opportunities of HPC in the cloud. AWS has financial services customers at each of these stages, and they do not necessarily progress through each in turn; some look to ‘leapfrog’ stages to reach their future state more quickly.

Bursting for capacity augmentation

One common question is how to bridge the gap between existing infrastructure and the cloud. Customers often seek to augment on-premises capacity with ‘burst’ patterns. This approach has the advantage of helping to solve for peak demand while allowing the customer to leverage their existing investment in infrastructure and software with minimal change. However, there are some downsides, not least the additional complexity of running both a static and a highly variable set of infrastructure. There are also decisions to be made about how the ‘burst’ is triggered, be it scheduled, demand-based, or predictive. The operational processes for fixed, long-lived on-premises infrastructure are quite different from those for short-lived, ephemeral compute instances in the cloud. As a result, there is inevitably divergence in processes for tasks such as patching or access control, and teams will need to solve for both.
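
As an illustration of a demand-based trigger, the sketch below watches the on-premises backlog and adjusts the desired capacity of an EC2 Auto Scaling group that hosts the burst workers. The group name, thresholds, sizing figures, and the backlog query are all assumptions standing in for a real scheduler integration.

```python
import math
import boto3

# A minimal sketch of a demand-based burst trigger, assuming the on-premises
# scheduler can report its pending-task backlog and that burst capacity sits in
# an EC2 Auto Scaling group. Names, thresholds, and sizing here are illustrative.
ASG_NAME = "grid-burst-workers"   # hypothetical Auto Scaling group for cloud burst capacity
BURST_THRESHOLD = 5_000           # start bursting once the backlog exceeds this many tasks
TASKS_PER_INSTANCE = 500          # rough sizing assumption
MAX_BURST_INSTANCES = 100         # cap agreed with the business

autoscaling = boto3.client("autoscaling")

def pending_on_prem_tasks() -> int:
    """Stand-in for a call into the on-premises scheduler's reporting API."""
    return 12_000  # replace with the real backlog query

def maybe_burst() -> None:
    backlog = pending_on_prem_tasks()
    if backlog <= BURST_THRESHOLD:
        return
    # Size the cloud fleet for the excess demand only; the fixed on-premises
    # estate continues to work through its share of the queue.
    desired = min(MAX_BURST_INSTANCES,
                  math.ceil((backlog - BURST_THRESHOLD) / TASKS_PER_INSTANCE))
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=desired,
        HonorCooldown=True,
    )
```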

‘Lift and shift’

Some customers look to ‘lift and shift’ HPC into the cloud; this can be a first step, or it may follow an initial ‘burst’ deployment as the on-premises infrastructure reaches end-of-life. This approach greatly simplifies the architecture and operations because there is a consistent platform to build upon. The ability to elastically scale compute according to the demands of the day greatly enhances the efficiency of these grids, with customers able to process the same total compute load in a much shorter window by scaling up capacity.

The ability to describe HPC infrastructure in code also means that it’s possible to create additional environments on demand. This has two effects. First, it’s a catalyst for innovation, as quant teams can provision short-lived clusters (up to production scale if necessary) to test new models or approaches. Second, every cluster that runs in the cloud can be scaled granularly according to demand. The traditional model of building large, centralized, multi-tenant grids to maximize efficiency no longer applies. Consumer groups can have their own clusters, configured according to their individual requirements and optimally sized for the demands of the moment.
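
As a hypothetical example of what ‘environments on demand’ can look like in practice, the following sketch provisions a short-lived, team-specific cluster from an infrastructure-as-code template. The stack name, template URL, and parameter keys are all assumptions, and the same idea applies equally to the AWS CDK or other tooling; when the experiment is finished, the stack is simply deleted and the cost stops.

```python
import boto3

# A minimal sketch of spinning up a short-lived, right-sized cluster environment
# on demand from a template. The template URL, stack name, and parameter keys
# are hypothetical placeholders for an organization's own templates.
cloudformation = boto3.client("cloudformation")

def create_test_cluster(team: str, max_vcpus: int) -> str:
    """Provision an ephemeral cluster for a single consumer group and return its stack ID."""
    response = cloudformation.create_stack(
        StackName=f"grid-{team}-test",
        TemplateURL="https://example-bucket.s3.amazonaws.com/grid-cluster.yaml",  # assumed template
        Parameters=[
            {"ParameterKey": "TeamName", "ParameterValue": team},
            {"ParameterKey": "MaxvCpus", "ParameterValue": str(max_vcpus)},
        ],
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    return response["StackId"]

# Example: a quant team provisions a production-scale environment for a model test.
stack_id = create_test_cluster("rates-quant", max_vcpus=10_000)
```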

Serverless, event-driven architectures

At this point customers start to realize that their grid scheduler, which is optimized to make rapid prioritization decisions, is less valuable. As long as capacity orchestration is straightforward, the requirement to capture tasks and manage their lifecycle can largely be handled with a simple queue.
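
For example, a producer can capture valuation tasks with nothing more than a queue. The sketch below assumes an Amazon SQS queue named risk-tasks and an illustrative task payload; the field names are placeholders, not a prescribed schema.

```python
import json
import boto3

# A minimal sketch of capturing pricing tasks with a simple queue rather than a
# full grid scheduler. The queue name and task payload fields are assumptions.
sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="risk-tasks")["QueueUrl"]

def submit_task(trade_id: str, scenario: str) -> None:
    """Enqueue one valuation task; workers pick tasks up in roughly arrival order."""
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"trade_id": trade_id, "scenario": scenario}),
    )

for trade in ("T-1001", "T-1002", "T-1003"):
    submit_task(trade, scenario="EOD-base")
```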

Once tasks are placed into a queue, there are a number of options for servicing it. Simple ‘greedy worker’ processes running on Amazon EC2 instances are easy to scale and typically straightforward to implement with existing analytics. However, customers are increasingly looking to simplify further by removing servers altogether, either by running these processes in containers managed with AWS Fargate or by implementing event-driven architectures with Amazon SQS and AWS Lambda.
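
The sketch below illustrates both options side by side: a ‘greedy worker’ loop that could run on EC2 or in a Fargate container, and the same placeholder pricing function behind an AWS Lambda handler with an SQS event source. The pricing function, result shape, and message schema are assumptions standing in for a firm’s existing analytics.

```python
import json
import boto3

# A minimal sketch of the two worker patterns described above. price_task() is a
# stand-in for the firm's existing analytics; the message schema is an assumption.
sqs = boto3.client("sqs")

def price_task(task: dict) -> dict:
    """Placeholder for the existing pricing/risk analytics library."""
    return {"trade_id": task["trade_id"], "pv": 0.0}  # assumed result shape

# Option 1: a 'greedy worker' loop on an EC2 instance or Fargate container. Each
# worker keeps pulling work until the queue drains, so adding workers adds throughput.
def run_worker(queue_url: str) -> None:
    while True:
        messages = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
        ).get("Messages", [])
        for msg in messages:
            result = price_task(json.loads(msg["Body"]))
            # ... persist the result to a results store, then remove the message
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

# Option 2: the same analytics behind an event-driven AWS Lambda function with an
# SQS event source; Lambda takes care of polling, scaling, and retries.
def lambda_handler(event, context):
    return [price_task(json.loads(record["body"])) for record in event["Records"]]
```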

It is this serverless, event-driven future which we see as the logical point of convergence for these workloads, as well as for the systems that make up the ecosystem around HPC platforms. Risk management systems will similarly move into the cloud to benefit from the elasticity and scalability it offers. Canonical market and trade data sources will also move to be closer to their consumers and to open up new analytical opportunities. Downstream systems will likely migrate too, as they inevitably become a bottleneck to the scalable processing of cloud-based systems upstream.

This is a fascinating and complex topic which is evolving as customers find new approaches and as AWS develops new instances, services and features to serve their needs. To help guide customers we’ve created and recently updated the Financial Services Grid Computing on AWS whitepaper which explores all of these themes in more detail. Additionally, if you’d like to explore these concepts further, please connect with your account team or make a request through the AWS Financial Services contact form.

Alex Kimber

Alex Kimber is a Principal Solutions Architect in AWS Global Financial Services with over 20 years of experience in building and running high performance grid computing platforms in investment banks.