Migration & Modernization

Modernizing and scaling a tightly coupled legacy .NET application on AWS

AWS collaborated with an industrial company to modernize a legacy .NET Framework application using an Experience-Based Acceleration (EBA) program. In this blog, we explore the EBA process and the related technical approach, which helped our customer modernize the application, improving performance by 16.5x while reducing costs by 50%.

The EBA program is a transformative approach that uses hands-on, agile, and immersive engagements to speed up your organization’s digital transformation and cloud value realization. Hundreds of enterprises have used EBA to build cloud foundations, migrate at scale, modernize their businesses, and innovate for their customers. They have succeeded because of a proven learn-by-doing working model that scales and helps drive business value. A modernization EBA includes two primary workflows:

  • A series of discovery sessions to identify a meaningful scope of work to facilitate migration and modernization of a legacy workload.
  • A series of hands-on, developer-focused workshops that are typically 3–5 days in duration. In these workshops, developers work closely with AWS technical subject matter experts to build and deliver a functioning minimum viable product (MVP) on the AWS Cloud.

At the start of the modernization, the application was hosted on virtual machines in another cloud. It consisted of a web application layer, an API gateway, and a series of backend calculation engines running inside containers. The legacy application also had performance challenges, including long runtimes (~7–8 hours) to complete a single job, which drove up costs for these long-running resources and resulted in high error rates.

Use of EBA mechanism

For the EBA, AWS collaborated closely with the customer team to identify areas of technical ambiguity during a discovery workshop. In this workshop, the customer delineated the current-state architecture and defined success metrics for the engagement (performance, cost, user experience). They also identified stakeholders across functions (application development, DevOps, security, product) to participate in subsequent workshops. With scope and objectives well-defined, the EBA team devised a week-long sprint in which the customer and AWS teams delivered a functional minimum viable product.

Overall technical approach

In the initial discovery workshop, the AWS team pinpointed two primary bottlenecks with the existing application:

  • Each calculation engine was hosted in a container image along with a localized queue. As requests were received, they were distributed across all running VMs (queued locally within each container instance), resulting in inadequate scaling. For example, when there was an influx of requests, all incoming requests were dispersed among the existing nodes before the scaling mechanism activated.
  • The Windows-based container image hosting the calculation engines was approximately 12GB in size. Each time a new virtual machine (VM) was provisioned, the image was copied to the VM, which added 5 minutes to the scale-up time.

Following AWS best practices, the first step was a re-host migration, followed by optimizing the application to remove the bottlenecks. During the re-host, each container retained a dedicated localized queue and requests were load balanced across the container instances. This meant no modifications were made to the container images, which also simplified migrating the code base to Amazon Elastic Container Service (Amazon ECS).

With the initial rehost, AWS benchmark testing showed an initial performance improvement of ~30% from infrastructure changes that addressed CPU and memory utilization. To add further gains, the application needed to be refactored. The team made minor architectural adjustments to the container, using Amazon Simple Queue Service (Amazon SQS) as a global queue, as seen in Figure 1. This change shifted the queues from being attached to each webserver to a single, globally managed queue. This refactoring improved the utilization of each task’s capacity and yielded an additional 16.5x performance gain.

The original architecture before the refactor shows users being assigned round-robin to queues attached to each webserver. The ‘after’ side of the diagram shows the refactored architecture. In this new state, all messages are placed on a single global queue. Containerized calculation engines then run tasks that consume the messages from the global queue.

Figure 1 – Re-host vs. Refactor

To satisfy the reliability and scaling requirements, Amazon API Gateway was used to forward incoming XML requests to Amazon SQS. Once queued in Amazon SQS, the requests were consumed by an Amazon ECS cluster. The compute resources employed a capacity provider to control scaling within the Amazon ECS cluster, with Amazon SQS queue depth as the scaling metric: a deeper queue led to more instances launching (scale out) and a shorter queue led to instances being decommissioned (scale in).
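API Gateway’s forwarding of request payloads into Amazon SQS can be configured without touching application code. The following is a minimal sketch using boto3; the API ID, resource ID, IAM role, and queue name are illustrative assumptions, not the customer’s actual values.

import boto3

apigw = boto3.client("apigateway")

# Hypothetical identifiers for the REST API resource and the role allowed to send to SQS.
REST_API_ID = "a1b2c3d4e5"
RESOURCE_ID = "abc123"   # resource backing POST /jobs
ROLE_ARN = "arn:aws:iam::111122223333:role/apigw-sqs-send-role"
SQS_URI = "arn:aws:apigateway:us-east-1:sqs:path/111122223333/calc-jobs.fifo"

# AWS service integration: forward the incoming XML body to SQS SendMessage.
apigw.put_integration(
    restApiId=REST_API_ID,
    resourceId=RESOURCE_ID,
    httpMethod="POST",
    type="AWS",
    integrationHttpMethod="POST",
    uri=SQS_URI,
    credentials=ROLE_ARN,
    requestParameters={
        "integration.request.header.Content-Type": "'application/x-www-form-urlencoded'"
    },
    requestTemplates={
        # Map the XML payload into an SQS SendMessage call (FIFO queues require a group ID).
        "application/xml": "Action=SendMessage&MessageGroupId=default"
                           "&MessageBody=$util.urlEncode($input.body)"
    },
)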

This diagram depicts the end-state architecture after the first set of meetings with the customer. The diagram starts with a REST API call hitting an Application Load Balancer. The HTTPS requests are sent to the Amazon ECS cluster, which is backed by an EC2 Auto Scaling group. The Amazon ECS cluster retrieves container images from Amazon Elastic Container Registry, and the EC2 instances are built and managed by EC2 Image Builder.

Figure 2 – End state architecture

Rehosting the application container on AWS

Initially, the customer team decided to re-host the application to gain the scalability and multi-Region capabilities of the AWS Cloud, as seen in Figure 2. Because the application was already containerized, this required only a new container image based on an Amazon ECS-optimized Windows Server 2016 base image. The legacy application had dependencies on Windows Server 2016, hence the decision to create a new Amazon ECS-compatible base image.

With the container image in place, the team provisioned an Amazon ECS cluster with auto scaling enabled, based on aggregated memory and CPU utilization across the cluster. Within the cluster, instances were configured to run the following user data PowerShell command on initialization, which automatically joined the provisioned instance to the Amazon ECS cluster.

<powershell>
# Register this instance with the Amazon ECS cluster on first boot
Initialize-ECSAgent -Cluster computeclustername `
    -EnableTaskIAMRole `
    -LoggingDrivers '["json-file","awslogs"]' `
    -EnableTaskENI `
    -AwsvpcBlockIMDS `
    -AwsvpcAdditionalLocalRoutes '["ip-address"]'
</powershell>
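For the initial rehost, cluster capacity scaled on aggregate utilization. Below is a minimal sketch of a CPU target-tracking scaling policy on the cluster’s Auto Scaling group, assuming a hypothetical group name and target value.

import boto3

autoscaling = boto3.client("autoscaling")

# "calc-engine-asg" is a placeholder for the Auto Scaling group backing the ECS cluster.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="calc-engine-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,  # assumed target; tune for the workload
    },
)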

The subsequent step provisioned an Application Load Balancer to distribute incoming traffic across Amazon ECS tasks as they initialized. This was accomplished by using target groups with the Amazon ECS service.
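A minimal sketch of wiring the ECS service to an Application Load Balancer target group follows; the service, task definition, container, and target group names are placeholders.

import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="computeclustername",
    serviceName="calc-engine-service",
    taskDefinition="calc-engine:1",   # assumed task definition family:revision
    desiredCount=2,
    launchType="EC2",
    loadBalancers=[
        {
            # Target group created alongside the Application Load Balancer.
            "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/calc-engine-tg/0123456789abcdef",
            "containerName": "calc-engine",
            "containerPort": 8080,    # assumed container port
        }
    ],
)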

Reducing container start-up time

Once the application was re-hosted on AWS, our tests showed a 1.9x performance improvement, decreasing processing time from 6.6 hours to 3.5 hours. The team identified additional opportunities to make the application more cost-effective to operate and to speed up response time. The AWS team pinpointed an approximately 8-minute delay in the provisioning process for the Amazon Elastic Compute Cloud (Amazon EC2) instances associated with the Amazon ECS cluster. Further examination revealed that this lag stemmed from two factors:

  • The container image was downloaded to Amazon EC2 instances from an on-premises container repository securely via AWS Direct Connect. This introduced greater-than-expected network latency and delays, depending on customer network traffic.
  • The container image size of 12GB resulted in prolonged download times when scaling out to newly provisioned EC2 instances.

To resolve the first issue, the AWS team directed the customer to host container images in Amazon Elastic Container Registry (Amazon ECR) and publish the container image there as part of each build. For more information on pushing a container image to Amazon ECR, see Pushing a Docker image to an Amazon ECR private repository.

To mitigate the second challenge of container size, the AWS team advised the customer to use EC2 Image Builder. This embedded the large container image into the Amazon Machine Image (AMI) used by the Auto Scaling group associated with the Amazon ECS cluster. The customer team added this step to their Jenkins pipeline, so every build of the container image was baked into the AMI via Image Builder. Consequently, when a new instance joins the cluster, the container image no longer needs to be downloaded from Amazon ECR; it is available locally on the instance as soon as it is provisioned.

Using a global queue

Despite the expedited cluster scaling, performance testing results remained stagnant. Upon closer examination, we observed some EC2 instances exhibiting high utilization while others displayed minimal utilization. This pointed to a second flaw in the architecture that needed to be addressed. Incoming workloads were allocated across individual queues within each Amazon ECS task. Containers were obligated to process any traffic they received, while newly launched nodes could not absorb existing load. This resulted in uneven utilization of the new instances after scaling out. The AWS team proposed implementing a global queue and using queue depth as the scaling metric.

The original architecture routed requests to localized queues within each container image, and these queues did not scale dynamically to meet CPU and memory requirements. If two requests, one small and one large, were initiated concurrently, nodes processing the smaller request would complete rapidly and sit idle, while nodes processing the larger request kept running. After scaling up to satisfy demand, the now-empty local queues did not trigger a scale-down, leaving the customer paying for inactive, unused resources. A global queue was instituted to address these issues. The global queue aggregates all incoming requests and disperses them across available containers, while CPU and memory scale dynamically to satisfy demand.

The issue was solved by substituting the Application Load Balancer with Amazon SQS and Amazon API Gateway. This solution required no refactoring of upstream applications and only a 100-line modification to the application code itself. When creating a queue with Amazon SQS, customers can opt for standard or First-In-First-Out (FIFO) queues. Because order mattered for the customer’s use case and each message had to be delivered exactly once, FIFO was chosen; FIFO queues preserve ordering and prevent duplicate processing. Amazon SQS offers short polling (the default) or long polling for receiving messages. Long polling was selected to reduce the number of API calls.
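A minimal sketch of creating such a queue with boto3 follows; the queue name and visibility timeout are assumptions.

import boto3

sqs = boto3.client("sqs")

response = sqs.create_queue(
    QueueName="calc-jobs.fifo",                 # FIFO queue names must end in .fifo
    Attributes={
        "FifoQueue": "true",                    # preserve message ordering
        "ContentBasedDeduplication": "true",    # drop duplicate message bodies
        "ReceiveMessageWaitTimeSeconds": "20",  # long polling by default
        "VisibilityTimeout": "43200",           # assumed: long enough to cover a calculation job
    },
)
queue_url = response["QueueUrl"]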

Amazon API Gateway supports HTTP, REST, and WebSocket APIs. The customer team used API Gateway’s native Amazon SQS integration to relay REST API calls to Amazon SQS with zero code, which avoided refactoring the application. They then created a low-code queue-consumer adapter to move messages from Amazon SQS to the compute engine, replacing the local queues. This work was accomplished within hours.
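The consumer adapter can be as simple as a long-polling loop that hands each message to the calculation engine and deletes it only after the work succeeds. The sketch below assumes a hypothetical process_job function that wraps the existing engine and performs the database write.

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/111122223333/calc-jobs.fifo"  # placeholder

def process_job(xml_body: str) -> None:
    """Hypothetical wrapper around the existing calculation engine."""
    ...

while True:
    # Long poll to reduce empty responses and API call volume.
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
    ).get("Messages", [])

    for message in messages:
        process_job(message["Body"])
        # Delete only after the job (including its database write) completes,
        # so interrupted work is redelivered and retried.
        sqs.delete_message(
            QueueUrl=QUEUE_URL,
            ReceiptHandle=message["ReceiptHandle"],
        )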

Fine-tuning scaling metrics

With scaling and performance resolved, the AWS team observed an unacceptably high error rate. The initial scaling approach relied on aggregated cluster CPU utilization, and the AWS team identified that the majority of errors stemmed from the scale-in process: tasks were terminated without validating that in-process jobs had completed. To solve this, a capacity provider was configured with scale-in protection and the scaling metric was revised. Rather than scaling on cluster CPU utilization, the cluster scaled on queue depth instead.
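A minimal sketch of such a capacity provider configuration follows; the ARNs and names are assumptions, and the underlying Auto Scaling group must also have new-instance scale-in protection enabled.

import boto3

ecs = boto3.client("ecs")

ecs.create_capacity_provider(
    name="calc-engine-cp",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:111122223333:autoScalingGroup:uuid:autoScalingGroupName/calc-engine-asg",
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,  # keep instances as fully utilized as possible
        },
        # Do not terminate instances that are still running tasks during scale-in.
        "managedTerminationProtection": "ENABLED",
    },
)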

When the queue exceeds defined thresholds, the cluster scales up incrementally. Minor queue growth adds a few tasks, while substantial queue depth triggers provisioning of multiple tasks up to the configured ceiling. Queue depth recognizes surges in advance, so the cluster scales out proactively rather than waiting for CPU utilization to climb.
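The following is a hypothetical sketch of queue-depth-driven scale-out for the ECS service, using an Application Auto Scaling step-scaling policy tied to a CloudWatch alarm on the queue’s visible-message count; the thresholds, step sizes, and capacity ceiling are assumptions.

import boto3

aas = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

RESOURCE_ID = "service/computeclustername/calc-engine-service"  # placeholder

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=50,  # assumed ceiling
)

policy = aas.put_scaling_policy(
    PolicyName="scale-out-on-queue-depth",
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 60,
        # Small backlogs add a few tasks; large backlogs add many, up to MaxCapacity.
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 100, "ScalingAdjustment": 2},
            {"MetricIntervalLowerBound": 100, "ScalingAdjustment": 10},
        ],
    },
)

cloudwatch.put_metric_alarm(
    AlarmName="calc-jobs-queue-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "calc-jobs.fifo"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[policy["PolicyARN"]],
)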

With scale-up addressed, queue depth likewise governed scale-down. A sustained zero-depth queue prompts the cluster to scale in, reducing the number of instances. The cluster was configured for cross-Availability Zone (AZ) affinity when placing containers, prioritizing reliability first and bin packing for cost efficiency second. The capacity provider operates with the Auto Scaling group and honors scale-in protection, so with zero queue depth, the Auto Scaling group reduces the EC2 instance count only after tasks complete. The capacity provider decreases tasks per the bin-packing rules and scale-in settings, and once an EC2 instance is free of tasks, termination proceeds.

These scaling policy changes afforded tasks time to complete their computations and then sit idle, which drastically reduced the termination of active computations. While some in-flight computations may still be interrupted, the Amazon SQS messages feeding the tasks are not deleted until the database write is completed, which minimizes the impact of early terminations by allowing the work to be retried. With scale-in removed as a source of errors, these changes drove the transaction error count to near zero in a cost-effective manner.

Cost optimization

With the application fully operational and scaling in production, the AWS team commenced optimizing processes to achieve an ideal cost/performance balance. Appropriate instance types were identified for the workload. Each task required a specific RAM/CPU ratio, so we determined that distinct tasks should reside on separate clusters, with the underlying EC2 instances tailored to the container workloads by selecting from the C, M, or R instance families.

We explored EC2 Instance Savings Plans based on average and peak utilization. This gave the customer a discounted rate for baseline capacity, while paying On-Demand Instance rates only for burst usage.

Finally, fine-tuning the Auto Scaling group metrics was imperative. Originally, CPU was the metric governing cluster scale-up and scale-down. Rigorous load and performance testing revealed CPU to be a reactive metric that scaled too late. Migrating to a global queue made queue depth available as a proactive scaling metric, and catching load earlier reduced overall latency when servicing calculation spikes. Cost reduction was achieved by matching VMs to task size, evaluating appropriate purchasing options, and selecting proactive rather than reactive scaling metrics.

Conclusion

Through use of the EBA mechanism, the team took an incremental value-creation approach by re-hosting, then refactoring, the application on AWS. This helped improve performance and reduce costs. Using Amazon SQS, the customer was able to decouple a tightly coupled workload and scale based on queue depth. The scale-up time was shortened from 8 minutes down to 2 minutes per instance, and scale-in was refined to avoid early termination of unfinished tasks. This overall approach led to a 16.5x reduction in processing time, increased flexibility and efficiency, and a 50% decrease in cost. If you are an existing AWS customer, contact your AWS account team to find out more about EBA programs.


