Unleashing the Power of the Cloud with the AWS Cloud Value Framework (CVF) – Operational Resilience (4/7)

Pillar 3: Operational Resilience

Introduction

This blog is part of a series on the AWS Cloud Value Framework (CVF). The CVF is a comprehensive guide that helps businesses evaluate, quantify, and communicate the value of AWS Cloud adoption. It comprises five pillars, and this blog focuses on Pillar 3: Operational Resilience.

In this blog, we will explain:

  • The Operational Resilience pillar
  • The benefits that organizations have been able to achieve
  • How organizations were able to use AWS to achieve these benefits
  • Examples and case studies
  • How to demonstrate the value for your own migration

Pillar 3: Operational Resilience

Operational Resilience refers to an organization’s ability to continue operating through disruptive events. These can result from a cyber attack, natural disaster, human error, device failure, and other unexpected events. Maintaining Operational Resilience is crucial to avoid business disruption, loss of revenue, and damage to reputation. A study published by The Hackett Group, The Business Value of Migration to Amazon Web Services, showed that organizations improve their Operational Resilience across several Key Performance Indicators:

| Key performance indicator | Overall respondents: before migration | Overall respondents: after migration | Overall respondents: % change | Top performers: before migration | Top performers: after migration | Top performers: % change |
|---|---|---|---|---|---|---|
| Security-related incidents per month* | 3.1 | 1.7 | -45% | 1.4 | 0.5 | -64% |
| Mean time to detect security incidents (minutes) | 156.2 | 94.6 | -39% | 85.7 | 39.8 | -54% |
| Critical infrastructure-related incidents per month* | 1.4 | 0.7 | -50% | 0.5 | 0.2 | -60% |
| Unplanned outages in 12-month period* | 1.3 | 0.6 | -54% | 0.4 | 0.1 | -75% |
| Unplanned downtime hours in 12-month period | 40.0 | 12.5 | -69% | 10.0 | 2.0 | -80% |
| Percentage of infrastructure SLAs consistently met | 65% | 80% | +23% | 80% | 91% | +14% |

Table 1: Resiliency improvements achieved from migrating to AWS. Source: The Hackett Group, The Business Value of Migration to Amazon Web Services

*Per 1,000 connected devices
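To make the downtime row of Table 1 more concrete, the sketch below recomputes its percentage change and converts annual downtime hours into an availability percentage. It assumes a workload that is expected to run continuously, so a year is taken as 8,760 hours; the function names are illustrative and do not come from the study.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; assumes the workload should run around the clock


def percent_change(before: float, after: float) -> float:
    """Relative change from the 'before' value to the 'after' value, in percent."""
    return (after - before) / before * 100


def availability_percent(downtime_hours: float) -> float:
    """Share of the year a service was up, given its annual unplanned downtime."""
    return (1 - downtime_hours / HOURS_PER_YEAR) * 100


# Unplanned downtime hours in a 12-month period (Table 1, overall respondents)
before_hours, after_hours = 40.0, 12.5

print(f"Change in downtime:  {percent_change(before_hours, after_hours):.0f}%")
print(f"Availability before: {availability_percent(before_hours):.2f}%")
print(f"Availability after:  {availability_percent(after_hours):.2f}%")
```

Under that assumption, 40 hours of downtime corresponds to roughly 99.54% availability and 12.5 hours to roughly 99.86%, which can be easier to compare against an uptime SLA than raw hours.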

Organizations have been able to improve their Operational Resilience through:

  • Redundancy and High Availability – AWS offers multiple Availability Zones within each Region, which helps organizations distribute their workloads across multiple data centers. If one Availability Zone experiences a fault, an application spread across several zones remains available, which reduces the risk of downtime and improves resilience. The reliability of AWS was a key factor in S&P Global’s decision to move from on-premises to the cloud. Because it must comply with Securities and Exchange Commission (SEC) regulatory frameworks for monitoring financial services, the company also sought to improve the stability of its IT environment. “We had an aging data center, with system outages becoming more frequent,” Wang says. “For our business, downtime means we face substantial penalties from the SEC.” When migrating to the cloud, review the business requirements for uptime and availability. One benefit of the cloud is that business-critical applications can span multiple Availability Zones, but depending on your requirements, some applications may be suitable to run in a single Availability Zone. Multi-Region architectures can also be deployed; refer to the AWS Multi-Region Fundamentals whitepaper for further considerations. We often find on-premises environments where a multi-data center design is configured for all applications. However, this may not align with business requirements and may increase operational costs. The availability of a workload must be aligned with business needs and criticality. Refer to the Reliability pillar of the AWS Well-Architected Framework for further details on how to get started.
  • Auto Scaling – Organizations use AWS Auto Scaling to monitor their applications and automatically adjust capacity to maintain steady, predictable performance at the lowest possible cost. If you’re already using Amazon EC2 Auto Scaling to dynamically scale your Amazon EC2 instances, you can now combine it with AWS Auto Scaling to scale additional resources for other AWS services.
  • Monitoring and Alerting – Amazon CloudWatch is a monitoring service that provides real-time visibility into AWS resources and applications. CloudWatch can collect and track metrics, monitor log files, and set alarms to trigger notifications when specific conditions are met.
  • Automation – According to the Uptime Institute, human error is one of the leading causes of downtime. With a modern cloud architecture, organizations are using automation to reduce or completely remove human effort. AWS CloudFormation helps you model and set up your AWS resources with an executable template. The template describes all the AWS resources and configurations in code, such as Amazon EC2 instances or Amazon Relational Database Service (Amazon RDS) DB instances.
  • Disaster Recovery – Organizations like Thomson Reuters have improved their disaster recovery capability by migrating to AWS. Organizations use AWS Elastic Disaster Recovery to replicate their entire IT infrastructure to the AWS Cloud, which provides scalable, cost-effective application recovery to AWS. Using AWS for DR can also help minimize the investment needed to implement a disaster recovery solution. Most importantly, you can tailor the solution to align with your business requirements (refer to Figure 2). Disaster recovery strategies on AWS can be broadly categorized into four approaches. These approaches range from the low cost and low complexity of making backups to more complex strategies, which are higher in cost but have a lower Recovery Point Objective (RPO) / Recovery Time Objective (RTO).

    The image shows DR strategies on AWS, consisting of four approaches:
      • Backup and Restore – RPO / RTO: hours; lower-priority use cases; provision all AWS resources after event; restore backups after event; cost: $
      • Pilot light – RPO / RTO: 10s of minutes; data live, services idle; provision some AWS resources and scale after event; cost: $$
      • Warm standby – RPO / RTO: minutes; always running, but smaller; business critical; scale AWS resources after event; cost: $$$
      • Active / Active – RPO / RTO: real-time; zero downtime, near-zero data loss; mission-critical services; cost: $$$$

    Figure 2: Disaster Recovery strategies on AWS

  • AWS Resilience Hub – This hub is a central location in the AWS Management Console for you to manage and improve the resilience posture of your applications on AWS. It enables you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework. Within Resilience Hub, you can also create and run AWS Fault Injection Service (AWS FIS) experiments, which mimic real-life disruptions to your application to help you better understand dependencies and uncover potential weaknesses.
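The multi-AZ, Auto Scaling, monitoring, and automation points above can be combined in a single AWS CloudFormation template. The following is a minimal sketch, written as a Python dictionary and serialized to JSON, rather than a production-ready template. The logical IDs (WebGroup, AlertTopic, and so on), the parameter names, the group sizes, and the CPU thresholds are all illustrative assumptions, and the template has not been validated against a live account.

```python
import json

# All logical IDs, parameter names, sizes, and thresholds are illustrative assumptions.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Sketch: Auto Scaling group across several Availability Zones with a CPU alarm",
    "Parameters": {
        "SubnetIds": {
            "Type": "List<AWS::EC2::Subnet::Id>",
            "Description": "Subnets located in at least two Availability Zones",
        },
        "LaunchTemplateId": {"Type": "String"},
        "LaunchTemplateVersion": {"Type": "String", "Default": "1"},
    },
    "Resources": {
        # Notification target for the alarm
        "AlertTopic": {"Type": "AWS::SNS::Topic"},
        # Instances are spread across the subnets (and so the zones) passed in
        "WebGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "MinSize": "2",
                "MaxSize": "6",
                "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                "LaunchTemplate": {
                    "LaunchTemplateId": {"Ref": "LaunchTemplateId"},
                    "Version": {"Ref": "LaunchTemplateVersion"},
                },
            },
        },
        # Adds or removes instances to hold average CPU near the target value
        "CpuTargetTracking": {
            "Type": "AWS::AutoScaling::ScalingPolicy",
            "Properties": {
                "AutoScalingGroupName": {"Ref": "WebGroup"},
                "PolicyType": "TargetTrackingScaling",
                "TargetTrackingConfiguration": {
                    "PredefinedMetricSpecification": {
                        "PredefinedMetricType": "ASGAverageCPUUtilization"
                    },
                    "TargetValue": 50.0,
                },
            },
        },
        # Notifies the topic when CPU stays high for two 5-minute periods
        "HighCpuAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [
                    {"Name": "AutoScalingGroupName", "Value": {"Ref": "WebGroup"}}
                ],
                "Statistic": "Average",
                "Period": 300,
                "EvaluationPeriods": 2,
                "Threshold": 80,
                "ComparisonOperator": "GreaterThanThreshold",
                "AlarmActions": [{"Ref": "AlertTopic"}],
            },
        },
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

Keeping the infrastructure definition in code like this means the same environment can be recreated without manual steps, which is the property the Automation point relies on to reduce human error.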

Using these and other capabilities, organizations can minimize downtime, mitigate risks, and maintain nearly continuous operations despite disruptions.
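The trade-off between cost and recovery time in Figure 2 can be sketched as a simple lookup that picks the lowest-cost strategy meeting a target RTO. Figure 2 only gives orders of magnitude (hours, 10s of minutes, minutes, real-time), so the minute values below are illustrative assumptions rather than AWS figures, and the function name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrStrategy:
    name: str
    typical_rto_minutes: float  # illustrative assumption, not an AWS figure
    relative_cost: int          # number of "$" signs shown in Figure 2


STRATEGIES = [
    DrStrategy("Backup and Restore", 240, 1),  # "hours"
    DrStrategy("Pilot light", 30, 2),          # "10s of minutes"
    DrStrategy("Warm standby", 5, 3),          # "minutes"
    DrStrategy("Active / Active", 0, 4),       # "real-time"
]


def cheapest_strategy(target_rto_minutes: float) -> DrStrategy:
    """Return the lowest-cost strategy whose assumed RTO meets the target."""
    if target_rto_minutes < 0:
        raise ValueError("target RTO cannot be negative")
    candidates = [s for s in STRATEGIES if s.typical_rto_minutes <= target_rto_minutes]
    return min(candidates, key=lambda s: s.relative_cost)


# Example targets: 8 hours, 1 hour, 10 minutes, 1 minute
for target in (480, 60, 10, 1):
    print(f"Target RTO {target} min: {cheapest_strategy(target).name}")
```

In practice the decision also depends on RPO, data volume, and regulatory constraints, so a lookup like this is a starting point for the conversation with the business rather than a sizing tool.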

Demonstrating Business Value of Operational Resilience

Calculating the operational resilience benefits with AWS involves assessing factors such as reduced downtime, enhanced security, and streamlined operations. You’ll want to quantify these benefits in terms of cost savings, efficiency gains, and risk mitigation to determine the overall impact on your organization’s resilience. This often requires analyzing historical data, conducting risk assessments, and considering how using the cloud will help reduce disruption to your business.

It’s important to try to estimate the cost of downtime to your business. A recent Fortune 1000 survey conducted by IDC revealed that the average cost of an infrastructure failure is US$100,000 per hour. The cost of downtime will vary by the size of your organization, industry, business model, and applications impacted. When building out the cost of downtime, you should consider:

  • Loss of revenue
  • Reputational damage
  • Customer churn
  • End user productivity
  • Operational team effort to resolve incidents/downtime
  • Regulatory impact – fines, reporting
  • Downstream / Upstream supply chain disruptions
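One way to turn the factors above into a single figure is to estimate an hourly impact for each and add them up. The dollar amounts in this sketch are hypothetical placeholders, chosen only so that the total matches the $20,000 per hour used for illustration in Table 2; replace them with estimates from your own business.

```python
# Hypothetical hourly impact per factor, in US dollars.
hourly_impact = {
    "loss_of_revenue": 9_000,
    "reputational_damage": 1_500,
    "customer_churn": 2_000,
    "end_user_productivity": 3_500,
    "operational_team_effort": 1_500,
    "regulatory_impact": 1_000,
    "supply_chain_disruption": 1_500,
}

cost_per_hour = sum(hourly_impact.values())
cost_per_minute = cost_per_hour / 60

print(f"Estimated cost of downtime: ${cost_per_hour:,} per hour")
print(f"Estimated cost of downtime: ${cost_per_minute:,.2f} per minute")
```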

Considering these factors, you can estimate the cost per minute or hour of downtime. To develop an initial view of the benefit, you can use the Unplanned Downtime KPI from the Hackett Group study as a benchmark. The following table shows a way to calculate the benefit, and Figure 3 shows how to visualize the Business Value of Operational Resilience.

| | On-Premises * | AWS * |
|---|---|---|
| Annual downtime | 30 hours | 9.3 hours (estimated, including 69% reduction) |
| Impact of downtime | $20,000 per hour | $20,000 per hour |
| Impact per year | $600,000 | $186,000 |
| Impact over 5 years | $3,000,000 | $930,000 |
| 5-year benefit | | $1,035,000 |

Table 2: Cost of Downtime comparison for on-premises and AWS

* Figures are for illustrative purposes only
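The calculation behind Table 2 can be reproduced with a few lines of arithmetic. The sketch below uses the table's illustrative inputs (30 hours of annual downtime, $20,000 per hour, a 69% reduction, and a 5-year horizon); the function name is hypothetical. Note that subtracting the two 5-year totals as printed gives $2,070,000, which is twice the 5-year benefit shown in the table, so it is worth rerunning the calculation with your own inputs rather than reusing the printed result.

```python
def downtime_cost(hours_per_year: float, cost_per_hour: float, years: int = 5) -> float:
    """Total cost of unplanned downtime over the given number of years."""
    return hours_per_year * cost_per_hour * years


# Illustrative inputs taken from Table 2
on_prem_hours = 30.0
cost_per_hour = 20_000.0
reduction = 0.69  # unplanned downtime reduction from the Hackett Group study

aws_hours = on_prem_hours * (1 - reduction)

on_prem_total = downtime_cost(on_prem_hours, cost_per_hour)
aws_total = downtime_cost(aws_hours, cost_per_hour)
benefit = on_prem_total - aws_total

print(f"Estimated AWS downtime: {aws_hours:.1f} hours per year")
print(f"On-premises, 5 years:   ${on_prem_total:,.0f}")
print(f"AWS, 5 years:           ${aws_total:,.0f}")
print(f"5-year benefit:         ${benefit:,.0f}")
```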

The image is a bar chart comparing the cost of downtime over 5 years for on-premises and AWS, using the data from Table 2. The chart summarizes the potential value of $1.04 million.

Figure 3: Cost of Downtime comparison for on-premises and AWS

Additional Support

For additional guidance in demonstrating the Business Value, refer to the Cloud Economics Center and contact your AWS account representative for a complimentary Cloud Economics assessment.