AWS for Industries
How Georgia-Pacific (a Koch Industries Company) Improved their Resilience Posture at Scale using AWS Resilience Hub
Introduction
Georgia-Pacific embarked on a cloud application modernization and resiliency improvement journey. In this blog post, we look at how Georgia-Pacific transitioned their Disaster Recovery (DR) processes from a largely manual effort to a more automated one, improving their application resiliency at scale in AWS using AWS Resilience Hub.
About Georgia-Pacific
Georgia-Pacific, owned by Koch Industries, is an American wood products, pulp, and paper company based in Atlanta, Georgia. The organization is one of the world’s largest manufacturers and distributors of pulp, towel and tissue paper and dispensers, packaging, and wood and gypsum building products.
About AWS Resilience Hub
AWS Resilience Hub provides customers a central place to define, validate, and track the resilience of their applications on AWS. Customers can use AWS CloudFormation stacks, resource groups, AWS Service Catalog AppRegistry applications, and Terraform state files to describe their application infrastructure in Resilience Hub.
Customers then assess their applications’ resilience against desired Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets. The Resilience Hub assessment uses best practices from the AWS Well-Architected Framework to analyze an application’s components and uncover potential resilience weaknesses. Once an application is assessed, customers receive actionable recommendations, along with recovery procedures in the form of code snippets, to help their applications meet their RTO and RPO targets.
AWS Fault Injection Simulator (FIS), which is integrated with Resilience Hub, can be used to simulate real-world failures and validate that applications can recover within a customer’s defined resilience targets, ensuring that application resilience can be thoroughly tested and verified.
Finally, to track application resilience over time, Resilience Hub provides a resiliency score.
The Challenges of Managing Application Resilience at Scale
Managing application resilience across a large portfolio of evolving applications has become increasingly complex. Today, applications are usually made up of many different resource types and can even span multiple AWS accounts. Ensuring that new and existing applications consistently meet their RPO and RTO targets presents a significant challenge.
The legacy process that Georgia-Pacific (GP) used to prepare and execute a Disaster Recovery (DR) strategy was completely manual. To save time, developers would skip the DR step and deploy infrastructure without review. Additionally, a majority of GP’s existing cloud infrastructure is based on Amazon EC2. The existing set of replication tools supported the legacy DR strategy; however, as GP moved to a serverless microservices architecture, they wanted to update the DR strategy to support their modernization effort. Georgia-Pacific was well aware of these challenges since they had multiple applications across several business units.
Georgia-Pacific’s Legacy Disaster Recovery Strategy
- The first step in GP’s DR process was a technical assessment. In this assessment, app owners would define the components of their application and all of its infrastructure, determine the tier of the application, validate that the app could meet the RPO/RTO requirements for that tier, and validate that the app could meet its Availability Zone (AZ) requirements. Because this process was manual, it was time-consuming and created more opportunity for errors. Developers could miss some of the components of their AWS infrastructure, and then subsequently miss the validation steps.
- Next, a DR plan would be defined. GP created a blueprint of how to recover the application in the event of an incident, which included application recovery steps and infrastructure recovery steps.
- A tabletop was then performed, in which GP would hold a dry run of their DR plan to get ready for a functional test. However, because the assessment and plan contained inconsistencies or errors, the tabletops either ran beyond the required time or had to be rescheduled multiple times.
- Finally, there was a functional test, in which they would take a snapshot of the production environment, move it to an isolated network, and run their DR test there. However, there was no way to test serverless apps there.
New Process with Resilience Hub – Phase I
One of the biggest challenges with the existing process was that it was manual. The team spent a considerable amount of time on DR preparation, execution, and testing. The goal was to reduce the time spent in the DR process. To address these challenges, GP decided to use AWS Resilience Hub. With Resilience Hub, GP was able to create a standardized process to catalog and assess the resilience of their applications on AWS. GP defined the RTO and RPO for their applications within Resilience Hub and continually assessed the resilience of their applications.
GP replaced the manual technical assessment step in their DR life cycle with Resilience Hub, which helps them catalog applications by tier using AWS resource groups, Terraform state files, or AWS CloudFormation stacks. Using Resilience Hub, they now predefine the RTO and RPO of each application tier across all GP accounts. Developers no longer have to guess those requirements; instead, they select them based on the tier of the application. Finally, during the assessment, Resilience Hub provides recommendations on how to resolve the issues it finds, reducing the manual effort that developers spend remediating issues identified during the technical assessment.
Using Resilience Hub, GP has improved the consistency of their technical assessment and DR plan phases. As a result, the team spends less time on the latter two phases, the tabletop and the functional test.
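To make the tier-based approach concrete, here is a minimal Python sketch of a tier catalog that maps each application tier to predefined recovery targets. The tier names and the RTO/RPO values are hypothetical and are not GP's actual targets. The dictionary returned by `policy_for_tier` is modeled on the per-disruption-type shape (`Software`, `Hardware`, `AZ`, `Region`, each with `rtoInSecs` and `rpoInSecs`) that Resilience Hub resiliency policies are understood to use; check the current API reference before relying on it. Applying one target to every disruption type is a simplification for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTargets:
    """Recovery targets for one application tier, in seconds."""
    rto_secs: int
    rpo_secs: int


# Hypothetical tier catalog: names and numbers are illustrative only.
TIER_TARGETS = {
    "tier-1": RecoveryTargets(rto_secs=4 * 3600, rpo_secs=1 * 3600),
    "tier-2": RecoveryTargets(rto_secs=24 * 3600, rpo_secs=4 * 3600),
    "tier-3": RecoveryTargets(rto_secs=72 * 3600, rpo_secs=24 * 3600),
}

# Disruption types a resiliency policy is assumed to cover.
DISRUPTION_TYPES = ("Software", "Hardware", "AZ", "Region")


def policy_for_tier(tier: str) -> dict:
    """Return a policy document for the given tier.

    Developers pick a tier; the targets come from the catalog rather
    than from guesswork.
    """
    if tier not in TIER_TARGETS:
        raise ValueError(f"unknown tier: {tier!r}")
    targets = TIER_TARGETS[tier]
    return {
        disruption: {
            "rtoInSecs": targets.rto_secs,
            "rpoInSecs": targets.rpo_secs,
        }
        for disruption in DISRUPTION_TYPES
    }
```

With a catalog like this, a developer only declares the tier of an application, and the matching targets follow from the lookup.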
New Process with Resilience Hub – Phase II
In an effort to automate deployment and move toward a DevOps practice, GP added Resilience Hub to their pre- and post-deployment steps to make sure that all application deployments are compliant and can meet the RTO/RPO requirements of their designated tiers.
The infrastructure team now creates a resilience report showing the resilience scores, which they share with leadership to confirm compliance. Finally, the team has started integrating AWS Fault Injection Simulator into their environment to help facilitate serverless application testing.
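A deployment gate of this kind could look roughly like the following Python sketch. It is not GP's pipeline code. The function accepts any client object that exposes `start_app_assessment` and `describe_app_assessment`, which are the method names on the boto3 `resiliencehub` client; the parameter names, the `"release"` app version, and the response fields (`assessmentStatus`, `complianceStatus`) are written as I understand that API and should be verified against the boto3 documentation. The gate name prefix, polling interval, and retry limit are arbitrary assumptions.

```python
import time


class ResilienceGateError(RuntimeError):
    """Raised when a deployment should be blocked by the resilience gate."""


TERMINAL_STATUSES = ("Success", "Failed")


def run_resilience_gate(client, app_arn, app_version="release",
                        poll_secs=15.0, max_polls=40):
    """Start an assessment, wait for it to finish, and enforce the policy.

    `client` is assumed to behave like boto3.client("resiliencehub").
    Returns the final assessment dict when the policy is met.
    """
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion=app_version,
        assessmentName=f"gate-{int(time.time())}",  # hypothetical naming scheme
    )
    assessment = started["assessment"]
    assessment_arn = assessment["assessmentArn"]

    polls = 0
    while assessment["assessmentStatus"] not in TERMINAL_STATUSES:
        if polls >= max_polls:
            raise ResilienceGateError("assessment did not finish in time")
        time.sleep(poll_secs)
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]
        polls += 1

    if assessment["assessmentStatus"] != "Success":
        raise ResilienceGateError("assessment failed to run")
    if assessment.get("complianceStatus") != "PolicyMet":
        raise ResilienceGateError("application breaches its resiliency policy")
    return assessment
```

A pipeline step could call `run_resilience_gate` before promoting a build and again after deployment, failing the step whenever `ResilienceGateError` is raised.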
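As an illustration of what such a report might contain, the Python sketch below renders a plain-text summary from a list of per-application results. The `AppResult` record, its field names, the 0 to 100 score scale, and the report layout are all hypothetical; the post does not describe the format GP actually uses.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AppResult:
    """One application's latest assessment outcome (hypothetical shape)."""
    name: str
    tier: str
    score: float       # assumed 0-100 resiliency score
    policy_met: bool   # True when the app meets its tier's RTO/RPO policy


def build_report(results: Iterable[AppResult]) -> str:
    """Render a text report, listing the lowest-scoring applications first."""
    rows = sorted(results, key=lambda r: r.score)
    lines = [f"{'Application':<20}{'Tier':<10}{'Score':>6}  Status"]
    for r in rows:
        status = "compliant" if r.policy_met else "BREACHED"
        lines.append(f"{r.name:<20}{r.tier:<10}{r.score:>6.1f}  {status}")
    met = sum(1 for r in rows if r.policy_met)
    lines.append(f"{met}/{len(rows)} applications meet their policy")
    return "\n".join(lines)
```

Sorting by score puts the applications that most need attention at the top, which suits a summary aimed at leadership.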
Results
Resilience Hub has reduced the time spent on the technical assessment phase by about 94%. On average, the manual assessment process took about 90 minutes; with Resilience Hub, the assessment time was cut to about 5 minutes. The time spent on the overall DR process, including preparation, the technical assessment, setting up the DR plan, and the tabletop, has come down by 50%.
Finally, Georgia-Pacific was also able to improve their DR compliance. They can now detect compliance issues in their environment immediately after the technical assessment or during the functional test, rather than waiting for their annual DR review. Because GP can run frequent assessments, they are able to catch changes to their applications and make sure that they are still adhering to their resiliency policies.
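The 94% figure follows directly from the two assessment durations quoted above. A short Python check of the arithmetic (the helper name `percent_reduction` is ours, not from any library):

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Manual assessment: about 90 minutes; with Resilience Hub: about 5 minutes.
assessment_cut = percent_reduction(90, 5)
print(f"{assessment_cut:.1f}%")  # prints "94.4%"
```

In other words, 85 of the original 90 minutes were eliminated, which rounds to the 94% reported.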