Migration & Modernization
Unleashing Innovation: A Retrospective on The New York Times’ Mainframe to AWS Migration
This article co-authored with: Padmanabha Rao Chillara (The New York Times)
Executive Summary
Amazon Web Services recommends an incremental journey to the cloud: leverage fast-paced, tool-based modernization approaches to migrate from the mainframe to AWS, then incrementally optimize workloads over time. The New York Times (The Times) selected an automated refactoring approach to migrate their COBOL-based home delivery platform, Aristo, to Java on AWS, as documented in the post Automated Refactoring of a New York Times Mainframe to AWS with Modern Systems. But what happened next? This blog post explores the evolution and optimization of Aristo since the migration, accomplishments versus proposed improvements, the transition of staff skills to the cloud, and ongoing and future modernization plans.
Introduction
The Times is dedicated to helping people understand the world through on-the-ground, expert, and deeply reported independent journalism. At its core, their strategy is designed to make The Times an essential daily habit for many millions of people. They are focused on continuing to make their journalism and lifestyle products so valuable at scale that people seek them out directly and build enduring daily habits.
For many years, The Times leveraged an IBM Z mainframe to run a key application named CIS, which offered business-critical functionality including billing, invoicing, customer account management, delivery routing, a product catalog, pricing, and financial reporting. CIS was expensive to operate in comparison to the more modern platforms that evolved within the organization, and it needed modernization to reduce operating costs and enable the convergence of the digital platform with the home delivery platform.
In 2015, The Times chose an automated conversion approach to retain the core application functionality and critical business logic. They collaborated with Modern Systems, an AWS Partner Network (APN) Select Technology Partner, to transform their legacy COBOL-based CIS application into a modern Java-based application that today runs on the AWS Cloud. The converted application was rebranded as Aristo and went into production on AWS in March 2018.
What’s happened since then?
In the first year after the launch, Aristo successfully billed over half a billion dollars in subscription revenue and processed nearly 6.5 million transactions, and it continues to route the daily newspaper to subscribers across the United States.
Current architecture
The Aristo high-level architecture is depicted in Figure 1. Aristo serves as the “home delivery” data source, storing data related to home delivery subscriptions, transactions, and service requests. Client systems call an API Management Platform to get the data they require. External vendors access Aristo via a front-end UI to perform fulfillment updates, and the results are stored in the Aristo database. Internal systems send batch feeds to update the subscription data.
Aristo uses a batch scheduler to run jobs that process the data received via APIs, the front-end UI, and batch processes. It generates batch feeds that supply downstream systems with subscription data, product and delivery details, and metadata used to deliver the paper to the customer. Examples of downstream systems include billing, payment processing, reporting, and fulfillment.
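The post doesn't name the scheduler Aristo uses, so the following is only an illustrative sketch: a nightly feed-generation job wired up with the open source Quartz scheduler. The job class, group names, cron expression, and feed logic are all hypothetical.

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class FeedScheduler {

    // Hypothetical job: reads subscription data and writes a feed file
    // for a downstream system (billing, fulfillment, and so on).
    public static class SubscriptionFeedJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("Generating subscription feed for downstream systems...");
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(SubscriptionFeedJob.class)
                .withIdentity("subscriptionFeed", "aristo-batch")
                .build();

        // Run nightly at 2:00 AM (an assumed window, not The Times's schedule).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightlyTrigger", "aristo-batch")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```

A production schedule would space jobs around traffic peaks and other critical jobs, a theme the team returns to under operational maturity later in this post.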
Before the migration, clients logged in via the front-end CIS to create new orders and service requests. Later, the team enhanced the application by introducing web services. Once all the web service APIs were developed, The Times required clients to start using the APIs via a middleware application instead of CIS Online Screens (Figure 2).
To build features outside of the Aristo code base, the team developed an Internal Sidecar Application (INK Services). INK Services were developed using REST instead of Aristo's traditional SOAP calls, which were difficult to manage and maintain, making it much easier for clients to build new features quickly. In this architecture, new clients call INK Services, which can access the Aristo database directly.
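The post doesn't describe how INK Services are implemented, so the snippet below is a minimal sketch only, assuming a Spring Boot REST controller that reads directly from the Aristo database via JdbcTemplate. The endpoint path, table, and column names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical sidecar endpoint that reads subscription data
// straight from the Aristo database, bypassing the SOAP layer.
@RestController
@RequestMapping("/ink/v1")
public class SubscriptionController {

    private final JdbcTemplate jdbc;

    public SubscriptionController(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @GetMapping("/subscriptions/{accountId}")
    public List<Map<String, Object>> getSubscriptions(@PathVariable String accountId) {
        // Table and column names are illustrative, not Aristo's actual schema.
        return jdbc.queryForList(
                "SELECT subscription_id, product_code, status, start_date "
                + "FROM subscriptions WHERE account_id = ?",
                accountId);
    }
}
```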
Before the migration, CIS integrated with an ETL tool to transform metadata for downstream systems. After the migration, The Times enhanced the process by expanding the specific fields that held compressed data and posting the feeds directly to downstream systems, eliminating the ETL processing.
Great Results
The move to AWS and a modernized architecture challenged myths that concern organizations migrating away from their mainframe/midrange servers (Figure 3).
- Myth: You can’t reach the same scalability and performance as a mainframe.
- Myth: You can’t get the same reliability without the mainframe.
- Myth: Staff retention will be a challenge if you migrate.
Myth: You can’t reach the same scalability and performance as a mainframe – Busted!
The mainframe handled the performance requirements of The Times’s applications well. When planning the migration, The Times determined that AWS resources would need to match or exceed the performance capabilities of the mainframe. They met those requirements, gained better visibility into performance metrics, and boosted their scalability, all while keeping costs optimized.
The team observed enhanced performance in Aristo’s batch jobs by using preload steps in the jobs. The preload process takes data from database tables and loads it into cache memory. That cache is then used for job operations, and at the end of the job, changes are committed to the actual database. This drastically increased job performance: for example, a “Home Delivery Liability” job that used to run for more than 20 hours now completes in 4 hours.
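As a minimal sketch of that preload pattern (the table name, environment variable, and liability calculation are hypothetical): read the working set into an in-memory map once, run the job's operations against the cache, and commit the changes back to the database in a single batch at the end.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

// Sketch of the preload pattern: read a table into memory once, do all job
// computation against the cache, then write back in one batch at the end.
public class LiabilityJobSketch {

    public static void main(String[] args) throws Exception {
        // ARISTO_DB_URL is a hypothetical connection string.
        try (Connection conn = DriverManager.getConnection(
                System.getenv("ARISTO_DB_URL"))) {
            conn.setAutoCommit(false);

            // Preload step: pull the working set into an in-memory cache.
            Map<String, Double> liabilityByAccount = new HashMap<>();
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT account_id, liability FROM hd_liability")) {
                while (rs.next()) {
                    liabilityByAccount.put(rs.getString(1), rs.getDouble(2));
                }
            }

            // Job operations run against the cache, not the database.
            liabilityByAccount.replaceAll((account, amount) -> amount * 1.0); // placeholder calculation

            // Commit step: write results back in a single batch at the end.
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE hd_liability SET liability = ? WHERE account_id = ?")) {
                for (Map.Entry<String, Double> e : liabilityByAccount.entrySet()) {
                    ps.setDouble(1, e.getValue());
                    ps.setString(2, e.getKey());
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}
```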
Myth: You can’t get the same reliability without the mainframe – Busted!
The Times’s mainframe was considered to be extremely reliable with very little downtime. Any solution moving away from the mainframe needed to match that.
After reviewing availability metrics over time, The Times confirmed that service availability was maintained after the migration.
Myth: Staff retention will be a challenge if you migrate – Busted!
The Times was able to retain the staff that supported Aristo on the mainframe. They were trained in Java for future software development and in AWS to help them support the system in its new environment. The Times strives to employ “T-shaped” engineers: technology professionals with a broad base of understanding in several areas along with a particular area of deep expertise. This helped the team support the Aristo monolithic application and allowed them to develop the new features that the business required.
Some Aristo team members were moved to other teams to begin building microservices that will eventually replace some of the existing features of Aristo. Deep knowledge of the Aristo architecture and the business that it supports proved to be of great value in this ongoing effort.
Additional Wins
The Times realized additional wins:
- Improved integration with other internal and third-party applications
- Increased observability
Improved integration
Integrating new applications with Aristo on AWS is easier for The Times than it was while operating the mainframe. Previously, integrations with non-mainframe upstream and downstream applications were achieved through a combination of screen-scraping techniques and ETL jobs that transformed data from Aristo and loaded it into other applications. Prior to the migration, subscription account numbers and amount details were stored compressed due to space and memory constraints, and ETL was used to expand those fields.
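The post doesn't specify the compression scheme, but mainframe records commonly store numbers as COBOL packed decimal (COMP-3), so the following is a hedged illustration, under that assumption, of what "expanding" such a field in Java can look like.

```java
// The post doesn't say how the fields were compressed; COBOL packed decimal
// (COMP-3) is a common mainframe encoding, used here purely as an illustration.
public class PackedDecimal {

    // Each byte holds two BCD digits; the final low nibble is the sign
    // (0xC or 0xF = positive, 0xD = negative).
    public static long unpack(byte[] packed) {
        long value = 0;
        for (int i = 0; i < packed.length; i++) {
            int high = (packed[i] >> 4) & 0x0F;
            int low = packed[i] & 0x0F;
            value = value * 10 + high;
            if (i < packed.length - 1) {
                value = value * 10 + low;
            } else {
                return (low == 0x0D) ? -value : value; // sign nibble
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Bytes 0x12 0x34 0x5C unpack to +12345.
        System.out.println(unpack(new byte[] {0x12, 0x34, 0x5C}));
    }
}
```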
Once in AWS, dependent teams began to create microservices to fetch data from the database directly. They discovered that spinning up new microservices was less expensive and faster than developing on and integrating with the mainframe.
The Times also removed a dependency (see star in Figure 4) by taking their billing invoice delivery process in-house. They developed an AWS application for this, further easing integrations with internal applications.
Increased observability leveraging Sumo Logic and Datadog
The Times’s previous observability tooling worked, but it reported only on the mainframe itself, and visualizing the data in dashboards was not possible. In AWS, the team’s observability was significantly enhanced with dashboards and other capabilities from third-party tools (Figure 5).
- Built a Datadog dashboard to monitor Aristo API requests and error rates.
- Shipped API logs, front-end UI logs, and batch job logs to Sumo Logic, and created Sumo queries for alerting and monitoring (see the logging sketch after this list).
- Created an SLA dashboard for the batch jobs.
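The post doesn't show the actual Sumo Logic queries or log formats, but one common pattern is to emit key=value log lines that a log aggregator can parse for dashboards and alerts. A minimal sketch with hypothetical field names:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of key=value API logging that a log aggregator such as Sumo Logic
// can parse for alerting (for example, alerting when 5xx counts spike).
public class ApiRequestLogger {

    private static final Logger log = LoggerFactory.getLogger(ApiRequestLogger.class);

    public static void logRequest(String endpoint, int status, long latencyMs) {
        // Field names are illustrative, not The Times's actual log schema.
        log.info("event=api_request endpoint={} status={} latency_ms={}",
                endpoint, status, latencyMs);
    }

    public static void main(String[] args) {
        logRequest("/ink/v1/subscriptions", 200, 42);
        logRequest("/ink/v1/subscriptions", 503, 1950);
    }
}
```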
Challenges
In 2020, there were signs that challenges could surface if additional optimizations did not take place. The areas that required further work and optimization were:
- File system storage retention optimizations.
- Database optimizations: fragmentation, size, consistency.
- Maintainability.
- Operational maturity.
File system storage retention optimizations
The mainframe ran many batch processes that produced multiple versions of files that were megabytes in size. To address this, the team designed new mechanisms for limiting the growth of non-essential files: reducing their retention periods, moving data to long-term archival storage, and cleaning up files that no longer served functional needs.
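As a sketch of such a retention mechanism (the directory, retention period, and the decision to delete rather than archive are all assumptions), a sweep over batch output files might look like this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.stream.Stream;

// Sketch of a retention sweep: delete non-essential batch output files
// older than the retention period. Path and period are hypothetical.
public class RetentionSweep {

    public static void main(String[] args) throws IOException {
        Path outputDir = Paths.get("/var/aristo/batch-output");
        Instant cutoff = Instant.now().minus(90, ChronoUnit.DAYS);

        try (Stream<Path> files = Files.list(outputDir)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> isOlderThan(p, cutoff))
                 .forEach(RetentionSweep::delete);
        }
    }

    private static boolean isOlderThan(Path p, Instant cutoff) {
        try {
            FileTime modified = Files.getLastModifiedTime(p);
            return modified.toInstant().isBefore(cutoff);
        } catch (IOException e) {
            return false; // skip files we cannot stat
        }
    }

    private static void delete(Path p) {
        try {
            Files.delete(p);
        } catch (IOException e) {
            System.err.println("Could not delete " + p + ": " + e.getMessage());
        }
    }
}
```

A real sweep would also implement the archival path the team describes, moving still-needed data to long-term storage before deletion.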
Database optimizations: fragmentation, size, consistency
The migration of mainframe data to a relational database (Oracle) resulted in data fragmentation, which slowed down system operations over time. This was not critical initially, but projections showed that the impact could become significant over time. In response, the team built a defragmentation tool: a set of batch jobs that perform defragmentation using Oracle’s native APIs during system downtime.
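The post doesn't name the Oracle APIs used. One common native approach is ALTER TABLE ... MOVE, which rewrites a table into contiguous storage and then requires its indexes to be rebuilt. A downtime job issuing those statements over JDBC might look like the following sketch (table and index names are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

// Sketch of a downtime defragmentation pass. ALTER TABLE ... MOVE rewrites
// the table into contiguous storage, which invalidates its indexes, so each
// index must be rebuilt afterward. Names and connection URL are hypothetical.
public class DefragJob {

    public static void main(String[] args) throws Exception {
        List<String> tables = List.of("SUBSCRIPTIONS", "SERVICE_REQUESTS");

        try (Connection conn = DriverManager.getConnection(
                System.getenv("ARISTO_DB_URL"));
             Statement stmt = conn.createStatement()) {

            for (String table : tables) {
                stmt.execute("ALTER TABLE " + table + " MOVE");
                // Rebuild each index invalidated by the move.
                stmt.execute("ALTER INDEX " + table + "_PK REBUILD");
            }
        }
    }
}
```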
Similar to the file system issue, database size kept increasing and there were concerns that future performance issues could surface. The team designed and implemented a set of tools to trim and clean up unnecessary data based on a set of functional retention policies carefully designed for its customers.
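A minimal sketch of such a trim job, assuming an Oracle database and a hypothetical seven-year retention policy on closed service requests, deletes expired rows in small batches so locks and undo usage stay bounded:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch of a retention-policy trim: delete expired rows in small batches
// so locks and undo stay bounded. Table, column, and policy are hypothetical.
public class DataTrimJob {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                System.getenv("ARISTO_DB_URL"))) {
            conn.setAutoCommit(false);

            // Oracle syntax: delete at most 10,000 expired rows per pass.
            String sql = "DELETE FROM service_requests "
                       + "WHERE closed_date < ADD_MONTHS(SYSDATE, -84) "
                       + "AND ROWNUM <= 10000";

            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int deleted;
                do {
                    deleted = ps.executeUpdate();
                    conn.commit(); // commit each batch to release locks
                } while (deleted > 0);
            }
        }
    }
}
```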
The database schema design was largely a port of the mainframe file schema. As such, it didn’t follow modern schema normalization practices. This led to consistency issues when there were competing locks from on-demand traffic (e.g., API calls) and scheduled or triggered batch processes. While not critical, this required optimization of the transactional procedures at the code level, thoughtful scheduling of specific processes, and better monitoring and operational readiness.
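One common code-level optimization for this kind of contention is to retry a transaction that loses a lock race with a batch process. The sketch below retries on Oracle's deadlock (ORA-00060) and resource-busy (ORA-00054) errors; the update statement and backoff policy are hypothetical, not The Times's actual code.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Sketch of a code-level mitigation: retry a transaction that loses a lock
// race with a batch job. Error codes 60 (deadlock) and 54 (resource busy)
// are Oracle's; the update statement itself is hypothetical.
public class RetryingWriter {

    private static final int MAX_ATTEMPTS = 3;

    public static void updateWithRetry(DataSource ds, String accountId,
                                       String status) throws SQLException {
        for (int attempt = 1; ; attempt++) {
            try (Connection conn = ds.getConnection()) {
                conn.setAutoCommit(false);
                try (var ps = conn.prepareStatement(
                        "UPDATE subscriptions SET status = ? WHERE account_id = ?")) {
                    ps.setString(1, status);
                    ps.setString(2, accountId);
                    ps.executeUpdate();
                }
                conn.commit();
                return;
            } catch (SQLException e) {
                boolean lockConflict = e.getErrorCode() == 60 || e.getErrorCode() == 54;
                if (!lockConflict || attempt >= MAX_ATTEMPTS) {
                    throw e;
                }
                // Brief backoff before retrying the whole transaction.
                try { Thread.sleep(200L * attempt); }
                catch (InterruptedException ie) { Thread.currentThread().interrupt(); }
            }
        }
    }
}
```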
Maintainability
The team wrote a great deal of documentation to give incoming staff visibility into the system’s architecture and specific technical requirements, ensuring long-term maintainability. This ranged from conceptual diagrams to automatically generated documentation from the Java code for hundreds of batch jobs, data table schemas, and a parser tool that can read and write file data in mainframe formats.
Operational Maturity
The Times needed to optimize the job schedule to ensure maximum buffer time and spacing between critical jobs and known traffic peaks. The team also automated remediation of known low-severity issues such as server restarts, added schema automations (during primary/shadow transitions), decoupled analytical data from operational data, and implemented native-like caching for critical jobs.
What’s next?
Migrating to Amazon RDS for PostgreSQL to reduce cost while retaining performance
Moving to AWS has increased The Times’s database engine options while simultaneously reducing the operational overhead required to manage them. They are working on moving key Aristo databases, currently served by Oracle on Amazon EC2, to Amazon RDS for PostgreSQL, reducing costs while retaining the features and performance they require. The Times is using AWS Database Migration Service for this in a unique way, which we will cover in an upcoming blog post.
Conclusion
The Times’s refactoring of CIS into a Java application and its migration to AWS have proven to be a success and have introduced more options for future modernization. Key myths about scalability, reliability, and staff retention were busted as Aristo matched or exceeded the capabilities of the mainframe environment. Additional wins, such as improved integration, better observability, and optimized database performance, further validated the migration.