AWS for Industries

Downdetector Enhances Resilience with AWS Multi-Region Serverless Architecture

In today’s digital era, ensuring the resilience and availability of online services is critical. When service disruptions occur, they can have far-reaching impacts across various industries, including advertising and marketing. Downdetector helps minimize the impact on advertising campaigns, data analytics, and customer engagement initiatives, ensuring seamless digital experiences for consumers.

Downdetector by Ookla provides real-time insights into the status of services, enabling professionals in various industries to quickly identify and respond to potential disruptions. Downdetector uses a broad set of Amazon Web Services (AWS) services to act as a reliable digital first responder. This infrastructure allows Downdetector to handle large volumes of traffic and reports, scaling rapidly while processing and delivering data in near real-time. Given its role as a trusted information source during service disruptions, Downdetector’s architecture must remain resilient and highly available, even during major incidents affecting other services.

To meet these demands, Downdetector’s architecture is built on serverless principles. This approach ensures scalability and agility without the burden of managing servers. By adopting a serverless infrastructure, Downdetector maintains its commitment to resilience and reliability, ensuring continuous availability and performance under the most demanding circumstances.

The Single-Region Starting Point

Figure 1: Previous single region architecture (simplified)

Downdetector’s original architecture (Figure 1) in AWS was simple: a single-region Amazon Aurora instance coupled with an Amazon OpenSearch Service cluster. This straightforward approach had its merits, particularly in data reprocessing—once Aurora was updated, changes would propagate to OpenSearch Service.

Downdetector needed to be able to scale at a moment’s notice, never knowing when the next major service disruption might occur. This is one reason why the architecture was designed to be as serverless as possible.

AWS Lambda was chosen because it scales at a moment’s notice and is highly flexible in processing data. It is used for the ingestion pipeline, data processing, and APIs. Amazon Kinesis is used for data queues, and Amazon DynamoDB serves as a caching layer in front of Downdetector’s datastores for quick lookups at scale. Hot data (up to 24 hours) lives in an Aurora MySQL instance for historic reasons and because it is straightforward to work with for Downdetector’s engineers and data specialists.
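A caching layer placed in front of a datastore, as described above, is commonly implemented as a cache-aside lookup. The sketch below illustrates the idea under stated assumptions: the class name `CacheAside`, the loader callback, and the time-to-live value are hypothetical, and an in-memory dictionary stands in for the DynamoDB table. It is not Downdetector’s actual code.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside lookup with a time-to-live.

    Assumption: a plain dict stands in for the DynamoDB cache table, and
    `loader` stands in for a query against the slower datastore.
    """

    def __init__(
        self,
        loader: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, cached value)
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._items.get(key)
        if cached is not None and cached[0] > now:
            # Fresh entry: serve it without touching the datastore.
            return cached[1]
        # Missing or expired: load from the datastore and refresh the cache.
        value = self._loader(key)
        self._items[key] = (now + self._ttl, value)
        return value
```

A short time-to-live keeps lookups cheap during a traffic spike while bounding how stale a cached answer can become.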

All data is ingested into an OpenSearch Service cluster, which is the primary data source for querying. OpenSearch Service was chosen because it allows optimization of data for both scalability and regionality (geo-based queries), along with other querying features. It also lets Downdetector optimize data storage and availability: primary data is highly available, while older data is optimized for storage efficiency yet remains available on demand. They use warm and cold data storage, multi-index querying, and data lifecycle policies to make this transparent for their engineers and end users.
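To make the geo-based querying concrete, the sketch below builds an OpenSearch query body that filters reports by distance from a point and by recency. The field names `location` (assumed to be mapped as `geo_point`) and `reported_at`, and the function name `reports_near`, are hypothetical; the article does not describe Downdetector’s index mapping.

```python
from typing import Any, Dict


def reports_near(
    lat: float, lon: float, radius_km: float, since: str = "now-1h"
) -> Dict[str, Any]:
    """Build a query body for recent reports within `radius_km` of a point.

    Assumptions: `location` is a geo_point field and `reported_at` is a
    date field; both names are illustrative only.
    """
    return {
        "query": {
            "bool": {
                # Filters do not affect scoring and can be cached.
                "filter": [
                    {"range": {"reported_at": {"gte": since}}},
                    {
                        "geo_distance": {
                            "distance": f"{radius_km:g}km",
                            "location": {"lat": lat, "lon": lon},
                        }
                    },
                ]
            }
        }
    }
```

The returned dictionary can be passed as the request body of a search call from any OpenSearch client. Searching an index pattern rather than a single index is one way to query across several indices at once.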

AWS Fargate on Amazon ECS is used for its developer experience and ability to scale. When Downdetector compared scaling on their previous Amazon EC2 instances with AWS Fargate, the time to spin up new instances dropped from minutes to seconds.

Evolving to Multi-Region Active-Active

Figure 2: Overview of the multi-region architecture (simplified)

Downdetector’s evolution to a multi-region active-active architecture represents a strategic transformation aimed at enhancing their cloud resilience. The enhancements focus on:

  1. High Availability: With a multi-region architecture, an application is simultaneously hosted in multiple regions. Even if one region experiences downtime due to maintenance or unexpected issues, the application remains accessible in other regions, resulting in uninterrupted services for users. This high availability is crucial in retaining users and maintaining a positive user experience.
  2. Flexible Scalability: A multi-region active-active setup is highly scalable. Traffic can be distributed across all regions, reducing the load on each region and increasing the overall capacity of the application. When user demand increases in a specific area, resources in other regions can be used to balance the load and maintain performance. The elastic nature of the AWS Cloud infrastructure underpins Downdetector’s ability to expand seamlessly, delivering consistent performance through demand spikes.
  3. Quick Disaster Recovery: In case of a region-specific disaster or outage, the active-active setup allows quick recovery because the active instances in unaffected regions can take over the workload of the affected ones. This flexibility greatly reduces the impact of a disaster and ensures seamless operations.
  4. Reduced Latency: Hosting the application in multiple geographic locations reduces latency for users. They are routed to the closest available region, enhancing the speed and performance of service delivery. The AWS global network plays a pivotal role here, with content delivery optimized through the expansive AWS infrastructure to bring users and services closer together.
  5. Seamless Maintenance: In a multi-region active-active architecture, you can bring down one region for scheduled maintenance without affecting users, because requests are served by other regions. Thanks to the modular design encouraged by AWS, Downdetector’s systems are structured for agility, enabling maintenance with zero downtime for users.

Technical Innovations, Solution Architecture

To meet Downdetector’s new requirements, they evaluated their existing architecture and made significant improvements. After some testing, they found that the multi-region capability of Aurora was the best option for making their architecture scale. Doing so allowed them to go from one region to a secondary region, and it provides the opportunity to do multi-region rollouts across the globe in the future. With the global database features of Aurora, Downdetector can handle cross-regional data replication while preserving data cohesion and integrity.

In designing Downdetector’s system, they emphasized simplicity. Each region operates independently without relying on others, ensuring a streamlined setup. For data synchronization they use the write-forwarding capability of Aurora: applications in a secondary region issue writes against their local cluster, Aurora forwards those writes to the primary cluster, and the changes are then replicated to every region. This means the application handles data without needing to know where the write actually occurs. All the complex networking is managed by AWS behind the scenes, eliminating the need for complicated virtual private cloud (VPC) configurations.

Because Downdetector’s design is compatible with eventual consistency (a model where all data copies become consistent over time), they use the ‘eventual’ consistency setting in Amazon Aurora for faster data replication. This leads to quicker write times and minimizes latency. Each local region uses the existing reindexer process to synchronize data from Aurora into the local OpenSearch Service cluster, so the data will be eventually consistent across the regions.
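In Aurora MySQL, the consistency level for forwarded writes is chosen per session through the `aurora_replica_read_consistency` variable, which accepts `eventual`, `session`, or `global`. The article does not show how Downdetector applies this setting, so the helper below is a hypothetical sketch: it only builds the SQL statement an application would run right after opening a connection to a secondary cluster.

```python
VALID_LEVELS = ("eventual", "session", "global")


def read_consistency_statement(level: str = "eventual") -> str:
    """Return the SQL that selects a read-consistency level for the session.

    The function name is illustrative. The value is checked against a fixed
    allow-list, so no untrusted input ends up in the statement.
    """
    normalized = level.strip().lower()
    if normalized not in VALID_LEVELS:
        raise ValueError(
            f"unsupported consistency level {level!r}; "
            f"expected one of {', '.join(VALID_LEVELS)}"
        )
    return f"SET aurora_replica_read_consistency = '{normalized}'"
```

With a typical MySQL driver this would be executed once per new connection, for example `cursor.execute(read_consistency_statement())`. With `eventual`, a session does not wait for its own forwarded writes to replicate back before later reads, which is the trade-off an eventually consistent design accepts in exchange for lower latency.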

Disaster Recovery

Because Downdetector is running an active-active setup, if a region goes down, the Aurora cluster in another region automatically takes over as the primary cluster. All traffic then shifts over automatically because health checks against the original region fail.

The broken region is cut off from traffic while it’s recovering. The Fargate clusters in the secondary region would spin up more instances based on the higher demand, and the Lambda processing units would increase automatically—adjusting to the higher level of reports coming in.

This would keep the recovery point objective (RPO) and recovery time objective (RTO) as close to zero as possible. Some data would likely remain in the queues in the failing region, but it would be ingested automatically once the region is back online. The data is eventually made consistent between the regions. In this scenario, minimal data loss occurs. Downdetector has the ability to test this scenario. They can schedule a test scenario and leverage the Aurora Switchover feature to achieve an RPO and RTO of zero or near zero.

Deployment

Figure 3: Deployment process

For deployment, Downdetector uses AWS CodePipeline, which runs AWS CodeBuild and AWS CodeDeploy across all regions and is triggered by a push or merge to Downdetector’s code repositories. The pipelines track the main branch as well as regional branches named ‘region/[eu-west-1|us-west-2|…]’. This allows them to roll out region-specific features or updates, enabling targeted deployments that can be tested in one region without impacting overall service continuity.
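The branch convention above can be read as a small mapping from branch name to deployment targets. The sketch below is one hypothetical way to express that mapping; the function name `target_regions` is an assumption, and the region list contains only the two regions named in the text, since the full list is not given.

```python
from typing import List, Sequence

# Assumption: only the two regions named in the article are listed here.
ALL_REGIONS = ("eu-west-1", "us-west-2")
REGION_PREFIX = "region/"


def target_regions(
    branch: str, all_regions: Sequence[str] = ALL_REGIONS
) -> List[str]:
    """Map a branch name to the regions a push to it should deploy to."""
    if branch == "main":
        # The main branch rolls out everywhere.
        return list(all_regions)
    if branch.startswith(REGION_PREFIX):
        region = branch[len(REGION_PREFIX):]
        # Regional branches deploy to exactly one known region.
        return [region] if region in all_regions else []
    # Any other branch (for example a feature branch) deploys nowhere.
    return []
```

For example, `target_regions("region/eu-west-1")` yields only `eu-west-1`, so a change on that branch can be exercised in one region while the others keep running the main branch.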

Routing for Optimized Delivery

Figure 4: Screenshot of Latency map in Amazon Route 53

Amazon Route 53 traffic policies are employed to intelligently route user requests. Directing traffic to the nearest AWS region significantly reduces latency, enhancing the user experience with faster access to Downdetector’s services.
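The routing decision, combined with the health-check-driven failover described under Disaster Recovery, can be modeled in a few lines. This is a simplified illustration of the behavior, not the Route 53 API: the function name `pick_region` and the latency figures are hypothetical.

```python
from typing import Dict, Optional


def pick_region(
    latency_ms: Dict[str, float], healthy: Dict[str, bool]
) -> Optional[str]:
    """Choose the healthy region with the lowest measured latency.

    Regions without a health entry are treated as unhealthy. Returns None
    when no region is healthy.
    """
    candidates = {
        region: ms
        for region, ms in latency_ms.items()
        if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=lambda region: candidates[region])
```

With both regions healthy, a user who measures 20 ms to `eu-west-1` and 140 ms to `us-west-2` is sent to `eu-west-1`. If the health check for `eu-west-1` fails, the same call returns `us-west-2`, which is how traffic moves away from a failing region without manual intervention.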

Conclusion

The advancements in Downdetector’s architecture have created a pivotal resource for users seeking clarity during online service outages. By adopting a serverless framework, they’ve developed an architecture that responds swiftly to real-time demands and gracefully handles intense traffic spikes during major disruptions.

The shift from a single-region setup to a multi-region active-active configuration marks a significant leap forward, enhancing the robustness and responsiveness of their services. This strategic move has fortified Downdetector with increased availability, scalable infrastructure, and stronger disaster recovery processes—all while minimizing latency and streamlining maintenance efforts. A key player in this transformation has been the Amazon Aurora multi-region feature, which has allowed them to ensure that their data remains synchronized across different regions effectively.

While Downdetector aims for seamless service availability, they are prepared for every contingency. Should a regional failover be necessary, their multi-region architecture ensures that Downdetector remains the reliable platform that users can count on during outages. With AWS as the backbone, Downdetector stands ready to provide immediate and accurate service status updates, maintaining business continuity and supporting the critical needs of the advertising and marketing technology landscape.

Contact an AWS Representative to know how we can help accelerate your business.

To learn more about Downdetector use their Contact Us – General Inquires.

___

Ookla

Ookla is a global leader in connectivity intelligence that provides consumers, businesses, and other organizations with data-driven insights to improve networks and connected experiences. We help our clients efficiently solve their biggest connectivity challenges and drive forward innovation. Ookla is a division of Ziff Davis (NASDAQ: ZD), a vertically focused digital media and internet company whose portfolio includes leading brands in technology, entertainment, shopping, health, cybersecurity, and martech. Ookla’s world-renowned brands include Speedtest, Downdetector, Ekahau, RootMetrics, and more.

Sander van de Graaf

Sander van de Graaf is Principal Architect at Ookla. Sander co-founded Downdetector and has 20+ years of experience at making systems scale and engineering thrive. Since joining Ookla, he has worked on diverse technical challenges, ranging from product innovation, AI/ML, data ingestion, engineering workflows and platform design.

Pedram Jahangiri

Pedram Jahangiri is an Enterprise Solution Architect at AWS with a PhD in Computer and Electrical Engineering. With over 15 years of expertise in the Cloud, Operational Technology (OT), IT, AI/ML, and Energy industries, he has a solid history of leading technical teams and developing strategic initiatives at AWS. He is also a distinguished speaker and author, known for his contributions to cloud, energy, and AI/ML technologies.