AWS Partner Network (APN) Blog
Operational Analytics with MongoDB Atlas and Amazon Redshift
By Igor Alekseev, Partner Solutions Architect, Data and Analytics – AWS
By Babu Srinivasan, Sr. Partner Solutions Architect – MongoDB
By Vittal Pai, Sr. Partner Solutions Architect – MongoDB
Enterprises are building data analysis capabilities to obtain actionable insights from their data, develop an understanding of their business, and channel efforts towards customer centricity.
The process of extracting information and analyzing data is at the core of operational analytics. This analysis reduces large datasets to a few key summary statistics, reveals hidden patterns or relationships, or involves implementing methods of rendering that aid human interpretation.
This post explains the need for operational analytics and how it can be achieved with MongoDB Atlas and Amazon Redshift.
MongoDB is an AWS Data and Analytics Competency Partner and developer data platform company empowering innovators to unleash the power of software and data.
Modern Demands
It’s imperative for organizations to be data-driven in order to grow and succeed. Organizations often lack a data strategy and struggle to realize value from data that is locked in silos.
Modern enterprises need a more comprehensive understanding of their customers in order to make better informed business decisions. They need to constantly evolve their analytics solutions to meet these changing needs.
To cope with the volume and variety of data, extra data layers are introduced for data storage, mobile applications, database search, and data analytics. This approach can lead to multiple data copies and requires duplicated integration code to satisfy business needs.
Breaking down silos across multiple domains is an expensive task that impacts organizations’ agility by introducing increased operational overhead.
Amazon Redshift makes it fast, simple, and cost-effective to analyze data using standard SQL queries and your existing business intelligence (BI) tools. It integrates with other AWS services such as Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to take advantage of all the analytic capabilities in the Amazon Web Services (AWS) cloud.
MongoDB Atlas is a multi-cloud, developer data platform that combines transactional processing, relevance-based search, real-time analytics, and mobile-to-cloud data synchronization in an elegant and integrated architecture.
Operational Data Layer with Atlas + Amazon Redshift
Figure 1 below shows an operational data analytics warehouse comprising MongoDB Atlas and Amazon Redshift. This architecture enables businesses to meet modern data and analytics demands by providing day-to-day operational data processing and comprehensive data warehouse solutions at scale.
Organizations use Amazon Redshift to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Organizations also leverage MongoDB Atlas, a cloud-hosted database service that allows them to set up, operate, and scale a MongoDB database in the cloud.
Figure 1 – Operational data layer + enterprise data warehouse.
Business Use Cases
Most organizations need business intelligence built with their day-to-day operational data. Industries such as retail, banking and finance, telecommunications, manufacturing, and supply chain management utilize operational analytics to improve customer satisfaction and meet their business needs.
For example, the retail industry needs to track inventory, optimize omnichannel distribution and online marketing strategies, figure out optimal product bundles, manage warehouse inventory, and forecast future demand. While some of these tasks can be accomplished with basic analytics, forecasting would involve more advanced analytics leveraging artificial intelligence (AI) and machine learning (ML) models.
In the financial industry, there are similar challenges: marketing, bundling, and omnichannel management. At the same time, financial institutions have to deal with the challenges imposed by the nature of the financial products, such as reducing customer churn while trying to protect themselves against financial fraud.
In telecom, going beyond foundational analytics around billing and customer management, companies also need to deal with customer churn, product bundling, and online marketing. At the same time, newest-generation wireless broadband infrastructure relies on software-defined networks, requiring real-time advanced analytics.
The automotive industry is quickly evolving to deliver an ever-increasing number of services to car owners and passengers. Modern cars have a large array of sensors storing and analyzing data locally and remotely. Car maintenance relies on predictive analytics to proactively detect problems with various car components. Customers expect help when collisions are detected, and they expect lower insurance premiums as a reward for good driving habits. Many of these problems are solved with custom ML models.
Manufacturing relies heavily on sensors to analyze the efficiency of the production process. Modern manufacturing often involves complex supply chains which require extra scrutiny due to geopolitical and environmental concerns. Data coming from sensors helps companies decrease downtime and manage defects while increasing the yield. Having basic dashboards is no longer enough. More advanced analytics in the form of custom-built models allow companies to conduct predictive maintenance, reducing equipment failures.
Figure 2 – Business use cases.
Integration Architecture
The data from MongoDB Atlas is migrated to Amazon Redshift in the following ways:
- One-time data load
- Real-time data synchronization
One-Time Data Load
There are a number of reasons why a business may need to perform a one-time data load. For example, a company may have acquired a new business and needs to load the data from the acquired company into its own systems. Alternatively, a company may be migrating its data to a new system and needs to load all of its existing data into the new system.
In both cases, a one-time data load is necessary to ensure all of the relevant data is accurately and efficiently transferred and integrated.
Operational Data Layer (ODL) and Enterprise Data Layer (EDL) patterns are focused on supporting the needs of operational systems and providing fast and reliable access to data. The data warehouse pattern is focused on supporting the needs of BI and analytics applications and providing a foundation for reporting and analysis.
Figure 3 – One-time data load architecture.
In the diagram above, the operational data layer (ODL) is an architectural pattern in which source systems or data producers are decoupled from consuming systems by an introduction of MongoDB Atlas as an additional data layer.
ODL can hide the complexity of the legacy data sources while simplifying operational data access from consuming systems. It provides an additional level of abstraction and helps modernize legacy or siloed data sources.
The data loading is achieved using an Apache Spark-based process. The Spark framework is well integrated with Amazon EMR, AWS Glue Studio, and MongoDB Atlas.
The MongoDB Connector for Spark provides connectivity to MongoDB from Apache Spark. A Spark job can be run either on an Amazon EMR cluster or through AWS Glue Studio. The job accomplishes data transfer by connecting to a MongoDB Atlas cluster as its source and writing to an Amazon Redshift cluster as its target.
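The one-time load described above could be sketched in PySpark roughly as follows. This is a minimal illustration, not the authors' job: it assumes version 10.x of the MongoDB Connector for Spark (which registers the `mongodb` format and reads the `connection.uri`, `database`, and `collection` options) and Spark's generic JDBC writer for the Redshift side. The function names, connection strings, and table names are hypothetical, and the Redshift JDBC driver class is left to the caller because it depends on the driver version installed on the cluster.

```python
from typing import Dict, Optional


def mongo_read_options(uri: str, database: str, collection: str) -> Dict[str, str]:
    """Read options for the MongoDB Spark Connector (10.x option names assumed)."""
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
    }


def redshift_jdbc_options(
    url: str, table: str, user: str, password: str, driver: Optional[str] = None
) -> Dict[str, str]:
    """Write options for Spark's built-in JDBC data source pointed at Redshift."""
    options = {"url": url, "dbtable": table, "user": user, "password": password}
    if driver is not None:
        # Only set when the caller knows the class name of the installed driver.
        options["driver"] = driver
    return options


def run_one_time_load(spark, read_opts: Dict[str, str], write_opts: Dict[str, str]) -> None:
    """Copy one MongoDB collection into one Redshift table.

    `spark` is an existing SparkSession (for example, the one provided by
    an Amazon EMR step or an AWS Glue job).
    """
    df = spark.read.format("mongodb").options(**read_opts).load()
    # "append" assumes the target table was created ahead of time.
    df.write.format("jdbc").options(**write_opts).mode("append").save()
```

Nested MongoDB fields arrive in Spark as struct or array columns, so a real job would typically select or flatten columns between the read and the write. In practice the credentials would come from a secrets store rather than literals.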
Real-Time Data Synchronization
Real-time data synchronization needs to happen immediately following the one-time load process. This can be achieved in multiple ways, as shown below.
Figure 4 – Change stream data load architecture.
MongoDB Atlas provides the change data capture (CDC) feature to track all changes. The data is first extracted to an Amazon Simple Storage Service (Amazon S3) bucket, and then it’s either transformed and loaded to Redshift or directly loaded to Redshift as staging data.
- Extract: Data can be extracted from Atlas by two methods: using Amazon Managed Streaming for Apache Kafka (Amazon MSK), or using Atlas Application Services and Atlas Data Federation. Both methods are explained in detail below.
- Transform and load: Using AWS Glue Studio, data from the S3 bucket are transformed and loaded to Redshift.
- Load and transform: Using Amazon Redshift Spectrum, data can be loaded to Redshift as an external table. Upon getting the data into Redshift, it’s transformed to the required template.
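Whichever path is chosen, the transform step usually has to reshape nested MongoDB documents into flat rows that fit relational columns. The sketch below shows one simple way to do that in plain Python. The function name, the underscore separator, and the sample order document are illustrative assumptions, not part of the original solution.

```python
from typing import Any, Dict


def flatten_document(doc: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested sub-documents into a single-level dict of column -> value.

    Arrays are left untouched; they could be stored in a semi-structured
    column (such as Redshift's SUPER type) or exploded into a child table.
    """
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into the sub-document, prefixing its keys with the parent path.
            row.update(flatten_document(value, column, sep))
        else:
            row[column] = value
    return row


order = {
    "_id": "o-1",
    "customer": {"id": 7, "address": {"city": "Pune"}},
    "items": [{"sku": "a"}],
}
print(flatten_document(order))
```

For the sample order, the nested `customer.address.city` field becomes a `customer_address_city` column, while the `items` array is passed through unchanged.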
Below are a few models for the extract, transform, and load (ETL) process.
ETL Data Using AWS Glue and Amazon MSK
The diagram below shows how consumer applications generate operational data through internal apps, customer-facing services, and APIs.
The CDC feature in MongoDB Atlas captures the data changes, which are written to the S3 bucket through Amazon MSK.
Figure 5 – Change stream data load architecture with Amazon MSK and AWS Glue.
MongoDB Atlas is integrated with Amazon MSK for capturing the change data streams. S3 buckets are used to store the changed data stream and become the source for the subsequent AWS Glue jobs.
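To make the downstream step concrete, the following sketch shows how a job might collapse a batch of change events into the rows to upsert and the keys to delete before merging into Redshift. It relies on the standard change event fields `operationType`, `documentKey`, and `fullDocument`. The function name is hypothetical, and the logic assumes that events arrive in order and that update events carry `fullDocument` (which requires the change stream to be configured for update lookup).

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def collapse_change_events(
    events: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Any]]:
    """Reduce ordered change events to the latest state per document.

    Returns (documents to upsert, _id values to delete).
    """
    latest: Dict[Any, Optional[Dict[str, Any]]] = {}
    for event in events:
        key = event["documentKey"]["_id"]
        op = event["operationType"]
        if op == "delete":
            latest[key] = None  # None marks the document as deleted
        elif op in ("insert", "replace", "update"):
            doc = event.get("fullDocument")
            if doc is not None:
                latest[key] = doc
        # Other event types (for example, collection drops) are ignored here.
    upserts = [doc for doc in latest.values() if doc is not None]
    deleted_ids = [key for key, doc in latest.items() if doc is None]
    return upserts, deleted_ids
```

Collapsing events first means each document is written to the staging table at most once per batch, which keeps the merge into the warehouse simple.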
AWS Glue is a serverless data integration service that’s used to discover, prepare, and combine data, and to load it into Redshift for analytics and machine learning (ML). Redshift analyzes data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning to deliver the best price-performance at any scale.
Amazon Redshift ML is used to create, train, and apply machine learning models using familiar SQL commands in Redshift data warehouses. An Amazon SageMaker model is called using the SQL commands. This method leverages the auto scaling features of Amazon MSK and AWS Glue to support enterprise-level workloads.
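As an illustration of the SQL involved, the sketch below builds a Redshift ML `CREATE MODEL` statement and a matching prediction query as strings. The helper names, the churn example, and all identifiers are hypothetical. The statement uses the basic `CREATE MODEL ... FROM ... TARGET ... FUNCTION ... IAM_ROLE ... SETTINGS (S3_BUCKET ...)` form, and the inputs are assumed to be trusted identifiers rather than user input.

```python
from typing import Sequence


def create_model_sql(
    model: str,
    source_table: str,
    target_column: str,
    function_name: str,
    iam_role_arn: str,
    s3_bucket: str,
) -> str:
    """Build a basic Redshift ML CREATE MODEL statement."""
    return (
        f"CREATE MODEL {model}\n"
        f"FROM {source_table}\n"
        f"TARGET {target_column}\n"
        f"FUNCTION {function_name}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )


def prediction_sql(function_name: str, feature_columns: Sequence[str], source_table: str) -> str:
    """Build a query that applies the trained model through its SQL function."""
    cols = ", ".join(feature_columns)
    return f"SELECT {function_name}({cols}) AS prediction FROM {source_table};"


print(create_model_sql(
    "churn_model", "customer_features", "churned",
    "predict_churn", "arn:aws:iam::123456789012:role/ExampleRedshiftML", "example-ml-bucket",
))
print(prediction_sql("predict_churn", ["age", "plan", "monthly_spend"], "customers"))
```

Once the model has been trained, the generated function can be called like any other SQL function in queries and dashboards.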
ETL Data Using AWS Glue and Atlas Triggers
The data from MongoDB Atlas are stored in an S3 bucket using the Atlas Application Services and Triggers. AWS Glue jobs and Atlas Triggers are used to migrate the data from the S3 bucket to Redshift, as shown below.
Figure 6 – Change stream data load architecture with AWS Glue and MongoDB Atlas Triggers.
Atlas Data Federation is used to consolidate the different siloed data into a single destination. The $out operator of MongoDB is utilized to write the data from MongoDB Atlas to the S3 bucket.
Atlas Triggers are used to trigger the federated query, which writes the Atlas cluster data to S3 on a defined frequency (every minute, for example).
This is a cost-effective method for comparatively small data volumes, and it meets the needs for complete traceability and backup of the data. The events can be re-triggered in case of failures in the system.
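The federated query run by the trigger can be pictured as an aggregation pipeline that ends in a `$out` stage targeting S3. Atlas Trigger functions themselves are written in JavaScript; the sketch below only models the pipeline as a Python structure, as it might be passed to a driver's `aggregate` call against a federated database instance. The `updatedAt` field, the bucket, the key prefix, and the function name are hypothetical, and the shape of the `s3` options (`bucket`, `region`, `filename`, `format`) follows the Atlas Data Federation documentation as best understood here, so verify it before use.

```python
from datetime import datetime
from typing import Any, Dict, List


def build_s3_export_pipeline(
    since: datetime, until: datetime, bucket: str, region: str, prefix: str
) -> List[Dict[str, Any]]:
    """Pipeline that exports documents changed in [since, until) to S3."""
    return [
        # Assumes each document carries a hypothetical `updatedAt` timestamp.
        {"$match": {"updatedAt": {"$gte": since, "$lt": until}}},
        {
            "$out": {
                "s3": {
                    "bucket": bucket,
                    "region": region,
                    # One key prefix per run keeps exports traceable and replayable.
                    "filename": f"{prefix}/{until:%Y/%m/%d/%H%M}/",
                    "format": {"name": "parquet"},
                }
            }
        },
    ]


pipeline = build_s3_export_pipeline(
    datetime(2023, 1, 2, 3, 3), datetime(2023, 1, 2, 3, 4),
    "example-bucket", "us-east-1", "changes",
)
print(pipeline[1]["$out"]["s3"]["filename"])
```

Writing each run under a time-based prefix is one way to get the traceability described above: a failed window can be re-exported without touching other runs.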
Amazon Redshift Spectrum – External Tables
In this method, Amazon Redshift Spectrum is used in place of AWS Glue.
Amazon Redshift Spectrum is used to query and retrieve structured and semi-structured data from files in S3 without having to load the data into Redshift tables. Amazon Redshift Spectrum queries employ massive parallelism to run very fast against large datasets.
This method can be used in scenarios where the data needs to be managed externally to Redshift, as shown in the diagram below. For example, this happens in data lakes where data is accessed by multiple analytics engines, with Redshift Spectrum being one of them.
Another advantage is that storage and compute are decoupled, which allows large volumes of data to be stored cost-efficiently while retaining the ability to run ad-hoc queries. This is common in industries with long regulatory retention requirements. Generally speaking, this pattern applies when high-volume, low-value data needs to be stored and accessed periodically.
Figure 7 – Change stream data load architecture with Amazon Redshift Spectrum.
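For reference, exposing S3 files to Redshift Spectrum typically takes two statements: one that creates an external schema backed by a data catalog, and one that defines an external table over an S3 location. The sketch below generates both as strings. The helper names, schema, table, columns, role ARN, and bucket are hypothetical, and Parquet is assumed as the file format.

```python
from typing import Sequence, Tuple


def external_schema_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(
    schema: str, table: str, columns: Sequence[Tuple[str, str]], s3_location: str
) -> str:
    """Build a CREATE EXTERNAL TABLE statement over Parquet files in S3."""
    cols = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"{cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


print(external_schema_sql("spectrum", "atlas_changes", "arn:aws:iam::123456789012:role/ExampleSpectrum"))
print(external_table_sql(
    "spectrum", "order_changes",
    [("order_id", "varchar(64)"), ("total", "double precision")],
    "s3://example-bucket/changes/",
))
```

After these statements run, the external table can be queried in place or used as the source of an `INSERT INTO ... SELECT` that transforms the data into a local Redshift table.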
Business Analytics
Redshift can easily be integrated with the leading analytics tools like Amazon QuickSight, Power BI, Tableau, and more.
Machine learning models can be applied easily through Amazon Redshift ML, which simplifies training models created through Amazon SageMaker on the operational data using SQL queries.
Summary
By combining MongoDB Atlas for its operational efficiency with Amazon Redshift for its data warehousing excellence, we can develop operational analytics for any type of business need.
This solution can be extended to integrate with extensive artificial intelligence and machine learning needs through the use of Amazon SageMaker.
To learn more, refer to the Atlas_to_Redshift GitHub repository for step-by-step instructions and sample code. You can also try MongoDB Atlas for free on AWS Marketplace.