AWS Big Data Blog

Category: Advanced (300)

Migrate Delta tables from Azure Data Lake Storage to Amazon S3 using AWS Glue

Organizations are increasingly using a multi-cloud strategy to run their production workloads. We often see requests from customers who have started their data journey by building data lakes on Microsoft Azure, to extend access to the data to AWS services. Customers want to use a variety of AWS analytics, data, AI, and machine learning (ML) […]

Evaluating sample Amazon Redshift data sharing architecture using Redshift Test Drive and advanced SQL analysis

In this post, we walk you through the process of testing workload isolation architecture using Amazon Redshift Data Sharing and Test Drive utility. We demonstrate how you can use SQL for advanced price performance analysis and compare different workloads on different target Redshift cluster configurations.

Accelerate data integration with Salesforce and AWS using AWS Glue

To meet the demands of diverse data integration use cases, AWS Glue now supports SaaS connectivity for Salesforce. This enables users to quickly preview and transfer their customer relationship management (CRM) data, fetch the schema dynamically on request, and query the data. This post explores the new Salesforce connector for AWS Glue and demonstrates how to build a modern extract, transform, and load (ETL) pipeline with AWS Glue ETL scripts.

How Kaplan, Inc. implemented modern data pipelines using Amazon MWAA and Amazon AppFlow with Amazon Redshift as a data warehouse

Kaplan, Inc. provides individuals, educational institutions, and businesses with a broad array of services, supporting our students and partners to meet their diverse and evolving needs throughout their educational and professional journeys. In this post, we discuss how the Kaplan data engineering team implemented data integration from the Salesforce application to Amazon Redshift. The solution uses Amazon Simple Storage Service as a data lake, Amazon Redshift as a data warehouse, Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as an orchestrator, and Tableau as the presentation layer.

Optimize cost and performance for Amazon MWAA

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With Amazon MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without managing the operational burden of scaling the infrastructure. In this post, we provide guidance […]

Embed Amazon OpenSearch Service dashboards in your application

Customers across diverse industries rely on Amazon OpenSearch Service for interactive log analytics, real-time application monitoring, website search, vector database, deriving meaningful insights from data, and visualizing these insights using OpenSearch Dashboards. Additionally, customers often seek out capabilities that enable effortless sharing of visual dashboards and seamless embedding of these dashboards within their applications, further […]

DataZone High Level Architecture

Implement data quality checks on Amazon Redshift data assets and integrate with Amazon DataZone

In this post, we show how to capture the data quality metrics for data assets produced in Amazon Redshift. With Amazon DataZone, the data owner can directly import the technical metadata of a Redshift database table and views to the Amazon DataZone project’s inventory. As these data assets gets imported into Amazon DataZone, it bypasses the AWS Glue Data Catalog, creating a gap in data quality integration. This post proposes a solution to enrich the Amazon Redshift data asset with data quality scores and KPI metrics.

Build a serverless data quality pipeline using Deequ on AWS Lambda

Poor data quality can lead to a variety of problems, including pipeline failures, incorrect reporting, and poor business decisions. For example, if data ingested from one of the systems contains a high number of duplicates, it can result in skewed data in the reporting system. To prevent such issues, data quality checks are integrated into […]

Amazon Managed Service for Apache Flink State Transition

Improve the resilience of Amazon Managed Service for Apache Flink application with system-rollback feature

This post explores how to use the system-rollback feature in Managed Service for Apache Flink.We discuss how this functionality improves your application’s resilience by providing a highly available Flink application. Through an example, you will also learn how to use the APIs to have more visibility of the application’s operations.

Use AWS Glue to streamline SFTP data processing

In this blog post, we explore how to use the SFTP Connector for AWS Glue from the AWS Marketplace to efficiently process data from Secure File Transfer Protocol (SFTP) servers into Amazon Simple Storage Service (Amazon S3), further empowering your data analytics and insights.