Advanced (300) | AWS Big Data Blog

Use trusted identity propagation for Apache Spark interactive sessions in Amazon SageMaker Unified Studio

In this post, we provide step-by-step instructions to set up Amazon EMR on EC2, EMR Serverless, and AWS Glue within SageMaker Unified Studio, enabled with trusted identity propagation. We use the setup to illustrate how different IAM Identity Center users can run their Spark sessions, using each compute setup, within the same project in SageMaker Unified Studio. We show how each user sees only the tables, or parts of tables, that they have been granted access to in Lake Formation.

Accelerate data governance with custom subscription workflows in Amazon SageMaker

Organizations need to efficiently manage data assets while maintaining governance controls in their data marketplaces. Although manual approval workflows remain important for sensitive datasets and production systems, there’s an increasing need for automated approval processes for less sensitive datasets. In this post, we show you how to automate subscription request approvals within SageMaker, accelerating data access for data consumers.

Implement fine-grained access control for Iceberg tables using Amazon EMR on EKS integrated with AWS Lake Formation

On February 6, 2025, AWS introduced fine-grained access control (FGAC) based on AWS Lake Formation for EMR on EKS, available in Amazon EMR 7.7 and later versions. You can now significantly enhance your data governance and security frameworks using this feature. In this post, we demonstrate how to implement FGAC on Apache Iceberg tables using EMR on EKS with Lake Formation.

Unlock real-time data insights with schema evolution using Amazon MSK Serverless, Iceberg, and AWS Glue streaming

This post showcases a solution that businesses can use to access real-time data insights without the traditional delays between data creation and analysis. By combining Amazon MSK Serverless, Debezium MySQL connector, AWS Glue streaming, and Apache Iceberg tables, the architecture captures database changes instantly and makes them immediately available for analytics through Amazon Athena. A standout feature is the system’s ability to automatically adapt when database structures change—such as adding new columns—without disrupting operations or requiring manual intervention.

Upgrade from Amazon Redshift DC2 node type to Amazon Redshift Serverless

In this post, we show you the upgrade process from DC2 instances to Amazon Redshift Serverless. By using Amazon Redshift Serverless, you can run and scale analytics without managing data warehouse infrastructure.

Stifel’s approach to scalable Data Pipeline Orchestration in Data Mesh

Stifel Financial Corp, a diversified financial services holding company, is expanding its data landscape and needs an orchestration solution capable of managing increasingly complex data pipeline operations across multiple business domains. Traditional time-based scheduling systems fall short in addressing the dynamic interdependencies between data products, which call for event-driven orchestration. Key challenges include coordinating cross-domain dependencies, maintaining data consistency across business units, meeting stringent SLAs, and scaling effectively as data volumes grow. Without a flexible orchestration solution, these issues can lead to delayed business operations and insights, increased operational overhead, and heightened compliance risks due to manual interventions and rigid scheduling mechanisms that cannot adapt to evolving business needs. In this post, we walk through how Stifel Financial Corp, in collaboration with AWS ProServe, has addressed these challenges by building a modular, event-driven orchestration solution using AWS native services that enables precise triggering of data pipelines based on dependency satisfaction, supporting near real-time responsiveness and cross-domain coordination.

Automate email notifications for governance teams working with Amazon SageMaker Catalog

In this post, we show you how to create custom notifications for events occurring in SageMaker Catalog using Amazon EventBridge, AWS Lambda, and Amazon SNS. You can expand this solution to automatically integrate SageMaker Catalog with in-house enterprise workflow tools like ServiceNow and Helix.

Configure seamless single sign-on with SQL analytics in Amazon SageMaker Unified Studio

This post demonstrates how to configure SageMaker Unified Studio with SSO, set up projects and user onboarding, and access data securely using integrated analytics tools.

Best practices for upgrading from Amazon Redshift DC2 to RA3 and Amazon Redshift Serverless

As analytical demands grow, many customers are upgrading from DC2 to RA3 or Amazon Redshift Serverless, which offer independent compute and storage scaling, along with advanced capabilities such as data sharing, zero-ETL integration, and built-in artificial intelligence and machine learning (AI/ML) support with Amazon Redshift ML. This post provides a practical guide to plan your target architecture and migration strategy, covering upgrade options, key considerations, and best practices to facilitate a successful and seamless transition.

Building a real-time ICU patient analytics pipeline with AWS Lambda event source mapping

In this post, we demonstrate how to build a serverless architecture that processes real-time ICU patient monitoring data using Lambda event source mapping for immediate alert generation and data aggregation, followed by persistent storage in Amazon S3 with an Iceberg catalog for comprehensive healthcare analytics.
