AWS Big Data Blog
Category: Technical How-to
Enhance governance with metadata enforcement rules in Amazon SageMaker
Amazon SageMaker Catalog now supports metadata rules, which allow organizations to enforce metadata standards across data publishing and subscription workflows. In this post, we guide you through two workflows, each of which begins with setting up metadata enforcement rules for a specific domain: publishing an asset or data product in a catalog, and subscribing to an asset or data product that is owned by a project within that domain.
Build multi-Region resilient Apache Kafka applications with identical topic names using Amazon MSK and Amazon MSK Replicator
This post explains how to use MSK Replicator for cross-cluster data replication and details the failover and failback processes while keeping the same topic name across Regions.
Connect, share, and query where your data sits using Amazon SageMaker Unified Studio
In this post, we demonstrate how business units can use Amazon SageMaker Unified Studio to discover, subscribe to, and analyze distributed data assets. Through this unified query capability, you can create comprehensive insights into customer transaction patterns and purchase behavior for active products without the traditional barriers of data silos or the need to copy data between systems.
Architect fault-tolerant applications with instance fleets on Amazon EMR on EC2
In this post, we show how to optimize capacity by analyzing EMR workloads and implementing strategies tailored to your workload patterns. We walk through assessing the historical compute usage of a workload and use a combination of strategies to reduce the likelihood of InsufficientCapacityExceptions (ICE) when Amazon EMR launches specific EC2 instance types. We implement flexible instance fleet strategies to reduce dependency on specific instance types and use Amazon EC2 On-Demand Capacity Reservations (ODCRs) for predictable, steady-state workloads. Following this approach can help prevent job failures due to capacity limits while optimizing your cluster for cost and performance.
Develop and test AWS Glue 5.0 jobs locally using a Docker container
In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post updates the earlier post "Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container" for AWS Glue 5.0.
Cross-account data collaboration with Amazon DataZone and AWS analytical tools
In this post, we will cover how you can use Amazon DataZone to facilitate data collaboration between AWS accounts.
Design patterns for implementing Hive Metastore for Amazon EMR on EKS
In this post, we explore the design patterns for implementing the Hive Metastore (HMS) with EMR on EKS with Spark Operator, each offering distinct advantages depending on your requirements. Whether you choose to deploy HMS as a sidecar container within the Apache Spark Driver pod, or as a Kubernetes deployment in the data processing EKS cluster, or as an external HMS service in a separate EKS cluster, the key considerations revolve around communication efficiency, scalability, resource isolation, high availability, and security.
Governing streaming data in Amazon DataZone with the Data Solutions Framework on AWS
In this post, we explore how AWS customers can extend Amazon DataZone to support streaming data such as Amazon Managed Streaming for Apache Kafka (Amazon MSK) topics. Developers and DevOps managers can use Amazon MSK, a popular streaming data service, to run Kafka applications and Kafka Connect connectors on AWS without becoming experts in operating Apache Kafka.
Amazon Prime Video advances search for sports using Amazon OpenSearch Service
In this post, we will walk you through how Prime Video used Amazon OpenSearch Service and its AI and machine learning (AI/ML) capabilities to build a more intuitive and enhanced sports search experience.
Enhance your workload resilience with new Amazon EMR instance fleet features
Amazon EMR has introduced new features for instance fleets that address critical challenges in big data operations. This post explores how these innovations improve cluster resilience, scalability, and efficiency, enabling you to build more robust data processing architectures on AWS.